In my previous job, I had a basic grasp of writing a SQL query, but I was never quite comfortable with “advanced” queries. (By “advanced”, it’s more like intermediate at best — it’s the nuances of joins, group-bys, having vs. where.) I was told that whatever SQL I didn’t know would be “easy” to pick up and would happen naturally, although in practice that never quite happened. It wasn’t until I started to come up with a system for solving interview-style programming problems that I started to similarly come up with a system for writing any SQL query. The following is the result, which is less of a “tutorial” for “beginner SQL” and more of a systematic process for constructing a SQL query:
Here are my sequential steps for writing SQL queries in a somewhat methodical way. YMMV.
Note: ‘key’, ‘column’, and ‘field’ are used interchangeably.
For the example queries, I will reference the public tables available in GCP BigQuery (disclaimer: I’m a Google employee). BQ is convenient for providing a common yet tangible example here, while also being really good for many ad-hoc / OLAP-style use cases. The SQL will be in BigQuery’s Legacy SQL syntax.
The tables used are a table with population and gender information in the US in 2010 by zip code (`bigquery-public-data.census_bureau_usa.population_by_zip_2010`) and a table that gives basic geographic attributes of US zip code regions (`bigquery-public-data.utility_us.zipcode_area`).
For the examples below, let us say that we are interested in finding out certain types of population information by state.
Step # | Description | Example |
---|---|---|
1 | Determine which tables you will select from | We have
population_by_zip_2010 giving population information by zip code, and zipcode_area giving geo-info by zip code |
2 | If there are joins required between tables, determine:
| Let us say that we want to find the number of men and women per state.
To do that, we need to join the two tables (population by zip) and (state/geo-info by zip) together, with zip code being the obvious key to use to join them. In the population by zip table, the zip code field is called “zipcode”, and in the state/geo-info by zip table, the zip code field is also called “zipcode”. For now, we have just a query fragment: ... FROM [bigquery-public-data:census_bureau_usa.population_by_zip_2010] POP JOIN [bigquery-public-data:utility_us.zipcode_area] GEO ON POP.zipcode = GEO.zipcode ... |
3 | Determine:
| Let’s say we filter the results to only see zip codes in the state of North Carolina. We can add a WHERE clause:
... FROM [bigquery-public-data:census_bureau_usa.population_by_zip_2010] POP JOIN [bigquery-public-data:utility_us.zipcode_area] GEO ON POP.zipcode = GEO.zipcode WHERE GEO.state_code = 'NC' ... Let’s simultaneously consider another question — how many people are there in each state? There are multiple zip codes that belong to the same state. We want to gather all of the rows (zip codes) belonging to the same state and look at them together, and do this for each unique state. That’s what GROUP BY is for. The arguments to GROUP BY are columns where each unique combination of values of those columns constitutes a new grouping. In our case, we will only provide state_code as an argument, so we look at groups for each unique value of state_code: ... FROM [bigquery-public-data:census_bureau_usa.population_by_zip_2010] POP JOIN [bigquery-public-data:utility_us.zipcode_area] GEO ON POP.zipcode = GEO.zipcode GROUP BY GEO.state_code ... |
4 | Determine:
| Let’s say we want to list all of the information for North Carolina zip codes ordered by the zip code, and within the same zip code, ordered by the name of the gender in the population data row:
... FROM [bigquery-public-data:census_bureau_usa.population_by_zip_2010] POP JOIN [bigquery-public-data:utility_us.zipcode_area] GEO ON POP.zipcode = GEO.zipcode WHERE GEO.state_code = 'NC' ORDER BY GEO.zipcode, POP.gender ... In our other example, let’s imagine that we will eventually sum up the population within each state for both genders, but we only want to keep those states & their total populations if that total population exceeds 10 million: ... FROM [bigquery-public-data:census_bureau_usa.population_by_zip_2010] POP JOIN [bigquery-public-data:utility_us.zipcode_area] GEO ON POP.zipcode = GEO.zipcode GROUP BY GEO.state_code HAVING total_pop > 10000000 ... |
5 | Determine which columns go in the SELECT portion of the query. Any columns that are not used by GROUP BY to form groups should have an aggregator function applied to them. | What fields do we want from North Carolina zip codes? Let’s say, the zip code, gender, population for that gender,
SELECT POP.zipcode AS zipcode, POP.gender AS gender, POP.population AS zip_gender_pop, GEO.state_code AS state_code, GEO.city AS city FROM [bigquery-public-data:census_bureau_usa.population_by_zip_2010] POP JOIN [bigquery-public-data:utility_us.zipcode_area] GEO ON POP.zipcode = GEO.zipcode WHERE GEO.state_code = 'NC' ORDER BY GEO.zipcode, POP.gender And in our other example, we want to find the total population across both genders and across the multiple entries even for the same gender within a single zip code. So we have to apply an aggregator function to the population field in order to achieve that result. Since the whole point of GROUP BY is to enable rollups, SELECT GEO.state_code, SUM(POP.population) AS total_pop FROM [bigquery-public-data:census_bureau_usa.population_by_zip_2010] POP JOIN [bigquery-public-data:utility_us.zipcode_area] GEO ON POP.zipcode = GEO.zipcode GROUP BY GEO.state_code HAVING total_pop > 10000000 |
In SQL, each of the different clauses of the entire statement have to go in a particular order. But that’s easy to look up, so I did not spend time above covering that topic.
In BigQuery in particular, for basic operations, there is not much of a difference between its Standard SQL syntax and Legacy SQL syntax except that tables in Legacy SQL are written like [<bq-project>:<bq-proj-dataset>.<bq-proj-dataset-tablename>] but tables in Standard SQL are written like `<bq-project>.<bq-proj-dataset>.<bq-proj-dataset-tablename>`.
In BigQuery, you may write a query that results in too much data for BigQuery’s UI to comfortably hold onto for display purposes. One way to resolve that is to find ways to aggregate/rollup the data into a coarser granularity, so there will be fewer rows in the output, and thus BQ should be able to optimize some of its work that will allow the results to be displayed in the console/UI. But if storing that large output is necessary, or usefully avoids repeating long, costly work, then you can store those large results into a new table using the BQ command line tool.