Categories
general programming

Tips for Writing SQL Queries

In my previous job, I had a basic grasp of writing a SQL query, but I was never quite comfortable with “advanced” queries. (By “advanced”, it’s more like intermediate at best — it’s the nuances of joins, group-bys, having vs. where.) I was told that whatever SQL I didn’t know would be “easy” to pick up and would happen naturally, although in practice that never quite happened. It wasn’t until I started to come up with a system for solving interview-style programming problems that I started to similarly come up with a system for writing any SQL query. The following is the result, which is less of a “tutorial” for “beginner SQL” and more of a systematic process for constructing a SQL query:

Here are my sequential steps for writing SQL queries in a somewhat methodical way. YMMV.
Note: ‘key’, ‘column’, and ‘field’ are used interchangeably.

For the example queries, I will reference the public tables available in GCP BigQuery (disclaimer: I’m a Google employee). BQ is convenient for providing a common yet tangible example here, while also being really good for many ad-hoc / OLAP-style use cases. The SQL will be in BigQuery’s Legacy SQL syntax.

The tables used are a table with population and gender information in the US in 2010 by zip code (`bigquery-public-data.census_bureau_usa.population_by_zip_2010`) and a table that gives basic geographic attributes of US zip code regions (`bigquery-public-data.utility_us.zipcode_area`).

For the examples below, let us say that we are interested in finding out certain types of population information by state.

Step #DescriptionExample
1Determine which tables you will select fromWe have

 

population_by_zip_2010

giving population information by zip code, and

zipcode_area

giving geo-info by zip code

2If there are joins required between tables, determine:

 

  • What type of join is appropriate
  • Which key(s) to join on
Let us say that we want to find the number of men and women per state.

 

To do that, we need to join the two tables (population by zip) and (state/geo-info by zip) together, with zip code being the obvious key to use to join them. In the population by zip table, the zip code field is called “zipcode”, and in the state/geo-info by zip table, the zip code field is also called “zipcode”. For now, we have just a query fragment:

...
FROM
  [bigquery-public-data:census_bureau_usa.population_by_zip_2010] POP
JOIN
  [bigquery-public-data:utility_us.zipcode_area] GEO
ON
  POP.zipcode = GEO.zipcode
...
3Determine:

 

  • Any initial filtering of tables through WHERE
  • If aggregation required thereafter, which key(s) to use in forming groups/buckets in GROUP BY
Let’s say we filter the results to only see zip codes in the state of North Carolina. We can add a WHERE clause:

 

...
FROM
  [bigquery-public-data:census_bureau_usa.population_by_zip_2010] POP
JOIN
  [bigquery-public-data:utility_us.zipcode_area] GEO
ON
  POP.zipcode = GEO.zipcode
WHERE
  GEO.state_code = 'NC'
...

Let’s simultaneously consider another question — how many people are there in each state? There are multiple zip codes that belong to the same state. We want to gather all of the rows (zip codes) belonging to the same state and look at them together, and do this for each unique state. That’s what GROUP BY is for. The arguments to GROUP BY are columns where each unique combination of values of those columns constitutes a new grouping. In our case, we will only provide state_code as an argument, so we look at groups for each unique value of state_code:

...
FROM
  [bigquery-public-data:census_bureau_usa.population_by_zip_2010] POP
JOIN
  [bigquery-public-data:utility_us.zipcode_area] GEO
ON
  POP.zipcode = GEO.zipcode
GROUP BY
  GEO.state_code
...
4Determine:

 

  • To order/sort the results, use ORDER BY
  • To filter based on values after aggregation, use HAVING
Let’s say we want to list all of the information for North Carolina zip codes ordered by the zip code, and within the same zip code, ordered by the name of the gender in the population data row:

 

...
FROM
  [bigquery-public-data:census_bureau_usa.population_by_zip_2010] POP
JOIN
  [bigquery-public-data:utility_us.zipcode_area] GEO
ON
  POP.zipcode = GEO.zipcode
WHERE
  GEO.state_code = 'NC'
ORDER BY
  GEO.zipcode, POP.gender
...

In our other example, let’s imagine that we will eventually sum up the population within each state for both genders, but we only want to keep those states & their total populations if that total population exceeds 10 million:

...
FROM
  [bigquery-public-data:census_bureau_usa.population_by_zip_2010] POP
JOIN
  [bigquery-public-data:utility_us.zipcode_area] GEO
ON
  POP.zipcode = GEO.zipcode
GROUP BY
  GEO.state_code
HAVING
  total_pop > 10000000
...
5Determine which columns go in the SELECT portion of the query. Any columns that are not used by GROUP BY to form groups should have an aggregator function applied to them.

What fields do we want from North Carolina zip codes? Let’s say, the zip code, gender, population for that gender,
the state code (just to reassure ourselves we only see ‘NC’). And maybe, because we can, let’s include city.

 

SELECT
   POP.zipcode AS zipcode,
   POP.gender AS gender,
   POP.population AS zip_gender_pop,
   GEO.state_code AS state_code,
   GEO.city AS city
 FROM
   [bigquery-public-data:census_bureau_usa.population_by_zip_2010] POP
 JOIN
   [bigquery-public-data:utility_us.zipcode_area] GEO
 ON
   POP.zipcode = GEO.zipcode
 WHERE
   GEO.state_code = 'NC'
 ORDER BY
   GEO.zipcode, POP.gender

And in our other example, we want to find the total population across both genders and across the multiple entries even for the same gender within a single zip code. So we have to apply an aggregator function to the population field in order to achieve that result. Since the whole point of GROUP BY is to enable rollups,
some columns are used to determine the grouping, and the remaining columns may take on different values in the rows belonging to each group. We have to end up with just one row (because we’re rolling up! — that is, we’re defining how we aggregate a lot of data points into fewer points), so an aggregation function has to be applied. The one we care about is SUM. But SQL prevents you from getting mixed up — any columns used for defining groups cannot be used as arguments in the aggregator functions.

SELECT
   GEO.state_code,
   SUM(POP.population) AS total_pop
 FROM
   [bigquery-public-data:census_bureau_usa.population_by_zip_2010] POP
 JOIN
   [bigquery-public-data:utility_us.zipcode_area] GEO
 ON
   POP.zipcode = GEO.zipcode
 GROUP BY
   GEO.state_code
 HAVING
   total_pop > 10000000

In SQL, each of the different clauses of the entire statement have to go in a particular order. But that’s easy to look up, so I did not spend time above covering that topic.

In BigQuery in particular, for basic operations, there is not much of a difference between its Standard SQL syntax and Legacy SQL syntax except that tables in Legacy SQL are written like [<bq-project>:<bq-proj-dataset>.<bq-proj-dataset-tablename>] but tables in Standard SQL are written like `<bq-project>.<bq-proj-dataset>.<bq-proj-dataset-tablename>`.

In BigQuery, you may write a query that results in too much data for BigQuery’s UI to comfortably hold onto for display purposes. One way to resolve that is to find ways to aggregate/rollup the data into a coarser granularity, so there will be fewer rows in the output, and thus BQ should be able to optimize some of its work that will allow the results to be displayed in the console/UI. But if storing that large output is necessary, or usefully avoids repeating long, costly work, then you can store those large results into a new table using the BQ command line tool.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s