09 Apr 2018 Introducing column based partitioning in BigQuery
When we started using Google BigQuery – almost five years ago now – it didn’t have any partitioning functionality built into it. Heck, queries cost $20 p/TB back then too for goodness’ sake! To compensate for this lack of functionality and to save costs, we had to manually shard our tables using the well known
_YYYYMMDD suffix pattern just like everyone else. This works fine, but it’s quite cumbersome, has some hard limits, and your SQL can quickly becomes unruly.
Then about a year ago, the BigQuery team released ingestion time partitioning. This allowed users to partition tables based on the load/arrival time of the data, or by explicitly stating the partition to load the data into (using the
$ syntax). By using the
_PARTITIONTIME pseudo-column, users were more easily able to craft their SQL, and save costs by only addressing the necessary partition(s). It was a major milestone for the BigQuery engineering team, and we were quick to adopt it into our data pipelines. We rejoiced and gave each other a lot of high-fives.
However, as welcoming as this new feature was at the time, it still came with a few obstacles. Firstly, there was no easy way to convert an exisiting non-sharded suffix’y table (yup, sadly we do have a few of them in our projects) to a ingestion based partitioned table. See this awesome Stack Overflow thread for more details on that!
Secondly, because it was based on ingestion/load time, we still needed to do some work when wrangling our datasets to figure out the correct partition to load the data into. And when you’re working across multiple timezones this can be extremely challenging. To help us manage this, we became heavily dependant on Cloud Dataflow and it soon became a vital component in our data workloads.
The year is 2018
A few months ago, the BigQuery team released this. Can’t be arsed reading it? No worries, I’ve already done it for you. What this means is that you can now partition a table by using a column in its schema, and not just by the ingestion time. So, no more queries using
_PARTITIONTIME and having to work out the correct partition to load into. What a time to be alive folks!
Note: this feature is still in beta. If that’s OK with you, then read on.
The team were immediately excited by this release. We even made the leap from high-fives to hugs. OK, so you still can’t partition by a non-data/timestamp column (yet), but it’s still great improvement. As per the docs, you can create a column based partitioned table by either:
- Using a DDL
CREATE TABLEstatement with a
- Manually using the command line tool’s
- Programmatically by calling the
- From query results
- When you load data
For example, using the CLI tool (#2), it’s a doddle:
bq mk --table --schema foo:STRING,bar:TIMESTAMP --time_partitioning_field bar --description "this is a shiny new partitioned table" grey-sort-challenge:partitioning_magic.foobarred
Et voilà ! I now have a beautifully partitioned table, and instead of using
_PARTITIONTIME in my queries, I can now simply use
bar as the partition. Yes, it’s a trivial example, but it succinctly highlights the new feature:
bq show grey-sort-challenge:partitioning_magic.foobarred
But, it’s a pretty useless table without some data in it, so let’s fix that:
bq query --destination_table grey-sort-challenge:partitioning_magic.foobarred --use_legacy_sql=false --append_table "$(cat data.sql)"
With a massive 3 rows of data to play with, I can now use the
bar column to only read from the partition(s) I’m interested in:
bq query --use_legacy_sql=false "select * from \`grey-sort-challenge.partitioning_magic.foobarred\` where bar >= '2018-04-14'"
Partitioning a standard table
That’s cool and all, but what if you have a standard table (not one that’s manually sharded using the
_YYYYMMDD suffix) and you want to convert it into a partitioned table? This is where it gets interesting folks. Whizz back up the list of ways to create a partitioned table I mentioned above. See #4 (“From query results“). This means we can now query an existing table and write it back to a newly partitioned table. It will cost you one full table scan. Using the public dataset of NYC taxi trips here’s a walkthrough of how you do it.
I first start by copying the public table over to my own project:
bq cp nyc-tlc:yellow.trips grey-sort-challenge:partitioning_magic.nyc_taxi_trips
Next, I smash it with some 101 SQL, specify the column I want to use for my partition by using the
--time_partitioning_field parameter (I’m using
pickup_datetime in this case), and finally write the results to a new table (it must be a new table according to the docs):
bq query --destination_table grey-sort-challenge:partitioning_magic.nyc_taxi_trips_partitioned --time_partitioning_field pickup_datetime --use_legacy_sql=false "select * from \`grey-sort-challenge.partitioning_magic.nyc_taxi_trips\`"
And there you have….oh wait..
Error in query string: Error processing job ‘grey-sort-challenge:bqjob_r37b076ef0d3c4338_000001626c539e6a_1’: Too many partitions produced by query, allowed 2000, query produces at least 2372 partitions
Know your limits
Why did it fail? Isn’t the number of allowed partitions 2500 according to the docs? And if it’s 2500, why is the error message spitting out the number 2000!?
Scratching my head, and not knowing what was going on here, I turned to Stack Overflow to get it all cleared up by the Google boffins (tip: the BigQuery engineers are extremely active on Stack Overflow, and generously give up their time to answer [my] silly questions). Within a few hours I had my answer:
It turns out that in addition to BigQuery’s 2500 max number of partitions in a table, there is even a stricter limit on the number of partitions you can create in a single operation. Remember, I tried creating a partitioned table in one shot using a query, so I hit the 2000 limit.
I’m happy this happened during this post because it highlights something that you need to be very wary of when working with BigQuery. It does have quite a number of limits and quotas. I understand why they are imposed (i.e. it’s a multi-tenant system), but there are now so many it’s quite easy to forget about them, and you can suddenly find yourself walking face first into them. In my opinion, some of them are just too restrictive, such as this one.
Anyway, in order to work around this limit, I worked out the max
pickup_datetime and subtracted roughly 5 years (~2000 partitions/365 days) worth of data off that and blew away the table with the remaining 5 years worth of data:
Partitioning a standard table (take 2)
Using my dodgy math, I had calculated the table should now have about 5 years worth of data. As such, I hoped I would not hit the 2000 max partitions created in a single operation limit the second time around. I ran the same command as previously, and held my breath..
88 seconds later I got back a “DONE” status. It worked! Taking a look at the table, we can see it’s now a partitioned table (by column
pickup_datetime) and it has 854M rows in it:
Column based partitioning is certainly a welcome addition to BigQuery’s toolkit. It took a while to get there, but it’s great to have it as another option instead of just ingestion based partitioned tables. That said however, it still has a way to go yet in terms of functionality e.g. partitioning by non date/time fields, partition by multiple columns, and lifting those pesky limits.
I also feel it’s fair to say that partitioning in BigQuery can be a little confusing for new users to navigate given its history and multiple implementations. But, hopefully by reading this post it will be somewhat clearer to you. Either that, or I’ve done the opposite and made it even more confusing. I’m sorry if that’s true. I truly am.
Finally, there’s one last thing I need to call out: you can only use Legacy SQL to read partition meta data. Yes, you read that right. See the docs here:
Currently, standard SQL does not support the partition decorator separator (
$) so you cannot query
__PARTITIONS_SUMMARY__in standard SQL
It’s a bit bonkers that you can only use LegacySQL to query a partitioned table for its meta data. I know Google are working on fixing it, but for now I’m going to put that one in the WTF bucket 😉
Ben VoganPosted at 03:53h, 12 April
One of the nice things about using the existing partitioning scheme is that you could specify on a query that you wanted to replace an entire partition using the dataset.tablename$YYYYMMDD table targetting mechanism. Is this still possible with field-based partitioning?
So for example there was a problem in the system on 2018-04-01 and I need to regenerate all of the data for that date. Using field-based partitioning do I have to issue a DELETE followed by INSERTS or can i replace the entire partition atomically like I can with the older partitions?
Kiril (@KAleksovski)Posted at 22:54h, 13 April
Can we bypass a partition limit by loading 2000 partitions wort of data in one single operation and then the rest 500 in the second operation?
Rahul VermaPosted at 22:27h, 19 December
Is there a way to partition a Bigquery table using multiple columns?