02 Jun 2017 To Partition or not to Partition
I have been using BigQuery for over 2 years now at Shine. I’ve found it to be a great tool that is both incredibly fast and able to handle some of our largest workloads. We are processing terabytes of data per day, and each day an extra billion records are added to the store.
But unfortunately this growth is also increasing our costs of running queries. While BigQuery is extremely fast and parallel, it comes at the cost of needing to scan and pay for every record of the columns you are querying. Without the indexes offered by conventional databases, a full table scan is needed for each query. Not only that but when you query large amounts of data the speed of your query slows down:In this post I’ll talk about how we used table partitions to increase the performance of our queries and avoid query slowdowns.
Last year a new BigQuery feature was released which allows you to partition a table based on time. This was a great option for us, as all of our data has a timestamp column, and most of our queries are selecting data between two dates that are close together.
Partitioned tables will give us a huge advantage when running our statistical model due to our need to often overwrite the records for a day’s data. In the past with BigQuery’s insert-only model, we had to run a full-table query that excluded the day we want to delete, and then overwrite the table with the query result by doing a
write truncate job. We’d then append the updated data for the day. This proved expensive and time-consuming. Now with partitioned tables, we could simply update the partition by doing a
write truncate on the partition for the day we want to update.
So by partitioning based on date and querying data based on the partition-date, we could reduce the cost of running queries significantly. Now, all we needed to do was partition our tables and we’d start saving money. Simple, right?
Not So Simple
The catch is that BigQuery does not provide a simple way to partition a table based on an existing timestamp. It only allows you to create a partitioned table based on a date-shared table. Unfortunately, none of our tables are date-shared, so we wouldn’t be able to create a partitioned table in this way.
So how else could we create a date-partitioned table? Well, the simplest way would be to write a script which creates an empty date-partitioned table, then runs a query to get the data for each and every day, and write the result back to the correct partition.
The problem with this approach is its enormous cost. As my colleague Gareth pointed out in this blog, you need to do a full table scan for each day in your table. So a 2TB table which contains 2 years of data would require 730 table scans, which amounts to about 1.4PB of data scanned, which in-turn would end up costing US$7,300 – just to partition one table. Not all of our tables are as big as this, but it was clearly not going to be a viable option due to the cost.
Maybe Google Can offer some help?
In search of a better was of partitioning our tables, we reached out to our friends at Google and ended up in touch with the BigQuery engineering team. We asked them if there was another option for date-partitioning our tables. They agreed that the only option was for us to query the tables multiple times and writing the result back to the partitions. So it sounded like the task was either going to cost us a fortune or we would need to find another solution on our own.
How about Dataflow?
With Google yet to offer us a solution, we decided to explore other options. We thought about using Google Cloud Dataflow to do a single table scan and write each record to the correct partition by using side outputs to write to each table partition. Put differently, we would need to write a Dataflow pipeline which would read in the data from an unpartitioned table and then write it into the correct partitions of a partitioned table.
With this approach, we can ensure that only one table scan is completed for each table we need to partition. However, we quickly realised that using any significant number of side-outputs would cause the Dataflow pipeline to crash. The only possible way I could see this working was to split the job into multiple small pipelines where each pipeline would read in the same table and handle a few partitions. This, however, would be really slow and also expensive.
Stack Overflow Can Help?
Next, we came across someone on Stack Overflow who was also having this problem. The accepted answer suggested that we should create an intermediate pivot table. This would put each day into its own column, with each day containing a leaf field for each column. With each day now being a column, we could now query a day worth of data, flatten it and write it to a partition. So the result would be two table scans to produce a partitioned table. This initially seemed like a promising solution to our problem, however, we hit a roadblock:
This was quite annoying. We seemed to have reached a limit on the number of leaf fields. A leaf field is effectively a column inside a column (636 days x 24 columns). So, once we figured out that the limit is actually 10,000, the solution was simple: half the number of leaf fields and create the pivot table in two parts. Not ideal, but it was something that we could live with. But sadly, that didn’t work either:
So there appeared to be a 100MB limit per row in BigQuery, which is less than a day’s data. At this point we realised we would be unable to partition our tables in this way. So we were back to looking for a way to partition the tables again.
Could there be another way?
We reached out to the Google Cloud Dataflow team to see if they had any other ideas for how could partition the tables. We were given a code sample that we could use, the only issue being that it wouldn’t work for more than 200 days. However, it meant that we could partition each table using only 4 scans instead of more than 600, making us significant cost saving. Now, all we need to do was adapt the code sample to go through all of our tables and partition them.
So we had a solution, what now?
So we now had a solution that allowed us to use Cloud Dataflow to partition the tables.
However, there was one final challenge to overcome. We found that when copying a partitioned table using
write truncate to replace a non-partitioned table, it would copy the data in the table but not the actual partitions. So we would be left with an unpartitioned table, which is completely pointless. The only way we could find to get around this was to delete the table in production and then copy the partitioned table version into its place.
This was not ideal because our partitioned tables would take up to 20 minutes to copy across, during which time we would not have any tables in production. In the end we just had to do the release after business hours, but it would have been nice to have a way to make the change without any downtime.
With Shine being a Google consulting partner, it’s been great to have access to people in Google’s engineering teams who know a lot about how these products work, and to be able to work with them to find solutions to our problems.
While our final solution for turning existing tables into partitioned tables was not optimal, word is that there will be an upcoming feature which will allow you to do this much more easily. Unfortunately, we weren’t able to wait, as our costs were increasing each month. The good news was that, with a bit of help and ingenuity, we were able to get the result we needed in the end.