At Shine we’re big fans of Google BigQuery, which is their flagship big data processing SaaS. Load in your data of any size, write some SQL, and smash through datasets in mere seconds. We love it. It’s the one true zero-ops model that we’re aware of for grinding through big data without the headache of worrying about any infrastructure. It also scales to petabytes. Although we’ve only got terabytes, but you’ve got to start somewhere right?
If you haven’t yet been introduced to the wonderful world of BigQuery, then I suggest you take some time right after this reading this post to go and check it out. Your first 1TB is free anyway. Bargain!
Anyway, back to the point of this post. There have been a lot of updates to BigQuery in recent months, both internally and via features, and I wanted to capture them all in a concise blog post. I won’t go into great detail on each of them, but rather give a quick summary of each, which will hopefully give readers a good overview of what’s been happening with the big Q lately. I’ve pulled together a lot of this stuff from various Google blog posts, videos, and announcements at GCP Next 2016 etc.
Capacitor (faster queries)
BigQuery was already fast. Ridiculously fast in fact. But now it’s even faster! Up until 2015, BigQuery’s storage engine was ColumnIO. In other words, unlike a traditional DB, data is stored on Google disks in columnar format rather than in row format. That makes scanning data much quicker. It also allows for much better compression too.
Google claim that BigQuery’s brand new storage engine, Capacitor, can accelerate many queries by up to 10x, and in some cases up to 1000x. That’s kind of a big deal! The biggest change appears to be that Capacitor can operate directly on compressed data when querying rather than first having to decompress it. It’s obvious that alone will be a massive boost to performance straight off the bat.
We haven’t done any official benchmarking of our biggest queries yet, but we have noticed BigQuery seems much zippier overall when querying, in particular when we use filters i.e. WHERE clauses. We also love how Google just roll out changes like these under the hood, and we’re none the wiser. We don’t experience any downtime, nor do we have to do any upgrades. It just works, and it just gets faster. Brilliant.
DremelX (faster complex queries)
BigQuery is Dremel. There’s no real difference, except for the fact that internally Google use the name Dremel. It’s the same engineering team, and even the same code base. Dremel is mission critical at Google, and they use it across all business units for their own data analysis. In the words of one Dremel engineer that I had the pleasure in meeting last year: “Dremel is just a humongous query engine that doesn’t care about data size. It will churn through anything you throw at it“.
Under the hood, BigQuery has a tree serving architecture. That means when your query comes into the processing engine it is split up into smaller pieces, each one working on the data that it knows about, and then re-assembles the results the back to you. Your query can run on thousands of cores (Compute Engine instances/VMs) without you even realising it.
Before DremelX, the way in which queries were distributed through the tree were somewhat static. Data would come in at the root level of the tree, and move up and down a long-lived serving tree. Data would be sent off to shards, partially aggregated, and then move back up to the mixers. Rinse and repeat until the query was finished. It’s worth mentioning at this point that all queries (i.e. shuffling) happen completely in memory – no matter of the size of query or data! Woohaa!
DremelX now builds a dynamic custom serving tree just for your query instead. It’s also tailor sized just for that query. That’s awfully nice of Google now isn’t it! That means that the data no longer needs to flow up and down the tree i.e. no more multiple passes to the shards. BigQuery essentially looks at the complexity of your query (number of joins, grouping etc.) when it comes in, and then builds each level/stage in the serving tree on the fly for optimal performance. That’s seriously impressive, and I can only imagine the work and complexity of the software that has gone into making that happen.
Poseidon (faster import/export)
BigQuery can already bulk load your data in pretty quickly. We’ve seen bulk load times of 1TB+ in under 20 minutes. in some of our projects. But for Google, it wasn’t fast enough! They’ve built some brand spanking new import/export pipelines that use the same underlying Dremel infrastructure leveraged by queries. They are claiming Poseidon will give users improvements of up to 5x for ingestion with no degradation to query performance. To be honest, we’ve not seen a massive uplift in ingestion times. But that’s probably because we use Dataflow for a lot of our ETL to/from BigQuery, and Dataflow already crazy fast – and now even faster.
Partitioned tables (alpha)
In the past, we would have to manually shard our data across daily tables. This allowed us to scan only the period we were interested in, instead of paying for a full table scan on all our data. However, there are several disadvantages to this approach.
- Creating the tables was manual
- Performance degrades as the number of referenced tables increases.
- There is a limit of 1000 tables that can be referenced in a single query.
Partitioned tables now address these issues. It’s automatic, performance will not degrade, and querying will be cheaper because you can now slice the data again by day. Simply load your data in, and Google will handle the daily sharding for you. We’ve now started to migrate our existing tables to date partitioned ones. But, as we found out pretty quickly, converting existing tables is not as easy (or cheap) as it should be.
Standard SQL (beta) & insert, update and delete (alpha)
We’ve been using BigQuery since 2012. Since then, we’ve actually become accustomed to its slightly quirky SQL syntax. So much so, that when I was asked recently to do some basic querying in Oracle, I simply couldn’t remember normal SQL syntax anymore!
Google have now released BigQuery with standard SQL (beta at time of writing), which is compliant with the SQL 2011 standard and has extensions that support querying nested and repeated data. It’s a wise move from them, and makes life a lot easier when you need to jump between different DBs. I suspect that it will also entice more users over to BigQuery because they will be more comfortable now.
In addition to standard SQL, Google have also made a bold move. They introduced the ability to insert, update, and delete. This is massive news. BigQuery’s achilles heel has always, for us at least, being that it was append only. You could not update or delete records in your tables. We often found ourselves applying really ugly workarounds to update and delete. It was very messy indeed. Now however, according to this SO post, you can now update and delete – but only using standard SQL. At time of writing it’s in alpha.
One of the awesome things about BigQuery is that you only pay for what you query (and store, but more on that later). Currently that’s $5 per TB queried. You don not have to pay for instances running 24/7 like in the case of something like Amazon Redshift. It’s all on demand i.e. pay as you go.
However, it’s a double edged sword. At the end of the month, if you’ve not been keeping an eye on your bill(s), or you’ve been trigger happy with smashing through mountains of data, then you can be in for a nasty surprise come bill time. Just ask my colleague Gareth Jones about that.
To mitigate this risk, Google now let you impose cost controls on BigQuery. You set set the maximum about of data (in 10TB increments) at the whole project level, or at a user level. Even if your bills are not currently an issue, we suggest imposing costs controls just in case!
Long term storage discounts
In BigQuery you pay for two things – the amount of data that you query, and the storage of data. If you have a lot of data, then storage costs can quickly mount up.
Google have now introduced long term storage discounts. If a table is not edited for 90 consecutive days, the price of storage for that table automatically drops by 50% to $0.01 per GB per month. One thing to note is that this only applies to BigQuery native storage/tables and not federated sources.
Automatic schema inference
When loading data into BigQuery you normally need to specify the schema of the table i.e. column name, and type (e.g. STRING, FLOAT etc.). But not anymore! BigQuery will try its best to automatically infer the schema of the data you are loading or querying (federated source). The following formats are supported:
- Cloud Datastore
Although, auto schema detection only seems to be supported on the command line or via the API. We’d like to see it supported in the web UI too so our business users don’t need to come and ask us to do it for them!
Drive & Sheets integration
Our business users absolutely love this one! They can now save the results of their queries directly in Drive and share/collaborate/visualize them. They also query/load files directly in Drive without the need to export and load them separately into BigQuery. Just remember to enable the Drive APIs for this to work.
Apache AVRO support
We don’t actually use the AVRO format for any of our projects, but it’s something we’ve had an eye on for a while. JSON is inherently a poor choice for a format when you want to load or export massive datasets. Every entry has the same schema, and as such, the representation is extremely redundant, essentially duplicating the schema in string form for every record. AVRO is much more performant as it shares the a single schema for all records. This reduces the size of each individual record to correspond to the actual field values. Using AVRO should make ingestion time much quicker.
Audit logs (beta)
This feature was released back in December 2015, but I think it went somewhat unnoticed. However, we immediately enabled it when it came out, and it has proved invaluable when troubleshooting or analysing our queries and jobs which come flooding in from multiple projects. Audit logs for BigQuery comprise two logs, Administrative Activity and Data Access, both of which are enabled automatically. We also pump the logs back into BigQuery (simply click of a button in the console) where we do further analysis on them.