31 May 2017 BigQuery & new users – the top “WTF!?” moments
“What the Fudge?”
I use Google BigQuery a lot. On a daily basis I run dozens of queries, use it to build massively scalable data pipelines for our clients, and regularly help new users navigating it for the first time. Suffice it to say I’m somewhat accustomed to its little quirks. Unfortunately, the same can’t be said for the new users who are commonly left scratching their heads, and shouting “What the fudge!?” at their monitors.
Here’s the top three WTFs that I regularly hear from new BigQuery users:
WTF #1 – Streamed data doesn’t show up
BigQuery has an awesome streaming API. When new users start streaming data into BigQuery, the first thing they do is eyeball the table using the preview to check if the data have arrived safe and sound (and understandably so). The problem is that there is a delay in streamed data showing up in the table preview. In addition to that, over the years I’ve come to learn that the table preview (and the table details for that matter) is buggy, and unfortunately can’t be trusted. Mwaaah!
How to turn that frown upside down: Always query the table directly to check whether or not it contains data. Never rely on the table preview.
WTF #2 – Running Standard SQL queries with Legacy SQL enabled (and vice versa)
This one can be absolute noodle-scratcher for new users. I’ve even witnessed some folk have total meltdowns, and hurl their machines across the office (at me) in utter frustration. BigQuery has two SQL dialects; standard and legacy.
Legacy is the dialect that was initially used when BigQuery was released into the wild back in 2012. It’s not pure SQL, and it’s not that bad to use – it’s just a bit funky (the syntax that is). I must admit, it took me some time to get used to it. However, some time last year Google made the (smart) move to support the 2011 SQL standard, which made a lot of people very happy indeed.
The confusion arises from the fact that by default BigQuery runs queries using legacy SQL. As a result, I commonly see new users start writing their queries using standard dialect (as they should do), but they fail to realise that they need to explicitly tell BigQuery they are using standard SQL. As a consequence, their queries won’t run, and they get all sorts of gnarly errors spat back at them. Ouch!
How to turn that frown upside down: Always use standard SQL (unless you want to take advantage of table decorators) when writing your queries. Be sure to disable the “Use Legacy SQL” option in the web UI, or get into the habit of prefixing all your queries with
#standardSQL, which will instruct BigQuery to run the query in standard mode.
WTF #3 – BigQuery doesn’t compress data (it actually does)
BigQuery sits atop of a state-of-the-art storage engine called ‘Capacitor‘. Capacitor stores data in columnar format rather than in a row oriented fashion. This storage technique facilitates phenomenally fast queries, and yes – before you ask – much better compression ratios. In addition, Capacitor can (in contrary to its predecessor ‘ColumnIO‘) can operate directly on compressed data when querying rather than first having to decompress it. Cool.
Why all the fuss then? Well, when users load a compressed file into BigQuery, it shows up as being much larger than its original compressed size. This incorrectly leads new users to think that BigQuery is not using compression under the hood. What’s happening is that BigQuery is displaying the uncompressed size to users. Why would Google do this? For one simple reason – queries are charged based on uncompressed rather than compressed data. In comparison, AWS’s Athena does charge based on the compressed size – #justsayin.
How to turn that frown upside down: Just remember that BigQuery always displays the uncompressed size of the data. That’s all folks.
Just like everything else, BigQuery isn’t perfect. But, it’s pretty darn close if you ask me. Yes, it’s got some minor annoyances – but show me a tool that doesn’t, and I’ll buy you a pint (if you live nearby and like Guinness). The way it scales, and its zero-ops model are what set it apart from all the rest. When I can effortlessly smash through 100 billion rows (10TB) in 45 seconds, then I’ll happily live with a few monitors been thrown in my direction by new users after they’ve spent a day trying to figure out why their streamed data isn’t showing up. Ahh, sheer bliss.