Introducing column based partitioning in BigQuery

Some background

When we started using Google BigQuery – almost five years ago now – it didn’t have any partitioning functionality built into it.  Heck, queries cost $20 p/TB back then too for goodness’ sake!  To compensate for this lack of functionality and to save costs, we had to manually shard our tables using the well known _YYYYMMDD suffix pattern just like everyone else.  This works fine, but it’s quite cumbersome, has some hard limits, and your SQL can quickly becomes unruly.

Then about a year ago, the BigQuery team released ingestion time partitioning.  This allowed users to partition tables based on the load/arrival time of the data, or by explicitly stating the partition to load the data into (using the $ syntax).  By using the _PARTITIONTIME pseudo-column, users were more easily able to craft their SQL, and save costs by only addressing the necessary partition(s).  It was a major milestone for the BigQuery engineering team, and we were quick to adopt it into our data pipelines.  We rejoiced and gave each other a lot of high-fives.

TEL Newsletter – February 2018

Shine’s TEL group was established in 2011 with the aim of publicising the great technical work that Shine does, and to raise the company’s profile as a technical thought-leader in the community through blogs, local meet up talks, and conference presentations. Every now and then (it started off as being monthly, but that was too much work), we curate all the noteworthy things that Shiners have been up to, and publish a newsletter. Read on for this month’s edition.

Trams, Shiners and Googlers!

Shine’s good friend Felipe Hoffa from Google was in Melbourne recently, and he took the time to catch up with our resident Google Developer Expert, Graham Polley. But, instead of just sitting down over a boring old coffee, they decided to take an iconic tram ride around the city. To make it even more interesting, they tested out some awesome Google Cloud technologies by using their phones to spin up a Cloud Dataflow cluster of 50 VMs, and process over 10 billion records of data in under 10 minutes! Check out the video they recorded:

Scheduling BigQuery jobs: this time using Cloud Storage & Cloud Functions

Intro

Post update: My good friend Lak over at Google has come up with a fifth option! He suggests using Cloud Dataprep to achieve the same. You can read his blog post about that over here. I had thought about using Dataprep, but because it actually spins up a Dataflow job under-the-hood, I decided to omit it from my list. That’s because it will take a lot longer to run (the cluster needs to spin up and it issues export and import commands to BigQuery), rather than issuing a query job directly to the BigQuery API. Also, there are extra costs involved with this approach (the query itself, the Dataflow job, and a Dataprep surcharge – ouch!). But, as Lak pointed out, this would be a good solution if you want to transform your data, instead of issuing a pure SQL request. However, I’d argue that can be done directly in SQL too 😉

Not so long ago, I wrote a blog post about how you can use Google Apps Script to schedule BigQuery jobs. You can find that post right here. Go have a read of it now. I promise you’ll enjoy it. The post got quite a bit of attention, and I was actually surprised that people actually take the time out to read my drivel.

It’s clear that BigQuery’s popularity is growing fast. I’m seeing more content popping up in my feeds than ever before (mostly from me because that’s all I really blog about). However, as awesome as BigQuery is, one glaring gap in its arsenal of weapons is the lack of a built-in job scheduler, or an easy way to do it outside of BigQuery.

That said however, I’m pretty sure that the boffins over in Googley-woogley-world are currently working on remedying that – by either adding schedulers to Cloud Functions, or by baking something directly into the BigQuery API itself. Or maybe both? Who knows!

TEL Newsletter – December 2017

Shine’s TEL group was established in 2011 with the aim of publicising the great technical work that Shine does, and to raise the company’s profile as a technical thought-leader in the community through blogs, local meet up talks, and conference presentations. Every now and then (it started off as being monthly, but that was too much work), we curate all the noteworthy things that Shiners have been up to, and publish a newsletter. Read on for this month’s edition.

My favourite talks from YOW! 2017 Melbourne

No food reviews here I’m afraid

This year I was incredibly lucky to score a coveted ticket to YOW! in beautiful Melbourne. I was also asked to be a track host for a couple of sessions, so that was quite an honour too. This post is a whirlwind wrap-up of the conference, and only includes my favourite talks from the two day event. If you’re hoping to hear detailed reviews on how the coffee/food/WiFi/venue was, then you’ll be greatly disappointed (it was all great BTW).

Scheduling BigQuery jobs using Google Apps Script

Do you recoil in horror at the thought of running yet another mundane SQL script just so a table is automatically rebuilt for you each day in BigQuery? Can you barely remember your name first thing in the morning, let alone remember to click “Run Query” so that your boss gets the latest data refreshed in his fancy Data Studio charts, and then takes all the credit for your hard work?

Well, fear not my fellow BigQuery’ians. There’s a solution to this madness.

It’s simple.

It’s quick.

Yes, it’s Google Apps Script to the rescue.

Disclaimer: all credit for this goes to the one and only Felipe Hoffa. He ‘da man!

TEL Newsletter – October 2017

Shine’s TEL group was established in 2011 with the aim of publicising the great technical work that Shine does, and to raise the company’s profile as a technical thought-leader through blogs, local meet up talks, and conference presentations. Each month, the TEL group gather up all the awesome things that Shine folk have been getting up to in and around the community. Here’s the latest roundup from what’s been happening.

TEL monthly newsletter – July 2017

Shine’s TEL group was established in 2011 with the aim of publicising the great technical work that Shine does, and to raise the company’s profile as a technical thought-leader through blogs, local meet up talks, and conference presentations. Each month, the TEL group gather up all the awesome things that Shine folk have been getting up to in and around the community. Here’s the latest roundup from what’s been happening.

TEL monthly newsletter – June 2017

Shine’s TEL group was established in 2011 with the aim of publicising the great technical work that Shine does, and to raise the company’s profile as a technical thought-leader through blogs, local meet up talks, and conference presentations. Each month, the TEL group gather up all the awesome things that Shine folk have been getting up to in and around the community. Here’s the latest roundup from what’s been happening.