Introducing column based partitioning in BigQuery

Some background

When we started using Google BigQuery – almost five years ago now – it didn’t have any partitioning functionality built into it.  Heck, queries cost $20 p/TB back then too for goodness’ sake!  To compensate for this lack of functionality and to save costs, we had to manually shard our tables using the well known _YYYYMMDD suffix pattern just like everyone else.  This works fine, but it’s quite cumbersome, has some hard limits, and your SQL can quickly becomes unruly.

Then about a year ago, the BigQuery team released ingestion time partitioning.  This allowed users to partition tables based on the load/arrival time of the data, or by explicitly stating the partition to load the data into (using the $ syntax).  By using the _PARTITIONTIME pseudo-column, users were more easily able to craft their SQL, and save costs by only addressing the necessary partition(s).  It was a major milestone for the BigQuery engineering team, and we were quick to adopt it into our data pipelines.  We rejoiced and gave each other a lot of high-fives.

Google Cloud Community Conference 2018

As a co-organizer for GDG Cloud Melbourne, I was recently invited to the Google Cloud Developer Community conference in Sunnyvale, California. It covered meetup organization strategies and product roadmaps, and was also a great opportunity to network with fellow organizers and Google Developer Experts (GDEs) from around the world.  Attending were 68 community organizers, 50 GDEs and 9 open source contributors from a total of 37 countries.

I would have to say it was the most social conference I have ever attended. There were a lot of opportunities to meet with people from a wide range of backgrounds. I also got many valuable insights into how I could better run our meetup and better make use of Google products. In this post I’ll talk about what we got up to over the two days.

Thoughts on the ‘AWS Certified SysOps Administrator – Associate’ exam

A couple of weeks ago was a significant milestone in my 14-year IT career: I actually sat a certification exam. In this case, it was the AWS Certified SysOps Administrator – Associate Exam.

Despite some trepidation during my preparation for the exam, on the day I found it quite straightforward and came out with a pass mark. In this post I’m going to share some of my thoughts and notes in the hope that it will help others preparing to sit this exam.

Using Google Cloud AutoML to classify poisonous Australian spiders

Warning: This post contains pictures of spiders (and Spiderman)!

Google’s new Cloud AutoML Vision is a new machine learning service from Google Cloud that aims to make state of the art machine learning techniques accessible to non-machine learning experts. In this post I will show you how I was able, in just a few hours, to create a custom image classifier that is able to distinguish between different types of poisonous Australian spiders. I didn’t have any data when I started and it only required a very basic understanding of machine learning related concepts. I could probably show my Mum how to do it!

Trams, Shiners and Googlers!

Shine’s good friend Felipe Hoffa from Google was in Melbourne recently, and he took the time to catch up with our resident Google Developer Expert, Graham Polley. But, instead of just sitting down over a boring old coffee, they decided to take an iconic tram ride around the city. To make it even more interesting, they tested out some awesome Google Cloud technologies by using their phones to spin up a Cloud Dataflow cluster of 50 VMs, and process over 10 billion records of data in under 10 minutes! Check out the video they recorded:

The best code is no code! Using Google Cloud’s new automated services.

Here in Australia, we do a lot of work on Google Cloud Platform for one of the country’s largest ISPs, Telstra. Most of that work involves building data pipelines and running analytics off the back of them for their Media business unit. As you can well imagine, they generate a huge amount of data on a daily basis. We use tools like BigQuery, Cloud Dataflow and Data Studio to wrangle, manage, and understand that data.

On one such project for Telstra, we saw an opportunity to delete three code repositories and finally rid ourselves of some of the headaches associated with maintaining those applications, all the while saving money on the operational costs.

We were able to replace the system comprising these repos with two new Google Cloud Platform services:

In this blog post, I’ll introduce you to those new services that Google have spun up, and how we were able to use them to replace our legacy applications. Who doesn’t like a good spring clean, huh?

Scheduling BigQuery jobs: this time using Cloud Storage & Cloud Functions


Post update: My good friend Lak over at Google has come up with a fifth option! He suggests using Cloud Dataprep to achieve the same. You can read his blog post about that over here. I had thought about using Dataprep, but because it actually spins up a Dataflow job under-the-hood, I decided to omit it from my list. That’s because it will take a lot longer to run (the cluster needs to spin up and it issues export and import commands to BigQuery), rather than issuing a query job directly to the BigQuery API. Also, there are extra costs involved with this approach (the query itself, the Dataflow job, and a Dataprep surcharge – ouch!). But, as Lak pointed out, this would be a good solution if you want to transform your data, instead of issuing a pure SQL request. However, I’d argue that can be done directly in SQL too 😉

Not so long ago, I wrote a blog post about how you can use Google Apps Script to schedule BigQuery jobs. You can find that post right here. Go have a read of it now. I promise you’ll enjoy it. The post got quite a bit of attention, and I was actually surprised that people actually take the time out to read my drivel.

It’s clear that BigQuery’s popularity is growing fast. I’m seeing more content popping up in my feeds than ever before (mostly from me because that’s all I really blog about). However, as awesome as BigQuery is, one glaring gap in its arsenal of weapons is the lack of a built-in job scheduler, or an easy way to do it outside of BigQuery.

That said however, I’m pretty sure that the boffins over in Googley-woogley-world are currently working on remedying that – by either adding schedulers to Cloud Functions, or by baking something directly into the BigQuery API itself. Or maybe both? Who knows!

Fun with Serializable Functions and Dynamic Destinations in Cloud Dataflow

Waterslide analogy. One input, multiple outputs. Each slide represents a date partition in one table.

Do you have some data that needs to be fed into BigQuery but the output must be split between multiple destination tables? Using a Cloud Dataflow pipeline, you could define some side outputs for each destination table you need, but what happens when you want to write to date partitions in a table and you’re not sure what partitions you need to write to in advance? It gets a little messy. That was the problem I encountered, but we have a solution.

My favourite talks from YOW! 2017 Melbourne

No food reviews here I’m afraid

This year I was incredibly lucky to score a coveted ticket to YOW! in beautiful Melbourne. I was also asked to be a track host for a couple of sessions, so that was quite an honour too. This post is a whirlwind wrap-up of the conference, and only includes my favourite talks from the two day event. If you’re hoping to hear detailed reviews on how the coffee/food/WiFi/venue was, then you’ll be greatly disappointed (it was all great BTW).

re:Invent 2017: Day 2

The last time I was fortunate enough to attend AWS’s global conference, re:Invent, was three years ago in 2014. Then there were 14,000 delegates and the conference spanned just two Las Vegas hotels. Lambda was announced during Werner Vogels’ keynote and it seemed that the most in-demand sessions had “Docker” in the title.

In just three years the conference has tripled in size with 43,000 delegates attending this year spread across a campus of six Las Vegas hotels. Although not one of the biggest conferences held in Vegas, it’s obviously a significant logistical challenge. After some hiccups on the first day with the inter-venue shuttles and a venue running out of food, everything seemed to settle down and run smoothly from the start of the second day. Whether the improvement was due to human learnings of the hivemind or training of some Machine Learning algorithms is up for debate but almost certainly it was a combination of the two. No, actually, the transport still is not good and Uber is key to success.