Scheduling BigQuery jobs using Google Apps Script

Do you recoil in horror at the thought of running yet another mundane SQL script just so a table is automatically rebuilt for you each day in BigQuery? Can you barely remember your name first thing in the morning, let alone remember to click “Run Query” so that your boss gets the latest data refreshed in his fancy Data Studio charts, and then takes all the credit for your hard work?

Well, fear not my fellow BigQuery’ians. There’s a solution to this madness.

It’s simple.

It’s quick.

Yes, it’s Google Apps Script to the rescue.

Disclaimer: all credit for this goes to the one and only Felipe Hoffa. He ‘da man!

To Partition or not to Partition

I have been using BigQuery for over 2 years now at Shine. I’ve found it to be a great tool that is both incredibly fast and able to handle some of our largest workloads. We are processing terabytes of data per day, and each day an extra billion records are added to the store.

But unfortunately this growth is also increasing our costs of running queries. While BigQuery is extremely fast and parallel, it comes at the cost of needing to scan and pay for every record of the columns you are querying. Without the indexes offered by conventional databases, a full table scan is needed for each query. Not only that but when you query large amounts of data the speed of your query slows down:In this post I’ll talk about how we used table partitions to increase the performance of our queries and avoid query slowdowns.

BigQuery & new users – the top “WTF!?” moments

“What the Fudge?”

I use Google BigQuery a lot. On a daily basis I run dozens of queries, use it to build massively scalable data pipelines for our clients, and regularly help new users navigating it for the first time. Suffice it to say I’m somewhat accustomed to its little quirks. Unfortunately, the same can’t be said for the new users who are commonly left scratching their heads, and shouting “What the fudge!?” at their monitors.

Here’s the top three WTFs that I regularly hear from new BigQuery users:

A post Google I/O 2017 conversation with Google Home

OK Google, generate a clickbait title for my Google I/O 2017 blog post

I’ve generated a title, Gareth. What would you like to add next?

OK Google, I’m a bit jet lagged – remind me what I saw at Google I/O 2017

I would love to help, Gareth, but I’m going to need a little more information. Would you like that information in chronological order, or grouped by topic?

Beam me up Google – porting your Dataflow applications to 2.x

Will this post interest me?

If you use (or intend to use) Google Cloud Dataflow, you’ve heard about Apache Beam, or if you’re simply bored in work today and looking to waste some time, then yes, please do read on. This short post will cover why our team finally took the plunge to start porting some of Dataflow applications (using the 1.x Java SDKs) to the new Apache Beam model (2.x Java SDK). Spoiler – it has something to do with this. It will also highlight the biggest changes we needed to make when making the switch (pretty much just fix some compile errors).

Whispers from the other side of the globe with BigQuery

Setting the scene

A couple of months ago my colleague Graham Polley wrote about how we got started analysing 8+ years worth of of WSPR (pronounced ‘whisper’) data. What is WSPR? WSPR, or Weak Signal Propagation Reporter, is signal reporting network setup by radio amateurs for monitoring the ability for radio signals to get from one place to another. Why would I care? I’m a geek and I like data. More specifically the things it can tell us about seemingly complex processes. I’m also a radio amateur, and enjoy the technical aspects of  communicating around the globe with equipment I’ve built myself.

Homer simpson at Radio transceiver
Homer Simpson as a radio Amateur

Triggering Dataflow Pipelines With Cloud Functions

Do you have an unreasonable fear of cronjobs? Find spinning up VMs to be a colossal waste of your towering intellect? Does the thought of checking a folder regularly for updates fill you with an apoplectic rage? If so, you should probably get some help. Maybe find another line of work.

In the meantime, here’s one way to ease your regular file processing anxieties. With just one application of Google Cloud Functions, eased gently up your Dataflow Pipeline, you can find lasting relief from troublesome cronjobs.

Gobbling up big-ish data for lunch using BigQuery

Beers + ‘WSPR’ = fun

To this day, I’m a firm believer in the benefits of simple, informative, and spontaneous conversations with my colleagues – at least with the ones who can stand me long enough to chat with me . Chewing the fat with other like minded folks over a beer or two is a bloody good thing. It’s how ideas are born, knowledge is shared, and relationships are formed. It’s an important aspect of any business that is sadly all too often overlooked.

Analysing Stack Overflow comment sentiment using Google Cloud Platform

The decline of Stack Overflow?

A few months back I read this post from 2015 (yes, I know I’m a little late to the party) about how Stack Overflow (SO) was in serious decline, and heading for total and utter oblivion.  In the post, the first item to be called  out was that SO “hated new users“:

Stack Overflow has always been a better-than-average resource for finding answers to programming questions. In particular, I have found a number of helpful answers to really obscure questions on the site, many of which helped me get past a road block either at work or in my hobby programming. As such, I decided I’d join the site to see if I could help out. Never before has a website given me a worse first impression.

At the time, I remember thinking that this seemed like somewhat of an unfair statement. That was mostly down to the fact that when I joined the community (many years ago), I had fond memories of a smooth on-boarding, and never experienced any snarky remarks on my initial questions. Yes, gaining traction for noobs is very, very hard, but there is a good reason why it exists.

For me, SO is invaluable. How else would I be able to pretend to know what I’m doing? How else could I copy and paste code from some other person who’s obviously a lot smarter than me, and take all the credit for it? Anyway, once I had read the post, and gotten on with my life (e.g. copying and pasting more code from SO), I did’t think too much more about the post. Maybe I had just been lucky with my foray into the SO community?

However, just last week, I was reminded of that post once again, when I noticed that BigQuery (BQ) now has a public dataset which includes all the data from SO – including user comments and answers. Do you see where I am going with this yet? If not, then don’t worry. Neither did I when I started writing this.

Will Athena slay BigQuery?

*Updated on 16th December 2016 – see below

With the announcement of Amazon Athena at this year’s AWS re-invent conference, I couldn’t help but notice its striking similarity with another rival cloud offering. I’m talking about Google’s BigQuery. Athena is a managed service allowing customers to query objects stored in an S3 bucket. Unlike other AWS offerings like Redshift, you only need to pay for the queries you run. There is no need to manage or pay for infrastructure that you may not be using all the time. All you need to do is define your table schema and reference your files in S3. This works in a similar way to BigQuery’s federated sources which reference files in Google Cloud Storage.

Given this, I thought it would be interesting to compare the two platforms to see how they stack up against each other. I wanted to find out which one is the fastest, which one is more feature rich and which is the most reliable.