Introducing column-based partitioning in BigQuery

Some background

When we started using Google BigQuery – almost five years ago now – it didn’t have any partitioning functionality built into it.  Heck, queries cost $20 per TB back then too, for goodness’ sake!  To compensate for this missing functionality and to save costs, we had to manually shard our tables using the well-known _YYYYMMDD suffix pattern, just like everyone else.  This works fine, but it’s quite cumbersome, has some hard limits, and your SQL can quickly become unruly.
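To give a rough feel for what that sharding pattern looks like, here’s a minimal sketch using the @google-cloud/bigquery Node.js client and the modern standard-SQL wildcard syntax for addressing date-suffixed shards. The project, dataset and table names are made up – this isn’t our actual pipeline code:

```typescript
// A sketch of querying _YYYYMMDD-sharded tables (standard SQL wildcard
// syntax). Project, dataset and table names are hypothetical.
import {BigQuery} from '@google-cloud/bigquery';

const bigquery = new BigQuery();

async function queryOneWeekOfShards(): Promise<void> {
  // The shards look like events_20171001, events_20171002, ... and are
  // addressed with a wildcard table plus a _TABLE_SUFFIX filter.
  const query = `
    SELECT user_id, COUNT(*) AS hits
    FROM \`my-project.my_dataset.events_*\`
    WHERE _TABLE_SUFFIX BETWEEN '20171001' AND '20171007'
    GROUP BY user_id`;

  const [rows] = await bigquery.query({query});
  console.log(`Fetched ${rows.length} rows from 7 shards`);
}

queryOneWeekOfShards().catch(console.error);
```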

Then, about a year ago, the BigQuery team released ingestion-time partitioning.  This allowed users to partition tables based on the load/arrival time of the data, or to explicitly state the partition to load the data into (using the $ syntax).  By using the _PARTITIONTIME pseudo-column, users could craft their SQL more easily and save costs by only addressing the necessary partition(s).  It was a major milestone for the BigQuery engineering team, and we were quick to adopt it into our data pipelines.  We rejoiced and gave each other a lot of high-fives.
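As a hedged illustration (dataset and table names are placeholders, not our real ones), here’s what addressing a single ingestion-time partition looks like from the same Node.js client – only the filtered partition is scanned and billed:

```typescript
// A sketch of querying one ingestion-time partition via the
// _PARTITIONTIME pseudo-column. Names are hypothetical.
import {BigQuery} from '@google-cloud/bigquery';

const bigquery = new BigQuery();

async function queryOnePartition(): Promise<void> {
  // Only the 2017-10-01 partition is scanned (and billed).
  const query = `
    SELECT user_id, COUNT(*) AS hits
    FROM \`my-project.my_dataset.events\`
    WHERE _PARTITIONTIME = TIMESTAMP('2017-10-01')
    GROUP BY user_id`;

  const [rows] = await bigquery.query({query});
  console.log(`Fetched ${rows.length} rows from one partition`);
}

// Loads can target a specific partition too, by appending the "$" decorator
// to the destination table, e.g. my_dataset.events$20171001.
queryOnePartition().catch(console.error);
```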

Scheduling BigQuery jobs: this time using Cloud Storage & Cloud Functions

Intro

Post update: My good friend Lak over at Google has come up with a fifth option! He suggests using Cloud Dataprep to achieve the same result. You can read his blog post about that over here. I had thought about using Dataprep, but because it actually spins up a Dataflow job under the hood, I decided to omit it from my list. It takes a lot longer to run (the cluster needs to spin up, and it issues export and import commands to BigQuery) than issuing a query job directly to the BigQuery API. There are also extra costs involved with this approach (the query itself, the Dataflow job, and a Dataprep surcharge – ouch!). But, as Lak pointed out, this would be a good solution if you want to transform your data rather than issue a pure SQL request. However, I’d argue that can be done directly in SQL too 😉
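For context, “issuing a query job directly to the BigQuery API” amounts to something like the sketch below. It uses the @google-cloud/bigquery Node.js client, and the query, dataset and table names are purely illustrative – it’s not the code from either post:

```typescript
// A sketch of issuing a query job straight to the BigQuery API with the
// @google-cloud/bigquery Node.js client. Query, dataset and table names
// are illustrative only.
import {BigQuery} from '@google-cloud/bigquery';

const bigquery = new BigQuery();

export async function rebuildSummaryTable(): Promise<void> {
  const [job] = await bigquery.createQueryJob({
    query: `
      SELECT user_id, COUNT(*) AS hits
      FROM \`my-project.my_dataset.events\`
      GROUP BY user_id`,
    // Write the result into a table that downstream dashboards read from.
    destination: bigquery.dataset('my_dataset').table('daily_summary'),
  });

  await job.getQueryResults(); // block until the job finishes
  console.log(`Query job ${job.id} complete`);
}
```

No cluster to spin up, no export/import round trip – you pay for the query and nothing else.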

Not so long ago, I wrote a blog post about how you can use Google Apps Script to schedule BigQuery jobs. You can find that post right here. Go have a read of it now – I promise you’ll enjoy it. The post got quite a bit of attention, and I was genuinely surprised that people take the time to read my drivel.

It’s clear that BigQuery’s popularity is growing fast. I’m seeing more content popping up in my feeds than ever before (mostly from me, because that’s all I really blog about). However, as awesome as BigQuery is, one glaring gap in its arsenal is the lack of a built-in job scheduler, or an easy way to schedule jobs from outside BigQuery.

That said, I’m pretty sure the boffins over in Googley-woogley-world are currently working on remedying that – either by adding schedulers to Cloud Functions, or by baking something directly into the BigQuery API itself. Or maybe both? Who knows!

Scheduling BigQuery jobs using Google Apps Script

Do you recoil in horror at the thought of running yet another mundane SQL script just so a table is rebuilt for you each day in BigQuery? Can you barely remember your name first thing in the morning, let alone remember to click “Run Query” so that your boss gets the latest data refreshed in his fancy Data Studio charts, and then takes all the credit for your hard work?

Well, fear not my fellow BigQuery’ians. There’s a solution to this madness.

It’s simple.

It’s quick.

Yes, it’s Google Apps Script to the rescue.

Disclaimer: all credit for this goes to the one and only Felipe Hoffa. He ‘da man!
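To sketch the idea (this isn’t Felipe’s original code, and every ID below is a placeholder): one Apps Script function rebuilds a destination table by submitting a query job through the BigQuery advanced service, and another installs a daily time-driven trigger so you never have to click “Run Query” again.

```typescript
// A rough sketch for the Apps Script editor. BigQuery is the Apps Script
// advanced service and ScriptApp is an Apps Script global, so we declare
// them here to keep the sketch self-contained.
declare const BigQuery: any;   // Apps Script "BigQuery" advanced service
declare const ScriptApp: any;  // Apps Script trigger API

function rebuildDailyTable(): void {
  const request = {
    configuration: {
      query: {
        query: 'SELECT user_id, COUNT(*) AS hits FROM `my_dataset.events` GROUP BY user_id',
        useLegacySql: false,
        destinationTable: {
          projectId: 'my-project',
          datasetId: 'my_dataset',
          tableId: 'daily_summary',
        },
        writeDisposition: 'WRITE_TRUNCATE',  // rebuild the table each run
      },
    },
  };
  BigQuery.Jobs.insert(request, 'my-project');
}

// Run this once; Apps Script will then call rebuildDailyTable() every morning.
function installDailyTrigger(): void {
  ScriptApp.newTrigger('rebuildDailyTable')
      .timeBased()
      .everyDays(1)
      .atHour(6)
      .create();
}
```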

Triggering Dataflow Pipelines With Cloud Functions

Do you have an unreasonable fear of cronjobs? Find spinning up VMs to be a colossal waste of your towering intellect? Does the thought of checking a folder regularly for updates fill you with an apoplectic rage? If so, you should probably get some help. Maybe find another line of work.

In the meantime, here’s one way to ease your regular file processing anxieties. With just one application of Google Cloud Functions, eased gently up your Dataflow Pipeline, you can find lasting relief from troublesome cronjobs.
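One way that combination can look – a sketch only, not necessarily how this post wires it up – is a Cloud Function that fires whenever a file lands in a bucket and launches a templated Dataflow job for it. The project, bucket, template path and parameter names below are invented, and the googleapis templates.launch call reflects my assumption about how the Dataflow REST API is exposed in that client:

```typescript
// A hedged sketch: a storage-triggered Cloud Function that launches a
// templated Dataflow job for each new file. All names are hypothetical.
import {google} from 'googleapis';

const dataflow = google.dataflow('v1b3');

// Background Cloud Function triggered by google.storage.object.finalize.
export async function onFileArrival(file: {bucket: string; name: string}): Promise<void> {
  const auth = await google.auth.getClient({
    scopes: ['https://www.googleapis.com/auth/cloud-platform'],
  });

  // Assumed mapping of the Dataflow templates.launch REST method.
  await dataflow.projects.templates.launch({
    auth,
    projectId: 'my-project',                          // hypothetical project
    gcsPath: 'gs://my-bucket/templates/my-template',  // hypothetical template
    requestBody: {
      jobName: `process-${Date.now()}`,
      parameters: {inputFile: `gs://${file.bucket}/${file.name}`},
    },
  });

  console.log(`Launched Dataflow job for gs://${file.bucket}/${file.name}`);
}
```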

Filling the BigQuery paddling pool from the Kinesis Hosepipe

Introduction

This blog post details how we solved the problem of analysing large amounts of HTTP request logs for one of our clients. Spoiler: we used Amazon’s Kinesis and Lambda to stream the data into Google’s BigQuery for analysis. Read on for the juicy details!
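As a taste of the architecture (a simplified sketch, not the production code – the dataset and table names are placeholders): a Lambda function decodes the log records arriving from Kinesis and streams them straight into a BigQuery table.

```typescript
// A simplified sketch: AWS Lambda decodes Kinesis records and streams them
// into BigQuery. Dataset and table names are placeholders.
import {KinesisStreamEvent} from 'aws-lambda';
import {BigQuery} from '@google-cloud/bigquery';

const bigquery = new BigQuery();

export const handler = async (event: KinesisStreamEvent): Promise<void> => {
  // Kinesis delivers record payloads base64-encoded.
  const rows = event.Records.map((record) =>
    JSON.parse(Buffer.from(record.kinesis.data, 'base64').toString('utf8'))
  );

  // Streaming insert into BigQuery; rows become queryable within seconds.
  await bigquery.dataset('request_logs').table('http_requests').insert(rows);
  console.log(`Inserted ${rows.length} log records into BigQuery`);
};
```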

Performance Comparison Between Node.js and Java EE For Reading JSON Data from CouchDB

Node.js

Node.js has impressed me several times with high performance right out of the box. In my last Node.js project it was the same: we beat the given performance targets without having to tweak the application at all. I never really experienced this in Java EE projects. Granted, the project was perfectly suited to Node.js: a small project centered around fetching JSON documents from CouchDB. Still, I wanted to know: how would Java EE compare to Node.js in this particular case?
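For a sense of the workload being compared (a minimal sketch; database and document IDs are placeholders): fetching a document from CouchDB is just an HTTP GET that returns JSON, which Node.js handles asynchronously out of the box.

```typescript
// A minimal sketch of fetching a JSON document from CouchDB's HTTP API.
// Database and document IDs are placeholders.
import * as http from 'http';

function getDocument(id: string): Promise<unknown> {
  return new Promise((resolve, reject) => {
    // CouchDB exposes documents as plain JSON at /{db}/{docid}.
    http.get(`http://localhost:5984/mydb/${id}`, (res) => {
      let body = '';
      res.on('data', (chunk) => (body += chunk));
      res.on('end', () => resolve(JSON.parse(body)));
    }).on('error', reject);
  });
}

getDocument('user-42').then((doc) => console.log(doc));
```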

JavaOne 2013 Day Zero

The 18th JavaOne started this Sunday in San Francisco. Spanning three hotels in downtown SF – the Hilton, Parc 55 and the Nikko – with keynotes in the Moscone Center, Oracle OpenWorld is hands down the biggest conference I’ve ever attended.  It runs for five days in total with over 500 sessions, plus side exhibitions and events. And it certainly needs the room, not only to accommodate the number of attendees going to these sessions, but also the breadth of the platform itself.

Video of Marc Fasel’s Presentation at Web Directions Code 2013

Shine Technologies’ senior consultant Marc Fasel gave a presentation on asynchronous JavaScript with Node.js at this year’s Web Directions Code conference, held May 2-3. The title of the presentation is Put on Your Asynchronous Hat and Node, and the video is now available.

Web Directions Code 2013

The second incarnation of the Web Directions Code conference took place this year on 2-3 May, and I was fortunate enough to both attend and speak. The conference, held in Melbourne Town Hall, was the place to geek out about all things web: JavaScript, HTML5, and CSS3.  More than 250 web professionals turned up to learn about new directions the web is taking. In this blog I will summarise the talks that left the biggest impression on me.