Gobbling up big-ish data for lunch using BigQuery

Beers + ‘WSPR’ = fun

To this day, I’m a firm believer in the benefits of simple, informative, and spontaneous conversations with my colleagues – at least with the ones who can stand me long enough to chat. Chewing the fat with other like-minded folks over a beer or two is a bloody good thing. It’s how ideas are born, knowledge is shared, and relationships are formed. It’s an important aspect of any business that is sadly all too often overlooked.

[Image: amateur radio is still alive and kicking – journey to becoming a ham]

No, that’s not us at the gathering

At a recent gathering of Shiners (yes, there were beers involved), I got chatting with fellow Shiner Aaron Brown. We were nonchalantly discussing the client projects that we were currently working on (and bad-mouthing our manager Luke like we always do), when the topic turned to our side projects.

Aaron began to tell me about his keen interest in WSPR, and how he was looking for a good/fast way to analyse all the public data that the project makes available for download on its website.

The Weak Signal Propagation Reporter Network is a group of amateur radio operators using K1JT’s MEPT_JT digital mode to probe radio frequency propagation conditions using very low power (QRP/QRPp) transmissions. The software is open source, and the data collected are available to the public through [the] site.

He had some cool ideas for analysing the data, but so far had only been able to analyse a small part of it using a local MySQL instance. I was quick to tell him that we’d been really impressed by Google BigQuery on our projects at Telstra, and how I thought it might be just the ticket. It would easily be able to handle the amount of data he was looking at – and with $300 of free credit on Google Cloud Platform, Aaron would easily be able to analyse the WSPR data to his heart’s content (Google charge $5 per TB queried in BigQuery).

We agreed that when we both had some spare time we’d give it a shot and try to load it into BigQuery so he could begin analysing it. The data Aaron wanted to look at was about eight years’ worth: roughly 500M rows and 60GB. Not a huge amount by any stretch of the imagination, but enough to warrant a BigQuery solution in my opinion. It sounded like a good plan, and more importantly, it sounded like great fun.

One hangover… eh… I mean, day… later, I was fumbling around on my machine trying to cause some trouble during my lunch break, when it occurred to me to do the WSPR load for Aaron. It should only take an hour, I thought.

I was wrong. It took less than 45 minutes.

Entrée: spin up a Google Compute Engine instance

Duration: about 30 seconds

Actually, I didn’t even need to spin up a GCE instance (VM). The Google Cloud console has a really cool feature called “Cloud Shell”. It gives you command-line access to your own personal GCE instance under the hood – directly in the browser. It even comes with the gcloud tools and components out-of-the-box. Awesome.

[Screenshot: Cloud Shell in the Google Cloud console]
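
If you want to double-check what’s on the box before starting, a few quick commands in Cloud Shell will confirm the pre-installed tooling (nothing here is specific to my project or setup):

gcloud --version   # Cloud SDK core
gsutil version     # Cloud Storage CLI
bq version         # BigQuery CLI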

Mains: eight lines of bash

Duration: ~30 minutes

The WSPR download page clearly lists all the files available for download, with the data going back to March 2008. So a simple bash script with a loop and wget was all I needed to fetch the files. Once the files were downloaded, a single call to gsutil uploaded them to a bucket (‘wsprnet’) in Google Cloud Storage. And that was that.

#!/usr/bin/env bash
# Fetch each monthly WSPR archive (2008-03 through 2017-01) and push the lot to GCS.
START=2008-03
while [ "$START" != 2017-02 ]; do
  echo "$START"
  wget -P /etc/wsprnet/ "http://wsprnet.org/archive/wsprspots-${START}.csv.gz"
  # Move on to the next month (GNU date, as found in Cloud Shell)
  START=$(date -d "${START}-01 +1 month" +%Y-%m)
done
gsutil -m cp /etc/wsprnet/* gs://wsprnet/
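
Not strictly necessary, but a quick look at the bucket is a nice sanity check that everything landed before moving on (bucket name as per the script above):

gsutil ls -l gs://wsprnet/ | tail -n 3   # the last few objects, plus a TOTAL line
gsutil du -s -h gs://wsprnet/            # overall size of the uploaded files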

Dessert: load data from GCS into BigQuery

Duration: ~10 minutes

Once the bash script had finished, I had all the data sitting nice and pretty in GCS. I had downloaded the compressed (GZ) versions of the files so the transfer would be quicker, and I knew from some previous work I’d done with BigQuery that it supports loading GZIP files straight in from GCS. Too easy.

[Screenshot: the files sitting in the ‘wsprnet’ bucket in Google Cloud Storage]

All I needed to do next was load the files into a table in BigQuery. So I flicked over to the BigQuery web UI and configured the following simple load job:

[Screenshot: the BigQuery load job configuration]
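
For anyone who’d rather stay in the shell, roughly the same load job can be kicked off with the bq command-line tool. This is just a sketch – the dataset and table names (‘wspr’ and ‘spots’) are placeholders I’ve picked for illustration, not what we actually used:

bq mk wspr                            # placeholder dataset to hold the table
bq load \
  --source_format=CSV \
  --autodetect \
  wspr.spots \
  "gs://wsprnet/wsprspots-*.csv.gz"   # BigQuery reads the gzipped CSVs directly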

That took just 10 minutes to load 500M rows and about 60GB of data. Nice. The final result of that load job was all the WSPR data now waiting to be analysed by Aaron!

[Screenshot: the completed load job]

[Screenshot: the loaded WSPR table in BigQuery]

Digestive

Duration: 1 minute

Tools like BigQuery are commoditising big data. It’s so easy and quick to get up and running nowadays, no matter the size of the data you want to analyse. Who wants to be mucking around with Hadoop/Hive clusters, or on-prem database instances, when you can use tools like these, seriously?

Stay tuned for part two of this post, when Aaron will tell us all about how his analysis went using BigQuery. Oh, and by the way, we’ve made the table in BigQuery public, so anyone can access/analyse it. You can find it here.
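
If you’d prefer the command line over the web UI, a query along these lines will do it once you substitute the real table name from the link above – the name below is just a placeholder:

bq query --use_legacy_sql=false \
  'SELECT COUNT(*) AS spots FROM `some-project.wspr.spots`'   # placeholder table name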
