
18 Jan 2017 Gobbling up big-ish data for lunch using BigQuery
Beers + ‘WSPR’ = fun
To this day, I’m a firm believer in the benefits of simple, informative, and spontaneous conversations with my colleagues – at least with the ones who can stand me long enough to chat. Chewing the fat with other like-minded folks over a beer or two is a bloody good thing. It’s how ideas are born, knowledge is shared, and relationships are formed. It’s an important aspect of any business that is sadly all too often overlooked.
At a recent gathering of Shiners (yes, there were beers involved), I got chatting with fellow Shiner Aaron Brown. We were nonchalantly discussing the client projects that we were currently working on (and bad-mouthing our manager Luke like we always do), when the topic turned to our side projects.
Aaron began to tell me about his keen interest in WSPR, and how he was looking for a good/fast way to analyse all the public data that they had available for download on the website.
The Weak Signal Propagation Reporter Network is a group of amateur radio operators using K1JT’s MEPT_JT digital mode to probe radio frequency propagation conditions using very low power (QRP/QRPp) transmissions. The software is open source, and the data collected are available to the public through [the] site.
He had some cool ideas for analysing the data, but so far had only been able to analyse a small part of it using a local MySQL instance. I was quick to tell him that we’d been really impressed by Google BigQuery on our projects at Telstra, and how I thought it might be just the ticket. It would easily be able to handle the amount of data he was looking at – and with $300 of free credit on Google Cloud Platform, Aaron would easily be able to analyse the WSPR data to his heart’s content (Google charges $5 per TB queried in BigQuery).
We agreed that when we both had some spare time we’d give it a shot and try to load it into BigQuery so he could begin analysing it. The data Aaron wanted to look at was about eight years’ worth: 500M rows and roughly 60GB. Not a huge amount by any stretch of the imagination, but enough to warrant a BigQuery solution in my opinion. It sounded like a good plan, and more importantly, it sounded like great fun.
One hangover…eh…I mean day…later, I was fumbling around on my machine during my lunch break, trying to cause some trouble, when it occurred to me to do the WSPR load for Aaron. It should take only an hour, I thought.
I was wrong. It took less than 45 minutes.
Entrée: spin up a Google Compute Engine instance
Duration: about 30 seconds
Actually, I didn’t even need to spin up a GCE instance (VM). The Google Cloud console has a really cool feature called “Cloud Shell”. It gives you command-line access to your own personal GCE instance under the hood – directly in the browser. It even comes with the gcloud tools and components out-of-the-box. Awesome.
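Just to illustrate what you get for free (the exact versions and output will of course vary), a quick sanity check from the Cloud Shell prompt might look something like this:

# The Cloud SDK tools come pre-installed in Cloud Shell
gcloud version
gsutil version
bq version

# Check which project Cloud Shell is currently pointed at
gcloud config list project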
Mains: eight lines of bash
Duration: ~30 minutes
The WSPR download page clearly lists all the files available for download, with data going back to 2008-03. So, a simple bash script with a loop and wget was all I needed to fetch the files. Once the files were downloaded, a single call to gsutil uploaded them to a bucket (‘wsprnet’) in Google Cloud Storage. And that was that.
#!/usr/bin/env bash
# Download the monthly WSPR archives (2008-03 through 2017-01), then push them to GCS.
START=2008-03
while [ "$START" != 2017-02 ]; do
  echo $START
  wget -P /etc/wsprnet/ http://wsprnet.org/archive/wsprspots-${START}.csv.gz
  # Advance to the next month (GNU date, as found in Cloud Shell;
  # the macOS/BSD equivalent is: date -j -v +1m -f "%Y-%m" ${START} +%Y-%m)
  START=$(date -d "${START}-01 +1 month" +%Y-%m)
done
# Copy everything up to the 'wsprnet' bucket in parallel (-m)
gsutil -m cp /etc/wsprnet/* gs://wsprnet/
Dessert: load data from GCS into BigQuery
Duration: ~10 minutes
Once the bash script had finished, I had all the data sitting nice and pretty in GCS. I had downloaded the compressed (GZ) versions of the files so it would be quicker. I knew from some previous work that I’d done with BigQuery that it supports loading GZIP files in from GCS. Too easy.
All I needed to do next was load the files into a table in BigQuery. So, I flicked over to the BigQuery web UI and configured a simple load job pointing at the compressed files sitting in the ‘wsprnet’ bucket.
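For anyone who prefers the command line over the web UI, a roughly equivalent load can be kicked off with the bq tool. This is just a sketch – the dataset and table names (wspr.spots) and the use of schema auto-detection are my own assumptions here, rather than the exact settings from the web UI:

# Create a dataset to hold the table (names are illustrative)
bq mk wspr

# Load all the gzipped CSVs straight from GCS into one table.
# BigQuery decompresses GZIP CSV files automatically during the load.
bq load \
  --source_format=CSV \
  --autodetect \
  wspr.spots \
  "gs://wsprnet/wsprspots-*.csv.gz"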
That took just 10 minutes to load 500M rows and about 60GB of data. Nice. The final result of that load job was all the WSPR data now waiting to be analysed by Aaron!
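If you want to sanity-check a load like this from the shell rather than the UI, a one-liner along these lines does the trick (again assuming the illustrative wspr.spots table name from above):

# Quick check on the row count of the freshly-loaded table
bq query --use_legacy_sql=false 'SELECT COUNT(*) AS total_spots FROM `wspr.spots`'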
Digestive
Duration: 1 minute
Tools like BigQuery are commoditising big data. It’s so easy and quick to get up and running nowadays, no matter the size of the data you want to analyse. Who wants to be mucking around with Hadoop/Hive clusters, or on-prem database instances, when you can use tools like these, seriously?
Stay tuned for part two of this post, when Aaron will tell us all about how his analysis went using BigQuery. Oh, and by the way, we’ve made the table in BigQuery public, so anyone can access/analyse it. You can find it here.