Setting the scene
A couple of months ago my colleague Graham Polley wrote about how we got started analysing 8+ years worth of of WSPR (pronounced ‘whisper’) data. What is WSPR? WSPR, or Weak Signal Propagation Reporter, is signal reporting network setup by radio amateurs for monitoring the ability for radio signals to get from one place to another. Why would I care? I’m a geek and I like data. More specifically the things it can tell us about seemingly complex processes. I’m also a radio amateur, and enjoy the technical aspects of communicating around the globe with equipment I’ve built myself.
I also find it mind blowing that you can be heard on the other side of the globe with WSPR using the same amount of power as a mobile phone, when I can’t make a phone call on some sections of my commute. In this post I’m going to pick up where Graham left off and dive further into what the WSPR data set has to offer, comparing BigQuery and my initial MySQL-based attempt along the way.
When I initially conceived of this idea I discounted BigQuery as being ‘too big’. It felt like using a sledge hammer to crack a walnut. So I fired up MySQL and began to load the data in so I could start querying. The WSPRnet website helpfully has the data going back to March 2008 downloadable for each month, with the current month being updated daily. I downloaded January’s data set on the 10th by which time it contained approximately 5.5million entries. This is where the waiting began. It had been several years since I last setup a MySQL installation, although it was relatively painless to get it installed under Ubuntu, I spent probably 10 minutes creating users and the initial table structure, by which point if you read Graham’s post he was already on the home stretch by now. Below is the structure I created in MySQL:
I then began to load the data. Given I was only using 10 days of data, I expected this to be a largely painless affair. Some reading of the manual brought me to MySQL’s data load command, which after another 20 minutes of manipulating had me loading the data directly from CSV into the table using the following:
load data concurrent local infile ‘/home/aaron/Downloads/wsprspots-2017-01.csv’
replace into table wsprDB.wsprspot fields terminated by ‘,’
The whole load process took around 10 minutes to complete for 5.5million records, multiplying this out for the entire dataset, which is 100x larger at 0.5 billion records, it would take around 1000 minutes or 17 hours to load the data. Whilst there were no time constraints, I wanted to start querying immediately, so I put loading the rest of the data on hold.
What can we find out
I started querying the data with some simple queries, for instance for a given WSPR station, what’s the average distance at which it can be heard for each frequency? This is readily achievable, you’ve probably already thought of the SQL, but here it is:
GROUP BY Band
And I waited, and waited and it took 2 minutes to return. The astute will notice, I’d forgotten to create indexes on the columns I was querying on – doh! Rectifying this mistake took around 20 minutes of mostly waiting for MySQL to index, before my query was returning in ~2 seconds. Better, but still not great given this is only a fraction if the data. This highlights one of the major advantages of BigQuery – you can simply load the data and start querying without having to consciously address such performance concerns.
Now we’ve warmed up, I’ve refreshed my memory of SQL, got the data loaded into both MySQL and BigQuery, and both datasets are queryable. It’s time to do some real “work” and see what we can find out from the data, comparing BigQuery to my initial MySQL based approach.
Working out where to transmit
One thing if we want our signal to be heard as far away as possible, we want to transmit on the right frequency/band. It’s no use transmitting where no-one is listening, or an frequency that doesn’t travel very far. We can readily find the average distance for each band from the data with the following SQL:
group by Band order by Band
Running this on MySQL took 97s on the reduced dataset compared to BigQuery which returned the results in a mere 2.5s for the complete dataset. Wow. BigQuery processed 100x the data in < 1/10 of the time. The results were downloaded from the console as csv and plotted using excel.
Finding the right time
As mentioned in the introduction, WSPR can be used to communicate around the world using only milliwatts of power. Although I should point out that this does happen at all times of day, and on all bands. So, let’s see if we can find out when we’re most likely to hear someone from the data. We could determine if there is a particular time of day when a large number of transmitters can be heard from a particular country. If this is true, it would suggest that we could also use simpler modes, such as voice or Morse code to communicate, rather than the limited messages which WSPR sends. Alternatively, we could also calculate the average distance at which transmitters from a particular country are received. Given I’m based in Australia, and all Australian radio amateur calls signs begin with the prefix ‘VK’, we can quickly count the number of other people having heard Australian signals by time of day with the following query:
SELECT hour(Timestamp) as hour,Band,avg(Distance)
where Call_Sign like ‘VK%’
group by hour,Band
order by Band,hour
Running this on Mysql took around 10 seconds to complete on the reduced data set compared to 2.5s on BigQuery for the full data set. A win for big query, especially given the MySQL dataset is 1/100 of the full set. And the result? See for yourself…
Here I’ve just pulled out the results for a single country frequency, which shows the best time of Day is about 11am GMT, or 10pm AEST. A bit late in the evening, but who needs sleep when you’ve got radio and BigQuery?
Countries by time:
But what if distance wasn’t all we cared about? Many of those in amateur radio like to talk, and so for them the hobby is all about talking to as many people from different countries and backgrounds as possible. Once more we can find out from the data which time of day is the best to contact as many countries as possible. As mentioned previously the first few characters of the call sign determines its country. We can therefore group on these, and count them when at each time of day each country can be heard, as I’ve done in the following query:
select count(f0_), hour from (SELECT substr(Reporter,0,2),hour(Timestamp) as hour,count(Spot_ID)
WHERE Call_Sign like ‘VK%’ and band=10
group by f0_,hour, order by f0_,hour) group by hour
As for the times both returned results after 3 seconds, which produce the following graph:
Which suggests that for the 30m band, after about 6am GMT, it doesn’t matter what time of day we can hear the same number of prefixes and potentially countries. It’s interesting that that MySQL was actually quicker in this query, although this is likely due to the where clause greatly reducing the number of rows MySQL had to process.
We have taken a dive through 8+ years of propagation data, to determine how, and when we can get the most people to hear us. We’ve also shown that even for relatively modest problems of only a few GB, BigQuery can outperform MySQL running on modest hardware. Given the ease with which the data can be loaded and manipulated, BigQuery has come out on top in almost every query I performed. BigQuery also gives you 1TB per month for free, which is more than enough for small side projects like this. By using BigQuery you can spend more time on the interesting bits of the project, rather than optimising poorly performing databases and queries. I will certainly continue to use BigQuery for the rest of this project which will include a web application to make querying and graphing data a breeze for others, but I’ll save that for another post.