For me, Jaws is hands down one of the best movies ever made. It’s almost 40 years old but it still looks fantastic and the acting is phenomenal. And it’s able to boast one of the most memorable ad-libs ever quipped by any actor on the big screen:
“You’re gonna need a bigger boat”
Roy Schneider brilliantly delivers this line after seeing the shark for the first time. It’s a fabulous scene and the one which immediately popped into my head when I first heard the high level requirements for a new project geared towards processing and reporting on huge amounts of data.
We were approached by a major Australian telecommunications provider to build an enterprise solution that was capable of loading in thousands of large CSV files from Google’s cloud storage (GCS) then store, process and aggregate the data. Finally, using the aggregated results, build a web app front-end that could report on and visualize it.
By no means an overly complex project request by any stretch of the imagination. In fact, pretty standard stuff really. However, it was upon hearing about the sheer volume of data that needed to be processed that made me think of that classic movie scene from Jaws.
Brody, Hooper and Quint needed a bigger boat. We were going to need something bigger, more scalable and most importantly something much, much faster than what we currently had available in terms of RDBMS and OLAP tools.
Big data is commonly modelled on the “3Vs” and is defined by Gartner as:
“..high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.”
Churning out these large files to GCS were multiple remote servers. Each CSV file was produced on the hour and around 1.5GB in size (uncompressed) and contained roughly 3 million lines/rows. Using those initial figures were then able to calculate the following averages:
- 1 hour – 3 million rows or 1.5GB.
- 1 day – 72 million rows (3mill * 24) or 36GB.
- 1 month – 2.1 billion rows (72mill * 30) or 1TB.
- 1 year – 25.2 billion rows (2.1bil * 12) or 12TB.
The reporting side of the solution would have to be able to process as much data that was historically available. The initial release needed to support a full year’s worth of historical data and then grow with each subsequent hour, day, month and year thereafter.
Sure, it wasn’t data volume on the scale of what some of the big players like IBM and Amazon handle but nevertheless I was starting to feel like Brody did when he first saw the shark.
Looking at the numbers we were confident that we ticked two out of the three “V” boxes in terms of volume and velocity but had only just smudged the variety one. Nevertheless, this was definitely shaping up to be a project worthy of being used in the same sentence as big data and at one point our client labelled it as a “game changer” within their industry.
The Right Tool
After letting those numbers sink in for a while, the team began discussing what tools and frameworks could be used in order to handle that amount of data with ease and at the same time be capable of delivering it up in a matter of seconds for the front-end to produce quick insights and reports.
We started by separating the project into two distinct parts and focused on each one individually:
- Load, store and aggregate the data.
- Query the aggregated data and visualize the results.
The team floated the idea of using Google’s BigQuery (BQ) to tackle the task at hand. None of us had used it before and we were curious to take it for a spin. We had also heard that BQ had been used in similar projects to ours and boasted impressive speeds along with being easy to use and came with a very short learning curve.
BQ comes with a convenient command line tool to import data from local CSV files. Thankfully, the tool also has the ability to pull in data from GCS directly without the need of downloading it to the client. This means greatly reduced costs in terms of bandwidth and storage as there are no intermediary steps needed for fetching and storing the data/files.
The team decided to do some more research into BQ and perform some initial tests to see if it would be able to handle what we had to chuck at it.
BQ was born out of a previous Google project called Dremel a few years back. There are plenty of good articles and tutorials out there which explore BQ in depth, discuss its Dremel roots and compare it to other OLAP tools, so I won’t go into too much detail here but rather just give a brief introduction to its basic functionality. This post is not intended to be a tutorial (or a comparison to other OLAP tools like Hadoop) but instead a case study of how we used BQ to successfully deliver this project to our client.
At a high level, BQ is an SaaS that allows businesses to store, process and analyse massive amounts of data in append only tables using a SQL type query language. It comes with plenty of options to access it such as the browser tool, a command line tool and the usual suspects of client libs for hooking into a decent REST API. There are no set up or licensing costs. Google make their money by charging for (these numbers are subject to change from Google. Check the pricing page for up to date pricing):
- Storage pricing is prorated per MB, per secon and at time of writing it is set at $0.026 per GB, per month.
- Loading data into BigQuery is free, with the exception of a small charge for streamed data. Streamed data is free until January 1, 2015. After January 1, 2015, $0.01 per 100,000 rows.
- Two types of pricing exist for querying the data – ‘on demand’ and ‘reserved capacity’
- ‘on demand’ uses a shared pool of resources across users with the first TB free of charge. Thereafter, it is charged at $5 per TB processed. You aren’t charged for queries that return an error, or for cached queries.
- ‘reserved capacity’ is geared toward larger, more consistent workloads. Reserved capacity pricing gives you the ability to reserve a certain amount of throughput each month for your queries for a set cost currently set at $20,000 per month for 5 GB per second. You can purchase one or more increments for each reservation, up to 50 GB per second.
- Two types of pricing exist for querying the data – ‘on demand’ and ‘reserved capacity’
There are the usual tiered pricing packages available which offer some decent savings and should be seriously considered if you are planning on using BQ with extremely high volumes of data. Full details of pricing can be found here.
Interactive -vs- Batch Queries
Two modes of query exist in BigQuery. They are ‘interactive’ and ‘batch’ queries. The difference is simple – interactive queries are executed immediately but have limits to the amount that can be run concurrently (20 at time of writing). Batch queries can be queued and are limitless in the number of queries that can be ‘batched’. If one fails it will be retried later, whereas an interactive query will fail immediately if the limits are hit. They have the same pricing. A good description of when you should use batch over interactive queries and the difference between can be found here.
When we initially tested using batch queries we experienced that the query would be in a ‘pending’ state for 30 minutes until it actually started executing on Google’s infrastructure. However, the result sets returned from batch queries come back just as fast as interactive queries.
Tip: It’s very easy to forget yourself when using BQ and unwittingly running up expensive bills are a real danger. During the exploratory/research phase be sure to only test on small data sets in order to keep costs down unless you have the discovery budget to play with (we did). Otherwise, you may find yourself landed with a hefty bill at the end of the month.
Queries also support the
LIMIT keyword so it’s quite easy to produce truncated tables when working with them. However, be warned – you will still be charged for the full amount of data you process and not just the number of rows you set using the
LIMIT keyword! It’s somewhat of an anti-pattern. For example:
SELECT * FROM [DataSet.a_really_big_table] LIMIT 1000
You can then write this out to a table using the GUI options in the web tool. If you want data from different parts of the table i.e. to get a better sample, you can also use the hashing algorithm as follows:
SELECT * FROM [DataSet.a_really_big_table] WHERE ABS(HASH(column1) % 10) == 1)
The Query Language Itself
BQ uses a very similar query language to SQL. If you have a decent enough understanding of SQL then you will be right at home using it. One extremely powerful feature of the language we discovered is the ability to use
REGEXP_EXTRACT. They allow you to use regular expressions directly in the the SQL to extract or match against Strings. That’s neat!
See the BigQuery query reference for full details and all features of the language.
Speed was always going to be the key factor in delivering this project successfully. We needed a tool that could grind through data sets in the billions territory without having to push the envelope and not have us heading off to grab a coffee(s) while we waited for results to come back.
When researching the performance of BQ one of the first articles we stumbled upon was a short yet brilliantly concise article from Ian Cartwright. His team had run some tests on BQ using a data set of 750 million and seen some decent results from it. We decided to run our own tests, laying down the gauntlet to BQ and using a data set with 1.5 billion rows. Things were about to get very interesting – could Google’s sales pitch of “being able to interactively analyse massive data sets with billions of rows” really do what it said it could?
It could and we were impressed. In fact, very impressed.
Even when not using cached results (cached results can be toggled on and off) we experienced consistent results in the 20-25 second range for grinding through our massive data set of 1.5 billion rows using relatively complex queries to aggregate the data.
Saving the results of the those aggregated queries to smaller tables (roughly 2 million rows), we then ran further queries on those tables to simulate the front-end requesting the results and visualizing it. We consistently got results back in 2-5 second range. Yes, this would do the job very nicely indeed.
It’s worth mentioning that Ian’s team experienced slow speeds when first querying the data:
“At the start, these were very slow, perhaps to do with some internal distributions not having happened inside of BigQuery. Initially these queries were taking a few minutes[…]”
We didn’t encounter this when running our tests. It could be possible that Google have made significant improvements/fixes over the last year or so (Ian had tested back in September 2012).
Loading & Storing The Data
Following the results of our tests the decision was made by our lead architect to run with BQ for this project. One of our first tasks was estimating the cost to the client of using it for both its storage and querying capabilities. This is made simple by BQ because the amount of data that the query will process is presented to users in the web tool prior to execution and also returned in the JSON response afterwards via the REST API (particularly good for ongoing monitoring of costs).
Once the client had signed off on the costs it was time to start loading in the data and building the aggregation tables. There are three options to load data into BQ:
- Retrieve file(s) in GCS which are either in CSV or JSON format.
- Upload a file(s) with the job which are in either CSV or JSON format.
- Use the streaming functionality
Each approach has its own limitations and quota policy. Users should read these carefully beforehand to make sure that there won’t be any nasty surprises down the line.
Because the files that we needed to load and process were already in GCS, choosing which loading option to use was of course trivial for us. The command line tools offers this functionality out of the box so we started with that.
Tip: When specifying files to load from GCS it is possible to use the wildcard (*) notation to tell BQ which files to scoop up from the cloud e.g. gs://[bucketid]/MyFile_12345_201309*
Although the command line tool worked perfectly fine, we wanted to have more control over how the files were loaded into BQ and we also wanted to build in some automatic logging/reporting on each load for tracing and reporting purposes.
Instead of using the command line tool to load in our data we decided to build our own customized Java application that could give us that level of control that we required. Hooking into the REST API, it was easy to build it and we had it up and running within a few days.
As a quick side note, the Java API is currently a release candidate and it shows at times. Although the samples and tutorials for using it are quite detailed and provide information to get you going (pay particular attention to authorization and oauth 2 sections), the documentation in the API itself is quite weak and feels like it was hastily thrown together (or most likely generated).
With our custom Java tool written and ready to roll we started to load in a year’s worth of data into BQ. Loading went off without a hitch and we sat back and watched as billions of rows started pumping into BQ from the cloud. Note: loading is free, although as already mentioned, you must pay for storage and there are strict quota policies in place.
Tip: Remember that BQ uses append only tables. You can only truncate/append. So if you make a mistake when loading and need to delete some mistakenly imported rows you’ll need to start all over with the load. However, we discovered a neat trick for not having to do this. When you query a table in BQ you have the option of saving the results to another table. What you can do is filter out any erroneous rows (or columns) you don’t want by querying the original table, save the results to a new table and delete the old one. Just make sure you have the “allow large results” option ticked. Finally, be aware that when you write the results of a query to a new table you do not have the option to use the cached results.
When you are using the client libraries (or the command line tool) the BQ the web tool is invaluable. It acts as a type of dashboard to your project(s) so you can monitor the status of all your jobs and queries in real time.
Querying The Data
Once we had the data loaded into BQ we were ready to start querying it and building our aggregation tables. This is all made possible from the web tool but once again we decided to extend our custom Java tool to allow us to have more control over the querying and aggregation functionality.
This was again very straight forward using the Java API. The queries were promptly assembled by one of our SQL gurus and plugged into the application. The tool issued each query to BQ concurrently, 20 of them in total and each query analyzing approx 750GB-1TB size tables). BQ was going to have to work hard.
Tip: If you are not writing the results of your queries to a new destination table you can use the cache function which will be faster and you won’t be billed for the query either. The description from Google explains it best:
“[Caching] attempts to use results from a previous run of this query, as long as the referenced tables are unmodified. If cached results are returned, you will not be billed for any usage. Caching cannot be enabled when a destination table is selected.”
Once issued to BQ we saw pretty much the same processing times for the queries that we had seen in our initial tests during our discovery phase. Going over the massive data sets (~2 billion rows and up to 1TB in size) was happening still in the 20-25 sec range and querying the aggregation tables was still sitting nicely in the 2-5 sec mark.
Our customized Java tool was built with automation in mind. When it was finished we deployed it to our application servers, wrapping it up in a bash script which in turn was kicked off by a cron job. The load feature was initially scheduled to run daily which would load in the previous day’s data into BQ (~72 million a day). Loading that amount of data took on average 5 minutes to flow into BQ from GCS. When loading in a month’s worth (~2 billion rows and 1TB) it took on average 30 minutes.
That’s seriously quick.
BQ isn’t perfect. It may not be for everyone when cost is a factor and it’s got a few minor annoyances like the unpolished web app, some minor limitations in the query language itself and the weak documentation of the APIs (at least in the Java one).
It’s also a Google product which may have some engineers making a bee line for the door given their previous track record for suddenly deprecating and pulling products from their shelves (case in point, Google’s old image charting library). Finally, and the one which might be a showstopper for some – it ain’t free.
That said, when compared to the costs encountered when setting up an traditional RDBMS (hardware, licensing, sysadmin staff etc.) and the ongoing license fees associated with that, using BQ may actually work out cheaper for some. The IaaS model that BQ is built on works very well considering that the off the shelf performance can run rings around a traditional setup.
It is incredibly fast and easy to use. Our data was already quite big (at least we like to think so) but we can’t help feeling that BQ has a lot more to offer and would be able to work with 100 times that amount without breaking a sweat. It’s got a short learning curve that allows for quick iterations and rapid product development. The SQL like query language is an easy transition to make for any engineer and much quicker and easier than using a MapReduce model.
Finally, we found support for BQ to be refreshingly lively and responsive. The Google support channel via Stackoverflow worked well for us. Questions we raised (like this one) were responded to in a reasonable time frame which is fantastic considering its not a paid support channel.
Warning: Spoiler alert!
In the final scenes of Jaws, Brody ends up in a battle to the death with that monstrous shark. It’s already killed off the other two members of the crew (*cough*). It’s wild, big and has a taste for human blood. But Brody ends up winning the fight because he finally used the right tool (well, in his case he technically used two tools but hey..) after some miserable attempts using metal cages, harpoons and air filled plastic barrels.
Big data is a bit like that shark. It’s also big and wild and unless you have the right tools at your disposal it can overwhelm and get the better of you. Having vast amounts of data at your finger tips is just the tip of the iceberg. It’s how you process and use that data for insights and discoveries that makes all the difference.
BQ is definitely capable of reeling in big data and it’s worth a look if you are considering using or switching to a OLAP tool for easily analyzing your big data.