To Partition or not to Partition

I have been using BigQuery for over 2 years now at Shine. I’ve found it to be a great tool that is both incredibly fast and able to handle some of our largest workloads. We are processing terabytes of data per day, and each day an extra billion records are added to the store.

Unfortunately, this growth is also increasing the cost of running our queries. BigQuery is extremely fast and highly parallel, but that comes at the cost of scanning, and paying for, every record of the columns you query. Without the indexes offered by conventional databases, each query needs a full table scan. Not only that, but queries slow down as the amount of data they touch grows. In this post I’ll talk about how we used table partitions to increase the performance of our queries and avoid those slowdowns.
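As a rough illustration of why partitioning matters for scan volume, here is a minimal sketch in Python. It is a simplified model, not BigQuery's actual billing logic: the `bytes_scanned` helper, the 50 GB per day figure and the two-year table are all invented for the example, and it assumes the query filters on the partitioning date so that only matching daily partitions are read.

```python
# Assumed, purely illustrative volume: every day adds the same amount of data.
BYTES_PER_DAY = 50 * 10**9  # 50 GB per day (hypothetical figure)


def bytes_scanned(table_days, query_days, partitioned):
    """Estimate bytes read by a query that filters on a date range.

    Without partitions the whole table is read regardless of the filter.
    With daily partitions only the days inside the filter are read.
    """
    if partitioned:
        days_read = min(query_days, table_days)
    else:
        days_read = table_days
    return days_read * BYTES_PER_DAY


# Two years of data, queried for the last seven days.
full = bytes_scanned(table_days=730, query_days=7, partitioned=False)
pruned = bytes_scanned(table_days=730, query_days=7, partitioned=True)

print(full // 10**9, "GB scanned without partitions")
print(pruned // 10**9, "GB scanned with daily partitions")
```

Under these assumptions the unpartitioned query reads the full two years while the partitioned one reads only the seven days it asked for, which is the effect the post sets out to exploit.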

Cloud Next 2017 – Shifting to the Cloud

Last week I had the privilege of attending Google Cloud Next in San Francisco. With Google finally due to open a datacenter in Australia this year, it was certain to be a great opportunity to learn about what’s next with Google Cloud.

From the moment I arrived at the baggage carousel at San Francisco International Airport, I was swamped with advertising for the conference. It was clear that Google is really pushing their cloud platform to as many developers as possible. This left me really excited for what was about to come over the following week. In this post I’m going to try and sum up how it all went.

Will Athena slay BigQuery?

*Updated on 16th December 2016 – see below

With the announcement of Amazon Athena at this year’s AWS re:Invent conference, I couldn’t help but notice its striking similarity to a rival cloud offering: Google’s BigQuery. Athena is a managed service that lets customers query objects stored in an S3 bucket. Unlike other AWS offerings such as Redshift, you only pay for the queries you run. There is no need to manage or pay for infrastructure that you may not be using all the time. All you need to do is define your table schema and reference your files in S3. This works in a similar way to BigQuery’s federated sources, which reference files in Google Cloud Storage.

Given this, I thought it would be interesting to compare the two platforms to see how they stack up against each other. I wanted to find out which one is the fastest, which one is more feature rich and which is the most reliable.

High availability, low latency streaming to BigQuery using an SQS Queue.

When you have a Big Data solution that relies on a high-quality, uninterrupted stream of data to meet the client’s expectations, you need a solution in place that is extremely reliable and has many points of fault tolerance. That all sounds well and good, but how exactly does that work in practice?

Let me start by explaining the problem. About 2 years ago our team was asked to spike a streaming service that could stream billions of events per month to Google’s BigQuery. The events were to come from an endpoint on our existing Apache web stack, and we would push them to BigQuery using an application written in PHP. We did exactly this. However, we found that requests to BigQuery were taking too long, which resulted in slow response times for users. So we needed to find a way to queue the events before sending them to BigQuery.

NoSQL in the cloud: A scalable alternative to Relational Databases

With the current move to cloud computing, the need to scale applications presents itself as a challenge for storing data. If you are using a traditional relational database you may find yourself working on a complex policy for distributing your database load across multiple database instances. This solution will often present a lot of problems and probably won’t be great at elastically scaling.

As an alternative you could consider a cloud-based NoSQL database. Over the past few weeks I have been analysing a few such offerings, each of which promises to scale as your application grows, without requiring you to think about how you might distribute the data and load.

Building a file explorer on top of Amazon S3

Amazon S3 is a simple file storage solution that is great for storing content, but how well does it stack up when used as the storage mechanism for a web-based file explorer?

Recently I was tasked with doing just this for a client. Furthermore, as opposed to the existing solution (which used CKFinder and synchronised copies of the files between our own server and the bucket), I needed to connect to an S3 bucket directly. In this post I’ll talk about how we did it.