Cloud Tag

Quite a while back, Google released two new features in BigQuery. One was federated sources. A federated source allows you to query external sources, like files in Google Cloud Storage (GCS), directly using SQL. The same release also gave us user-defined functions (UDFs). Essentially, a UDF allows you to ram JavaScript right into your SQL to help you perform the map phase of your query. Sweet! In this blog post, I'll go step-by-step through how I combined BigQuery's federated sources and UDFs to create a scalable, totally serverless, and cost-effective ETL pipeline in BigQuery.

Last week, Shine's very own Pablo Caif gave a presentation at GCP Next 2016 in San Francisco, Google’s largest annual cloud platform event. Pablo delivered an outstanding talk on the work Shine have done for Telstra, which involves building solutions on the GCP stack to manage and analyse their massive datasets. More specifically, the talk focused on two of Google’s core big data products: BigQuery and Cloud Dataflow.
One of the projects that I'm currently working on is developing a solution whereby millions of rows per hour are streamed in real time into Google BigQuery. This data is then available for immediate analysis by the business. The business likes this. It's an extremely interesting, yet challenging project. And we are always looking for ways of improving our streaming infrastructure. As I explained in a previous blog post, the rows that we stream to BigQuery are ad-impressions, which are generated by an ad-server (Google DFP). This was a great accomplishment in its own right, especially after optimising our architecture and adding Redis into the mix. Using Redis added robustness and stability to our infrastructure. But (there is always a but) we still need to denormalise the data before analysing it. In this blog post I'll talk about how you can use Google Cloud Pub/Sub to denormalise your data in real time before performing analysis on it.

My work commute

My commute to and from work on the train is on average 17 minutes. It's the usual uneventful affair, where the majority of people pass the time by surfing their mobile devices, catching a few Zs, or by reading a book. I'm one of those people who like to check in with family & friends on my phone, and see what they have been up to back home in Europe, while I've been snug as a bug in my bed. Stay with me here folks. But aside from getting up to speed with the latest events from back home, I also like to catch up on the latest tech news, and in particular what's been happening in the rapidly evolving cloud area. And this week, one news item in my AppyGeek feed immediately jumped off the screen at me. Google have launched yet another game-changing product into their cloud platform big data suite. It's called Cloud Dataproc.


It’s an established trend in the modern software world that if you want to get something done, you'll probably need to put together a web service to do it. In a mobile world, people expect data and services to be available everywhere. With the plethora of frameworks and technologies available for implementing a web service, it becomes a chore to try using anything beyond what's already familiar. But every now and then it’s an enjoyable experience to dive into something new and distinctly unfamiliar.

Back in June 2014, at the annual Google I/O in San Francisco, Google unveiled their newest and much-hyped cloud product, Cloud Dataflow. The demo they did that day, using a live Twitter feed to analyze supporter sentiment during the 2014 World Cup, got my mouth watering at the prospect of working with it. It looked downright freaking awesome, and I just couldn't wait to get my hands on it to take it for a spin.

The Kick-Off Meeting

It went something along the lines of:
  • Client: "We have a new requirement for you.."
  • Shiners: "Shoot.."
  • Client: "We'd like you to come up with a solution that can insert 2 million rows per hour into a database, deliver real-time analytics, and have some animated charts visualising it. And, it should go without saying, it needs to be scalable so we can ramp up to 100 million per hour."
  • Shiners: [inaudible]
  • Client: "Sorry, what was that?"
  • Shiners: [inaudible]
  • Client: "You'll have to speak up guys.."
  • Shiners: "Give us 6 weeks"
We delivered it in less than 4.
I was lucky enough to be one of the 6,000 cloud geeks that descended on Vegas last week to attend AWS re:Invent 2012. This inaugural AWS developer conference ran over 3 days. The first day was a bit of a warm-up day, with technical workshops and an AWS partner day. The two subsequent days had keynotes and deep-dive sessions covering all elements of the AWS ecosystem. In this post I'll cover what I saw during the three days I was there, and what had the biggest impact on me.