Big Data

It's no secret that when it comes to building data platforms, I've spent a fair amount of time building cloud provider native solutions, both in AWS and GCP. So when I saw that Snowflake had a conference showcasing all their latest and greatest features, I...

Recently, I got involved in a cloud application uplift project. As part of the project, we wanted to use an Aurora PostgreSQL database to offload all read activities from an RDS Oracle Database Standard Edition. Since this was such an exciting and educational experience, I wanted...

Introduction

It’s a simple question, often asked by project managers, data scientists, and quality engineers on every data engineering project when that first data source is ingested. How do we know the data that has been ingested into a data lake is accurate and error-free?

Building regulatory reporting in the cloud offers many benefits, some of which are unparalleled speed, massive automatic scalability, tight security, very low initial setup cost, flexible operational costs, minimal human intervention, massive drive towards automation, and much more....

No food reviews here I'm afraid

This year I was incredibly lucky to score a coveted ticket to YOW! in beautiful Melbourne. I was also asked to be a track host for a couple of sessions, so that was quite an honour too. This post is a whirlwind wrap-up of the conference, and only includes my favourite talks from the two-day event. If you're hoping to read detailed reviews of the coffee/food/WiFi/venue, then you'll be greatly disappointed (it was all great BTW).

Shine's very own Pablo Caif will be rocking the stage at the very first YOW! Data conference in Sydney. The conference will be running over two days (22-23 Sep) and is focused on big data, analytics, and machine learning. Pablo will give his presentation on Google BigQuery,...

One of the projects that I'm currently working on is developing a solution whereby millions of rows per hour are streamed in real time into Google BigQuery. This data is then available for immediate analysis by the business. The business likes this. It's an extremely interesting, yet challenging project. And we are always looking for ways of improving our streaming infrastructure. As I explained in a previous blog post, the data/rows that we stream to BigQuery are ad-impressions, which are generated by an ad-server (Google DFP). This was a great accomplishment in its own right, especially after optimising our architecture and adding Redis into the mix. Using Redis added robustness and stability to our infrastructure. But – there is always a but – we still need to denormalise the data before analysing it. In this blog post I'll talk about how you can use Google Cloud Pub/Sub to denormalise your data in real time before performing analysis on it.