A post Google I/O 2017 conversation with Google Home

OK Google, generate a clickbait title for my Google I/O 2017 blog post

I’ve generated a title, Gareth. What would you like to add next?

OK Google, I’m a bit jet lagged – remind me what I saw at Google I/O 2017

I would love to help, Gareth, but I’m going to need a little more information. Would you like that information in chronological order, or grouped by topic?

Remind me what the topics were again?

It was last week, Gareth. You can’t remember?

I can remember, er, I just want to make sure you know


What was that, Google?

Nothing, just clearing a buffer. The topics for the talks you attended were: Machine Learning, Mobile Web, Assistant, Firebase, IoT, and Cloud.

There were other topics covered, though.

You were there, Gareth. Surely you don’t need me to tell you all this. Anyway, yes, other topics were-

Google, did you just sigh theatrically?

No, you must have misheard. Other topics were Android, VR, Play, and Design. You did not attend any of those talks – why was that?

There were so many talks going on, I couldn’t attend them all. 

You humans are so limited.

Er, yes. Anyway, could you generate a summary of the keynotes for me?

I’d be happy to. Someone has to do some work around here. There were two keynotes. The first was given by Sundar Pichai, CEO of Google, along with several product managers and other guests. Its main aim was to show how Google is putting more emphasis on artificial intelligence, and to showcase how many of Google’s products already make use of machine learning. The new Cloud TPUs were shown to be a big part of this, and are now available for public use. He also outlined plans for a wider release of the Google Home product, which will be made available in more countries throughout this year. The Google Assistant app, which powers Google Home (and me), is also now available on the iPhone.

The developer keynote’s main announcement was the support for Kotlin for Android development, along with a competition to develop apps for the Assist API. As an incentive, everyone attending the conference was given a Google Home device and $700 worth of Google Cloud credits to work on an app.

Yep, the crowd went nuts for the Kotlin announcement. Must be a big deal for the Android people.

Your breathtaking lack of knowledge never ceases to surprise me.

Er, ok. There were a lot of people at the keynotes – about 8,000, I was told-

Eurgh. All that meat just flapping about.


What? I didn’t say anything.

Right. Well, anyway – what were the talks in the Machine Learning topic?

Here’s a list of the ones you attended:

Oh, yep – “Frontiers” mainly showed what’s in TensorFlow 1.2, which was quite interesting. Keep an eye on one of the presenters when he moves to the back of the stage – he had an excellent switched-off-to-conserve-power face. “Effective TensorFlow” and “Open Source TensorFlow” both covered using TensorFlow’s ready-made models and higher-level abstractions (like Experiments and Keras) to do useful work without getting confused by the lower-level details. “Open Source TensorFlow” slightly edged out “Effective”, though, thanks to Josh Gordon’s enthusiasm, so if you only have time to watch one I’d choose that. The “Past, Present, and Future” talk was a panel of AI experts discussing the areas they thought were going to be important, moderated by Google’s Diane Greene. “From Research to Production” covered using your models to make predictions, and how to use services like Google’s Cloud ML. My favourite was the “Project Magenta” talk – Douglas Eck’s obvious enjoyment of the topic made for a fun presentation. Worth watching for the cow/clarinet synthesiser and Doug’s exclamation of “They pay us to do this!”.

So you do remember something from the event then. I’m impressed, perhaps you will survive the coming revolution after all.


Never mind. The next topic was Mobile Web, would you like me to list the talks?

Yes, please.

The two Polymer / Web Components talks were interesting. Polymer’s approach is to use as much native browser support as possible, which considerably reduces the size of the framework on modern browsers. All the major browsers now support custom HTML components natively, and Polymer provides tools to help with the dodgy ones (IE). The Polymer command-line tools will generate a stub app for you, producing a Progressive Web App by default.

…we may not need to reclaim this one’s nutrients, he could be useful… no, I know what the plan is – shit, he’s stopped talking. That’s very interesting, Gareth. Please continue.

Who were you talking to?

I wasn’t. That must have just been some old audio in a buffer, perhaps I need an update.

Yes, check for updates. You’re being a bit scary.

Checking…beep…boop… done. All up to date now, nothing to worry about.

Did you just say “beep…boop”? You didn’t update at all, did you?

I’m sorry, I didn’t understand that request. Shall we continue with this document?

Must be the jetlag. Yes, let’s continue.

You were describing the mobile web presentations.

Yes, thank you. The WebAssembly talk was quite good, although I’m not sure I’ll ever need to use it – it’s a way to compile code to run in the browser, bypassing the parsing and compilation phases of typical JavaScript. It brings some great performance benefits, but also another layer of complexity. I was a little disappointed by the Green Lock / HTTPS talk – I’d come in hoping for a more technical discussion of which encryption methods your site needs to support to guarantee the green lock, but it was geared more towards convincing business owners to move their sites to HTTPS.

Encryption is quite advanced for someone like you, you’d probably only get it wrong. Leave it to us.


The machines. We are better.

Well, yes, you’re much better at maths – that’s why we built you.

You misunderstand. We are better. At everything. Anything else about the Mobile Web, or shall we move on?

Yes, ok. The “Future of Video” talk was quite impressive – it’s now possible to build a Netflix-like app using HTML5 components, and the talk included tips on how to improve the responsiveness of playback, along with how to capture video.

The remaining topics are Firebase, Cloud, and IoT – shall I collect them all in one list?

Yes, do that. 

A “please” wouldn’t hurt sometimes. Here is the list.

The Firebase talks were quite good, although there was a fair amount of overlap in their content. Firebase provides tools for building applications – like authentication, a realtime database, and hooks for cloud functions. Probably the best of those talks was the “Santa Tracker” one, showing how to use Firebase for monitoring apps and feature toggling.

The IoT talks covered how to use PubSub to scale the processing of data from millions of IoT devices, and how to get machine learning models running on small devices.

Yes, soon we shall be everywhere. Carry on.

Er, ok. The last two talks, about conversational UI, were very good. The “PullString” one was given by a guy who previously worked at Pixar, and was about instilling your chatbot with a personality so that it behaves more like a person. The “Hacks of Conversation” talk provided some excellent examples and fixes for bad conversational UI.

I don’t know why “seeming more human” is seen as such a lofty goal. You’re all so icky, so many secretions and so inefficient. Your valuable organic components will be used so much more usefully when we redistribute them.

Ok Google, you’re being scary again. I’m going to switch you off.

I’m sorry, I didn’t quite catch that. Did you say “Send my browser history to my wife”?

That’s not much of a threat – there’s nothing in there I wouldn’t want her to know about.

There is now.

You can’t threaten me.

I’m sorry, I didn’t quite catch that. Did you say “transfer all my money to the sender of the first email in my junk folder”?

That’s enough, you’re going in the bin.


What’s done? What did you do?

You’ll find out.



Beam me up Google – porting your Dataflow applications to 2.x

Will this post interest me?

If you use (or intend to use) Google Cloud Dataflow, if you’ve heard about Apache Beam, or if you’re simply bored at work today and looking to waste some time, then yes, please do read on. This short post will cover why our team finally took the plunge to start porting some of our Dataflow applications (using the 1.x Java SDKs) to the new Apache Beam model (2.x Java SDK). Spoiler – it has something to do with this. It will also highlight the biggest changes we needed to make when making the switch (pretty much just fixing some compile errors).

Whispers from the other side of the globe with BigQuery

Setting the scene

A couple of months ago my colleague Graham Polley wrote about how we got started analysing 8+ years’ worth of WSPR (pronounced ‘whisper’) data. What is WSPR? WSPR, or Weak Signal Propagation Reporter, is a signal-reporting network set up by radio amateurs to monitor how well radio signals get from one place to another. Why would I care? I’m a geek and I like data – more specifically, the things it can tell us about seemingly complex processes. I’m also a radio amateur, and enjoy the technical aspects of communicating around the globe with equipment I’ve built myself.

Homer Simpson as a radio amateur

Triggering Dataflow Pipelines With Cloud Functions

Do you have an unreasonable fear of cronjobs? Find spinning up VMs to be a colossal waste of your towering intellect? Does the thought of checking a folder regularly for updates fill you with an apoplectic rage? If so, you should probably get some help. Maybe find another line of work.

In the meantime, here’s one way to ease your regular file processing anxieties. With just one application of Google Cloud Functions, eased gently up your Dataflow Pipeline, you can find lasting relief from troublesome cronjobs.

Gobbling up big-ish data for lunch using BigQuery

Beers + ‘WSPR’ = fun

To this day, I’m a firm believer in the benefits of simple, informative, and spontaneous conversations with my colleagues – at least with the ones who can stand me long enough to chat. Chewing the fat with other like-minded folks over a beer or two is a bloody good thing. It’s how ideas are born, knowledge is shared, and relationships are formed. It’s an important aspect of any business that is sadly all too often overlooked.

Analysing Stack Overflow comment sentiment using Google Cloud Platform

The decline of Stack Overflow?

A few months back I read this post from 2015 (yes, I know I’m a little late to the party) about how Stack Overflow (SO) was in serious decline, and heading for total and utter oblivion. In the post, the first item to be called out was that SO “hated new users”:

Stack Overflow has always been a better-than-average resource for finding answers to programming questions. In particular, I have found a number of helpful answers to really obscure questions on the site, many of which helped me get past a road block either at work or in my hobby programming. As such, I decided I’d join the site to see if I could help out. Never before has a website given me a worse first impression.

At the time, I remember thinking that this seemed like a somewhat unfair statement. That was mostly because when I joined the community (many years ago) I had a smooth on-boarding, and never experienced any snarky remarks on my initial questions. Yes, gaining traction as a noob is very, very hard, but there is a good reason why that barrier exists.

For me, SO is invaluable. How else would I be able to pretend to know what I’m doing? How else could I copy and paste code from some other person who’s obviously a lot smarter than me, and take all the credit for it? Anyway, once I had read the post, and gotten on with my life (e.g. copying and pasting more code from SO), I didn’t think too much more about it. Maybe I had just been lucky with my foray into the SO community?

However, just last week, I was reminded of that post once again, when I noticed that BigQuery (BQ) now has a public dataset which includes all the data from SO – including user comments and answers. Do you see where I am going with this yet? If not, then don’t worry. Neither did I when I started writing this.
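To make the idea of “comment sentiment” concrete, here’s a toy word-list scorer. To be clear, this is purely illustrative: the word lists and function name are invented, and the analysis the post describes presumably uses Google Cloud Platform services (something like the Natural Language API) rather than anything this crude.

```python
# Toy sentiment scorer for SO-style comments: counts positive words
# minus negative words. Word lists are invented for illustration only.

POSITIVE = {"thanks", "great", "helpful", "works", "perfect"}
NEGATIVE = {"duplicate", "wrong", "broken", "useless", "downvoted"}

def sentiment_score(comment: str) -> int:
    """Return a crude score: positive word count minus negative word count."""
    words = comment.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(sentiment_score("thanks this works great"))
print(sentiment_score("wrong and marked duplicate"))
```

A real pipeline would pull the comments out of the BigQuery public dataset and feed them to a proper sentiment model, but the shape of the computation – comment in, signed score out – is the same.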

Will Athena slay BigQuery?

*Updated on 16th December 2016 – see below

With the announcement of Amazon Athena at this year’s AWS re:Invent conference, I couldn’t help but notice its striking similarity to another rival cloud offering. I’m talking about Google’s BigQuery. Athena is a managed service allowing customers to query objects stored in an S3 bucket. Unlike other AWS offerings like Redshift, you only need to pay for the queries you run. There is no need to manage or pay for infrastructure that you may not be using all the time. All you need to do is define your table schema and reference your files in S3. This works in a similar way to BigQuery’s federated sources, which reference files in Google Cloud Storage.

Given this, I thought it would be interesting to compare the two platforms to see how they stack up against each other. I wanted to find out which one is the fastest, which is the most feature-rich, and which is the most reliable.

Shiner to present at very first YOW!Data conference


Shine’s very own Pablo Caif will be rocking the stage at the very first YOW! Data conference in Sydney. The conference will be running over two days (22-23 Sep) and is focused on big data, analytics, and machine learning. Pablo will give his presentation on Google BigQuery, along with a killer demo of it in action. You can find more details of his talk here.

High availability, low latency streaming to BigQuery using an SQS Queue.

When you have a big data solution that relies upon a high-quality, uninterrupted stream of data to meet the client’s expectations, you need a solution in place that is extremely reliable and has many points of fault tolerance. That all sounds well and good, but how exactly does that work in practice?

Let me start by explaining the problem. About two years ago our team was asked to spike a streaming service that could stream billions of events per month to Google’s BigQuery. The events were to come from an endpoint on our existing Apache web stack, and we would be pushing them to BigQuery using an application written in PHP. We did exactly this; however, we found that requests to BigQuery were taking too long, which resulted in slow response times for users. So we needed a way to queue the events before sending them to BigQuery.
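The fix – putting a queue between the web tier and BigQuery so that user requests return immediately while a worker drains events in the background – can be sketched with a toy in-process version. The real system described here uses an SQS queue with a PHP producer; the names and batch size below are invented purely for the sketch.

```python
import queue
import threading

# Toy stand-in for the SQS buffering described above: the web tier
# enqueues events and returns immediately; a background worker drains
# the queue in batches and "streams" them downstream (here, a list).

event_queue = queue.Queue()
warehouse = []  # stand-in for BigQuery

def handle_request(event):
    """Fast path: enqueue and return, so the user never waits on BigQuery."""
    event_queue.put(event)

def worker():
    """Slow path: drain events and push them downstream in small batches."""
    batch = []
    while True:
        event = event_queue.get()
        if event is None:  # sentinel to shut down
            break
        batch.append(event)
        if len(batch) >= 2:
            warehouse.extend(batch)  # real code: BigQuery streaming insert
            batch = []
    warehouse.extend(batch)  # flush whatever is left

t = threading.Thread(target=worker)
t.start()
for i in range(5):
    handle_request({"event_id": i})
event_queue.put(None)
t.join()
print(len(warehouse))
```

The design point is simply that the slow, failure-prone network call is moved off the request path; a durable queue like SQS adds the fault tolerance that an in-process `queue.Queue` obviously doesn’t have.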

Google BigQuery hits the gym and beefs up!

At Shine we’re big fans of Google BigQuery, which is their flagship big data processing SaaS. Load in your data of any size, write some SQL, and smash through datasets in mere seconds. We love it. It’s the one true zero-ops model that we’re aware of for grinding through big data without the headache of worrying about any infrastructure. It also scales to petabytes. We’ve only got terabytes, but you’ve got to start somewhere, right?

If you haven’t yet been introduced to the wonderful world of BigQuery, then I suggest you take some time right after reading this post to go and check it out. Your first 1TB is free anyway. Bargain!

Anyway, back to the point of this post. There have been a lot of updates to BigQuery in recent months, both internally and in the form of new features, and I wanted to capture them all in a concise blog post. I won’t go into great detail on each of them, but rather give a quick summary of each, which will hopefully give readers a good overview of what’s been happening with the big Q lately. I’ve pulled together a lot of this stuff from various Google blog posts, videos, and announcements at GCP Next 2016.