Big Data Tag

Introduction

It’s a simple question, often asked by project managers, data scientists, and quality engineers on every data engineering project when that first data source is ingested: how do we know the data that has been ingested into a data lake is accurate and error-free?

Shine's very own Pablo Caif will be rocking the stage at the very first YOW! Data conference in Sydney. The conference will be running over two days (22-23 Sep) and is focused on big data, analytics, and machine learning. Pablo will give his presentation on Google BigQuery,...

One of the projects I'm currently working on involves streaming millions of rows per hour in real time into Google BigQuery, where the data is immediately available for analysis by the business. The business likes this. It's an extremely interesting, yet challenging, project, and we are always looking for ways to improve our streaming infrastructure. As I explained in a previous blog post, the rows that we stream to BigQuery are ad impressions, which are generated by an ad server (Google DFP). Getting that working was a great accomplishment in its own right, especially after optimising our architecture and adding Redis into the mix, which gave our infrastructure robustness and stability. But – there is always a but – we still need to denormalise the data before analysing it. In this blog post I'll talk about how you can use Google Cloud Pub/Sub to denormalise your data in real time before performing analysis on it.
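To make the idea concrete, here is a minimal sketch of the denormalisation step itself: enriching a raw ad-impression row with reference data so the row is self-contained before it lands in BigQuery. The field names, schema, and in-memory lookup table are all illustrative assumptions on my part — in the real pipeline the reference data would live in Redis and the raw rows would arrive via a Cloud Pub/Sub subscription.

```python
# Stand-in for reference data that would normally be cached in Redis,
# keyed by ad unit id. All names here are hypothetical.
AD_UNITS = {
    "au-101": {"campaign": "summer-sale", "advertiser": "Acme"},
}

def denormalise(impression: dict, ad_units: dict) -> dict:
    """Join a raw impression with its ad-unit attributes so the resulting
    row is self-contained and ready for analysis in BigQuery."""
    unit = ad_units.get(impression["ad_unit_id"], {})
    # Merge the lookup attributes into the raw row; unknown ad units
    # simply pass through unenriched.
    return {**impression, **unit}

row = denormalise({"ad_unit_id": "au-101", "ts": 1394866629}, AD_UNITS)
print(row["campaign"])  # the joined row now carries the campaign attributes
```

The point of doing this join before the insert, rather than at query time, is that BigQuery then only ever sees flat, analysis-ready rows.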


It’s an established trend in the modern software world that if you want to get something done, you'll probably need to put together a web service to do it. In a mobile world, people expect data and services to be available everywhere. With the plethora of frameworks and technologies available for implementing a web service, it becomes a chore to try anything beyond what's already familiar. But every now and then it’s an enjoyable experience to dive into something new and distinctly unfamiliar.

Shine is proud to have been awarded the Computerworld Data+ award for our work with Google BigQuery. The work is a great example of using innovative technology to deliver business benefit. You can see the write-up of the award here: http://www.computerworld.com/article/2598539/shine-technologies.html Congratulations to the Shiners...


The Kick-Off Meeting

It went something along the lines of:
  • Client: "We have a new requirement for you.."
  • Shiners: "Shoot.."
  • Client: "We'd like you to come up a solution that can insert 2 million rows per hour into a database and be able to deliver real-time analytics and some have animated charts visualising it. And, it should go without saying, that it needs to be scalable so we can ramp up to 100 million per hour."
  • Shiners: [inaudible]
  • Client: "Sorry, what was that?"
  • Shiners: [inaudible]
  • Client: "You'll have to speak up guys.."
  • Shiners: "Give us 6 weeks"
We delivered it in less than 4.


“In our (admittedly limited) experience, Redis is so fast that the slowest part of a cache lookup is the time spent reading and writing bytes to the network” - stackoverflow.com
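The pattern that quote alludes to – cache-aside lookups, where Redis absorbs most reads so the backing store is only hit on a miss – can be sketched in a few lines. A plain dict stands in for Redis here; with a real client you would swap the dict access for GET/SET calls. All names are illustrative, not from our actual codebase.

```python
def cached_lookup(key, cache, load_from_db):
    """Return the cached value for key, falling back to the backing
    store (and populating the cache) on a miss."""
    value = cache.get(key)
    if value is None:
        value = load_from_db(key)   # slow path: hit the backing store
        cache[key] = value          # populate so the next read is fast
    return value

calls = []
def slow_db_load(key):
    calls.append(key)               # track how often we pay the slow path
    return key.upper()

cache = {}
cached_lookup("ad-42", cache, slow_db_load)
cached_lookup("ad-42", cache, slow_db_load)
print(len(calls))  # 1 -- the second read was served from the cache
```

When the cache itself is as fast as Redis, the dominant cost of a hit really is just the network round-trip, which is what makes the quote above ring true.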

Can Databases Be Exciting To Work With?

It’s very rare for a project to get an engineer excited about the prospect of working with a database they've never worked with before, especially a relational one. That mainly boils down to the fact that the majority of them are clunky monstrosities: painfully slow, grim to integrate into our applications, and demanding that we piece together gnarly, over-engineered SQL statements.