OneAffiniti are a channel marketing company, creating and delivering email campaigns for over 2,500 partners across North America and Australia. They had built an initial data warehouse on AWS, pulling in data from multiple internal and 3rd party sources to gain insights on campaign performance and engagement –  but began to have scaling issues as the size of their data increased.

Shine were engaged to deliver a scalable and highly performant solution which would grow with OneAffiniti to provide marketing analytics and a single view of the customer. They also wished to use this data for future machine learning projects which include measuring partner retention.  

After performing a quick PoC demonstrating the speed of analysis of their data by running queries over some of their existing problem datasets, we were asked to build out a full solution on Google Cloud Platform.

Other client considerations included;

  • their internal team required  a solution which allowed them to query in SQL as it is what they are familiar with.
  • they wished to keep their existing BI tool – Domo

Their original environment on AWS using Aurora and MySQL RDS databases was starting to  have performance issues querying certain datasets.  An increasing amount of time was spent maintaining database performance rather than analysing data. Business intelligence dashboards were also connected to the existing data warehouse using Domo. A large amount of ETL was taking place at this presentation layer.

The following Google Cloud Platform technologies were used to build the new solution;

  • Google Cloud Storage – chosen because it is easy to trigger cloud functions off file events
  • Dataflow – serverless ETL, easy to use.
  • Cloud Functions – serverless and integrated with GCS and Dataflow
  • BigQuery – managed service, fast, petabyte scale analysis.
  • StackDriver – easy to get Google API calls logged to StackDriver and generate alerts/metrics on those (for example, getting alerts if a dataset’s permissions have changed).

RESTful API calls captured incremental data from 3rd party systems directly. API calls were written in scripts hosted on AppEngine Standard instance. This architecture takes advantage of Google AppEngine’s scheduler, storage, and scale-down-to-zero, thereby costing nothing when the APIs are not executed. For the little duration when APIs are executed, much traffic is not generated. The incrementally fetched data is very small in size and is also deleted from storage once it is successfully loaded into BigQuery. 

We used scheduling service on AppEngine to execute the API calls periodically. API calls return data in JSON format, and data so captured is directly inserted into corresponding BigQuery tables. Reference data is loaded directly into corresponding BQ tables via spreadsheets in CSV format, that is maintained by the business stakeholders.

DataFlow pipelines written in Python. These pipelines perform ETL on raw data and combine with data from different sources as well as reference data. Data thus transformed (split, joined, enriched) is finally loaded into output BigQuery tables. These tables contain data that has been transformed.

BigQuery views are used  to finally expose the desired subset of transformed data to external BI tools such as Domo using its BigQuery connector. Views also allow us to impose restrictions and permissions to prevent runaway costs that large queries can sneakily rake-up if not controlled. The business executes complex queries on BigQuery to perform exploratory data discovery, ad-hoc reporting, and general inspection of data to draw business insights via plain SQL.

Building out the full solution on Google Cloud Platform saw many customer benefits including;

  • Extremely fast analysis (and deeper view of customer can now be analysed as some performance constraints of old data warehouse are now removed).
  • Simplified ETL process with all business logic now in one place (stored as BigQuery views).
  • Little to no time spent managing database infrastructure to try and manage performance.
  • Significant operational cost savings.