20 Feb 2023 Is the data ingested in your data lake correct?
It’s a simple question, often asked by project managers, data scientists, and quality engineers on every data engineering project when that first data source is ingested. How do we know the data that has been ingested into a data lake is accurate and error-free?
With some ingestion mechanisms, assurance can be obtained by proxy. For instance, if we’re ingesting messages in real-time and we know what structure the message is in, then we can with integration tests, check that a pre-defined message ends up in a table as expect.
But what about if, as is often the case, we need to ingest a stream of changes from an on-premise relational database using CDC (Change Data Capture)?. How would we know if the data ever gets out of sync?
Let’s not get too bogged down in the details of how we’re moving the data. It could be AWS DMS , Oracle’s Golden Gate, or something like ReplicaDB . Each of these will validate the first time, when they complete the initial full load, but on-going becomes trickier.
A simple approach could be to write a script which checks the number of rows on each side of the replication, and check that they are the same. Whilst this is fine for event tables, where the primary action is append, it wouldn’t, for instance, catch if we’d missed an update to an existing row within the table. Nor would it catch if we had deleted 1 row, and added another, and both had been missed, because the total count of rows in the table would remain unchanged.
Your next thought might be to read all the data and do a row by row comparison, but this is far too time and resource consuming to be done on all but the smallest of data sets. So If you had something critical that must be in sync, you might consider it but otherwise it might be too expensive.
There are a couple of tools that aim to do data validation, such as datacompy and deequ, both of which are largely formalisations of the above, wrapped in pandas and spark respectively. While they are both super convenient, and will highlight exactly which rows are missing, they both need to process all of the data in each table twice (once for each side of the replication)
There is a halfway house. Rather than checking for exact matches on the data, we can check features of the data and verify that the data “looks similar” overall. Great Expectations can do just that out of the box. I’ve setup a quick test project with some databases, Great Expectations, Datacompy, and a few other tools for this exact purpose [https://github.com/aaronbrown1988/great_expectations_tests]. What Great Expectations does, and how I’ve configured it, is to generate a profile of the data on the source (in this case the mysql instance), and verify that it’s the same on the destination, alerting us to any failed expectations. This approach may not tell us specifically if we’ve missed an update, but it’s much cheaper than the full comparison. A lot of the queries can be pushed down to the underlying database so we don’t have to process all data again.
How do the different approaches compare with each other? I built a quick set of tests to find out. I created a short data set of 3 rows, with an index, a string field and a date field and populated them with some sample data. Here’s the exact table definition can be found here [https://github.com/aaronbrown1988/great_expectations_tests/blob/main/mysql_raw/test(1).sql] if you’re interested. I then simulated missing an update, an insert, a delete, and even missing a combination of an insert and a delete, to see which of the different scenarios each method was able to detect. The results are as follows:
|Missing Update (Text Field)||❌||✅|
|Missing Update (Date Field)||✅||✅|
|Missing Insert and Missing Delete||✅||✅|
|1-10% Error Rate in Text Field (10K records)||✅||✅|
|0.1 – 1% missed updates in Text Field (10 K records)||✅||✅|
|1 Million records, 1% missing updates (time taken in seconds)||33||74|
Overall the theoretically less accurate approach using Great Expectations seems to perform pretty well when compared the more memory / I/O intensive approach of using DataCompy. Great Expectations only failed to detect the missed update for the text field. This is likely because of the statistical approach that it takes, which means that it captures fewer metrics for varchar fields that don’t have an index or a constraint. The DataCompy approach can tell you exactly which row is wrong, and what is missing, at a cost of processing and comparing all of the records every time. This would be invaluable in trying to build a process to correct for the error on the fly.
Going forward I’ll build Great Expectations into any data engineering project to provide a level of assurance that the data that has been ingested is accurate. It will also provide a `canary in the coal mine‘ for any potential ingestion problems. I will be only falling back to Datacompy for critical datasets, or to resolve errors automatically.