Automatic Data Quality Check


Any ideas on how to automate data quality assessments for a data engineering pipeline?

I am concerned about:

  • Completeness: Difference in percentage between a measure computed in the source system and the same measure computed in the target system.
  • Freshness: Difference in minutes between the current timestamp and the last updated timestamp in the target system.
  • Latency: Difference in minutes between the last updated timestamp in the source system and the last updated timestamp in the target system.
  • Validity: Any measure computed in the target system that does not comply with a validation or business rule.
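
To make these definitions concrete, here is a minimal Python sketch of how the four metrics could be computed. The function names, the row format (a list of dicts) and the rule format (a dict of name to predicate) are all assumptions made for illustration; none of this is a C3 AI Suite or Ex Machina API.

```python
from datetime import datetime, timezone


def completeness_pct(source_value, target_value):
    """Percentage difference between the source measure and the target measure."""
    if source_value == 0:
        # Avoid division by zero: identical zeros count as fully complete.
        return 0.0 if target_value == 0 else float("inf")
    return abs(source_value - target_value) / abs(source_value) * 100.0


def minutes_between(earlier, later):
    """Elapsed minutes from `earlier` to `later` (both datetime objects)."""
    return (later - earlier).total_seconds() / 60.0


def freshness_minutes(target_last_updated, now=None):
    """Minutes between the current time and the target's last update."""
    if now is None:
        now = datetime.now(timezone.utc)
    return minutes_between(target_last_updated, now)


def latency_minutes(source_last_updated, target_last_updated):
    """Minutes between the source's last update and the target's last update."""
    return minutes_between(source_last_updated, target_last_updated)


def validity_violations(rows, rules):
    """Return (row_index, rule_name) for every rule that a row fails.

    `rules` maps a hypothetical rule name to a function row -> bool.
    """
    return [
        (i, name)
        for i, row in enumerate(rows)
        for name, rule in rules.items()
        if not rule(row)
    ]
```

Each function returns a plain number or list, so the results could be compared against whatever thresholds the stakeholders agree on before an alert email is sent.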

An Ex Machina template could then be applied to any data for analysis, with automatic emails sent to stakeholders about any issues.



@AlexXuConEd Great question!

@boaz.stossel is working down some of these items from the products side, and @aaileni has implemented a few data load alerting mechanisms, leveraging Splunk, I believe. They each may have more details to share with you on roadmapped features and pieces of this that you can implement in the near-term.



Thanks Bus!

Any ideas on how we can best utilize the current nodes/functionality to check some business rules of new data loads?

  • Flag a numeric column if any values are above/below a threshold that’s set either by the user or by a data-driven method (i.e., 2-3 standard deviations above/below the mean)

I am thinking I could just use the “Filter by SQL” node, then output the results into a CSV manually, for now.

Ideally, I would like to program these rules, run the process automatically, then check whether any records get flagged.
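
Outside of the Ex Machina nodes, the rule itself could be sketched in plain Python roughly like this. The function name `flag_outliers` and its parameters are hypothetical; bounds are either supplied by the user or derived from the data as the mean plus or minus `k` sample standard deviations.

```python
from statistics import mean, stdev


def flag_outliers(values, lower=None, upper=None, k=3.0):
    """Return (index, value) pairs that fall outside [lower, upper].

    Any bound left as None is derived from the data as mean -/+ k * stdev.
    Deriving bounds needs at least two values (a requirement of stdev).
    """
    if lower is None or upper is None:
        mu = mean(values)
        sigma = stdev(values)  # sample standard deviation
        if lower is None:
            lower = mu - k * sigma
        if upper is None:
            upper = mu + k * sigma
    return [(i, v) for i, v in enumerate(values) if v < lower or v > upper]


readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]

# Data-driven bounds: 2 standard deviations around the mean.
flagged_by_stats = flag_outliers(readings, k=2.0)

# User-set bounds.
flagged_by_user = flag_outliers(readings, lower=0, upper=11)
```

One caveat with the data-driven variant: an extreme value inflates the standard deviation it is being tested against, so on small batches a large `k` can let a real outlier through. In the sample above, the value 50 is flagged at `k=2.0` but not at `k=3.0`.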



To implement this programmatically, you first need to translate your data quality requirements into specific metrics and success thresholds. The C3 AI Suite lets you control the shape of your data throughout the Extract, Transform and Load process, meaning you could choose not to persist data you think isn’t “valid”. For data that is valid to persist, you may define simple and compound metrics, methods and/or MapReduce jobs on the set of relevant types. These jobs may be triggered by DFEs on invalidation, or on a schedule, depending on the requirements. You should probably develop reliable heuristics, though, as the cost of running these jobs on all available data may at times exceed the cost of persisting it.

To answer your specific question, check out the suite of ML nodes in Ex Machina. You could try K-means to cluster numerical and/or categorical fields, or even entities, for analysis. Good luck!
