Any ideas on how to automate data quality assessments for a data engineering pipeline?
I am concerned about:
- Completeness: the percentage difference between a measure computed in the source system and the same measure computed in the target system.
- Freshness: the difference in minutes between the current timestamp and the last-updated timestamp in the target system.
- Latency: the difference in minutes between the last-updated timestamp in the source system and the last-updated timestamp in the target system.
- Validity: any measure computed in the target system that does not comply with a validation or business rule.
An Ex Machina template could then be applied to any dataset for analysis, with automatic emails sent to stakeholders whenever issues are detected.
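To make the four checks concrete, here is a minimal sketch in Python. All names, values, and thresholds are hypothetical placeholders; in practice the `source` and `target` snapshots would come from queries against the two systems, and the alerting step would feed an email notifier instead of a print.

```python
from datetime import datetime, timezone

# Hypothetical snapshots; in practice these come from queries
# against the source and target systems.
source = {
    "row_count": 1000,
    "last_updated": datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
}
target = {
    "row_count": 990,
    "last_updated": datetime(2024, 1, 1, 11, 45, tzinfo=timezone.utc),
}

def completeness_pct(source_measure, target_measure):
    """Percentage difference between a source and target measure."""
    return abs(source_measure - target_measure) / source_measure * 100

def freshness_minutes(target_last_updated, now=None):
    """Minutes between now and the target's last-updated timestamp."""
    now = now or datetime.now(timezone.utc)
    return (now - target_last_updated).total_seconds() / 60

def latency_minutes(source_last_updated, target_last_updated):
    """Minutes between the source and target last-updated timestamps."""
    return (source_last_updated - target_last_updated).total_seconds() / 60

def validity_violations(rows, rule):
    """Rows failing a validation/business rule (rule returns True if valid)."""
    return [r for r in rows if not rule(r)]

# Example thresholds an automated assessment might alert on
# (placeholder values; tune per dataset).
issues = []
if completeness_pct(source["row_count"], target["row_count"]) > 0.5:
    issues.append("completeness")
if latency_minutes(source["last_updated"], target["last_updated"]) > 10:
    issues.append("latency")

print(issues)  # -> ['completeness', 'latency']
```

Parameterizing the metric queries, rules, and thresholds per dataset (e.g. in a config file) is what lets one template cover the whole pipeline.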