Data Quality

The dimensions (accuracy, completeness, consistency, timeliness, validity, uniqueness) that make data trustworthy — and how to measure and fix them.

The six canonical dimensions (DAMA): Accuracy — does the data correctly represent the real-world entity? Completeness — are all required attributes populated? Consistency — does the same fact agree across systems? Timeliness — is the data available within the expected window? Validity — does the data conform to defined formats/domains? Uniqueness — is the same entity represented only once?

Measurement approach: each dimension yields metrics (e.g., completeness % = non-null values in a column / total rows × 100; uniqueness = 1 - duplicate_rate). Aggregate them into a scorecard per dataset and per critical business element. Set acceptance thresholds, e.g., customer.email completeness ≥ 98%.
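The two formulas above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the `customers` rows, the function names, and the rule that an empty string counts as missing are all assumptions made for the example.

```python
from typing import Any

Row = dict[str, Any]


def completeness(rows: list[Row], column: str) -> float:
    """Fraction of rows where `column` is populated (assumption: None and '' count as missing)."""
    if not rows:
        return 0.0
    populated = sum(1 for row in rows if row.get(column) not in (None, ""))
    return populated / len(rows)


def uniqueness(rows: list[Row], key: str) -> float:
    """1 - duplicate_rate, where every repeat of a key after its first occurrence is a duplicate."""
    if not rows:
        return 1.0
    distinct = len({row.get(key) for row in rows})
    duplicate_rate = (len(rows) - distinct) / len(rows)
    return 1 - duplicate_rate


def scorecard(rows: list[Row], checks: list[tuple]) -> dict[str, dict]:
    """Evaluate (name, metric_fn, column, threshold) checks into a pass/fail report."""
    report = {}
    for name, metric_fn, column, threshold in checks:
        value = metric_fn(rows, column)
        report[name] = {"value": value, "threshold": threshold, "passed": value >= threshold}
    return report


# Hypothetical sample: one missing email, one repeated customer_id.
customers = [
    {"customer_id": 1, "email": "a@example.com"},
    {"customer_id": 2, "email": None},
    {"customer_id": 3, "email": "c@example.com"},
    {"customer_id": 3, "email": "c@example.com"},
]

checks = [
    ("customer.email completeness", completeness, "email", 0.98),
    ("customer.customer_id uniqueness", uniqueness, "customer_id", 1.0),
]
report = scorecard(customers, checks)
```

On this sample both metrics come out at 0.75, so both checks fail their thresholds. In practice the same ratios would usually be computed in SQL or a dataframe library; the shape of the scorecard is the point here.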

Testing tooling: dbt tests (unique, not_null, accepted_values, relationships) for SQL transformations; Great Expectations for Python pipelines (~300 expectation types covering nulls, ranges, distributions, column existence); Soda (YAML-first, batch + streaming); Monte Carlo / Bigeye / Anomalo for ML-driven anomaly detection on freshness, volume, schema, distribution drift.
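To make the four dbt generic tests concrete, here is a plain-Python approximation of what each one checks. It does not use dbt or any of the tools named above; the function bodies, the `orders` and `customers` rows, and the choice to skip nulls in everything except `not_null` are assumptions meant to mirror the usual SQL behavior. As with dbt, a check "passes" when it returns no failures.

```python
from collections import Counter


def unique(rows, column):
    """Return values that appear more than once (nulls ignored)."""
    counts = Counter(r[column] for r in rows if r[column] is not None)
    return sorted(value for value, n in counts.items() if n > 1)


def not_null(rows, column):
    """Return rows where the column is null."""
    return [r for r in rows if r[column] is None]


def accepted_values(rows, column, values):
    """Return rows whose non-null value is outside the allowed set."""
    allowed = set(values)
    return [r for r in rows if r[column] is not None and r[column] not in allowed]


def relationships(child_rows, column, parent_rows, parent_column):
    """Return child rows whose non-null foreign key has no matching parent."""
    parent_keys = {p[parent_column] for p in parent_rows}
    return [r for r in child_rows if r[column] is not None and r[column] not in parent_keys]


# Hypothetical sample data with one violation per test.
customers = [{"customer_id": 10}, {"customer_id": 11}]
orders = [
    {"order_id": 1, "customer_id": 10, "status": "placed"},
    {"order_id": 2, "customer_id": 11, "status": "shipped"},
    {"order_id": 2, "customer_id": 99, "status": "lost"},
    {"order_id": 3, "customer_id": None, "status": "placed"},
]

failures = {
    "unique(order_id)": unique(orders, "order_id"),
    "not_null(customer_id)": not_null(orders, "customer_id"),
    "accepted_values(status)": accepted_values(orders, "status", ["placed", "shipped", "returned"]),
    "relationships(customer_id)": relationships(orders, "customer_id", customers, "customer_id"),
}
```

Every entry in `failures` is non-empty for this sample, which is the deterministic, "known expectation" style of check that the tools above automate at scale.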

Root-cause fixing: quality issues surface at the analytics layer but originate upstream. Classic pattern: a mobile app starts sending country codes as 'FR' (ISO 3166-1 alpha-2) instead of 'FRA' (alpha-3) → breaks reports. Fix = contract at the producer boundary (a schema registry + validation + rejection or dead-letter queue), not a SQL CASE statement in the warehouse.
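A minimal sketch of validation at the producer boundary, assuming an in-memory stand-in for the dead-letter queue. The event shape, the `Ingestor` class, and the three-code allow-list are hypothetical; a real setup would validate against a schema registry and publish rejects to an actual queue.

```python
from dataclasses import dataclass, field

# Illustrative subset of ISO 3166-1 alpha-3 codes (assumption, not the full list).
ALPHA3_CODES = frozenset({"FRA", "DEU", "USA"})


def validate(event: dict) -> list[str]:
    """Return a list of contract violations; an empty list means the event is valid."""
    errors = []
    if not isinstance(event.get("user_id"), int):
        errors.append("user_id must be an integer")
    country = event.get("country")
    if country not in ALPHA3_CODES:
        errors.append(f"country {country!r} is not an accepted alpha-3 code")
    return errors


@dataclass
class Ingestor:
    """Accepts valid events and routes invalid ones to a dead-letter list."""
    accepted: list = field(default_factory=list)
    dead_letter: list = field(default_factory=list)

    def ingest(self, event: dict) -> bool:
        errors = validate(event)
        if errors:
            # Keep the original payload with the reasons so the producer can debug it.
            self.dead_letter.append({"event": event, "errors": errors})
            return False
        self.accepted.append(event)
        return True


ingestor = Ingestor()
ok = ingestor.ingest({"user_id": 1, "country": "FRA"})
rejected = ingestor.ingest({"user_id": 2, "country": "FR"})  # alpha-2 code from the new app build
```

The bad event never reaches the warehouse, and the dead-letter entry records why, so the fix happens in the producer rather than in downstream SQL.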

Data contracts: formalize producer/consumer expectations as explicit, versioned artifacts (schema + semantic constraints + SLA). When a producer proposes a breaking change, the downstream contract tests fail in CI, forcing coordination before anything breaks in production.
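The CI gate described above can be sketched as a compatibility check between two versions of a contract. The dictionary layout, the field names, and the three rules treated as breaking (removed field, changed type, required field made optional) are assumptions for illustration, and the sketch covers only the schema part, not semantic constraints or SLAs.

```python
def breaking_changes(current: dict, proposed: dict) -> list[str]:
    """Compare two contract versions and list changes that would break existing consumers."""
    problems = []
    for name, spec in current["fields"].items():
        new_spec = proposed["fields"].get(name)
        if new_spec is None:
            problems.append(f"field removed: {name}")
            continue
        if new_spec["type"] != spec["type"]:
            problems.append(f"type changed: {name} {spec['type']} -> {new_spec['type']}")
        if spec["required"] and not new_spec["required"]:
            problems.append(f"field no longer required: {name}")
    return problems


# Hypothetical contract for an orders feed.
contract_v1 = {
    "version": "1.0.0",
    "fields": {
        "order_id": {"type": "int", "required": True},
        "amount": {"type": "decimal", "required": True},
        "country": {"type": "string", "required": True},
    },
}

# Additive change: a new optional field. Existing consumers are unaffected.
contract_v1_1 = {
    "version": "1.1.0",
    "fields": {**contract_v1["fields"], "coupon": {"type": "string", "required": False}},
}

# Breaking change: `amount` changes type and `country` disappears.
contract_v2 = {
    "version": "2.0.0",
    "fields": {
        "order_id": {"type": "int", "required": True},
        "amount": {"type": "float", "required": True},
    },
}

additive = breaking_changes(contract_v1, contract_v1_1)
breaking = breaking_changes(contract_v1, contract_v2)
```

A CI step could call `breaking_changes` on the producer's pull request and exit non-zero when the list is non-empty, which is what forces the conversation with consumers before the change ships.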

Observability vs testing: tests encode known expectations you wrote (deterministic). Observability uses historical baselines to flag unknown-unknowns (row count dropped 40% overnight, NULL rate on `amount` jumped). Most mature setups do both.
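The baseline idea can be sketched with a simple z-score check over recent history. This is a deliberately naive stand-in for what commercial observability tools do: the threshold of 3 standard deviations, the seven-day history, and the sample numbers are all assumptions.

```python
from statistics import mean, stdev


def is_anomalous(history: list[float], today: float, z_threshold: float = 3.0) -> bool:
    """Flag `today` if it lies more than `z_threshold` sample standard deviations from the mean.

    Requires at least two historical points, since `stdev` is undefined otherwise.
    """
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat history: any change at all is unexpected.
        return today != mu
    return abs(today - mu) / sigma > z_threshold


# Hypothetical daily row counts for one table over the past week.
row_counts = [1000, 1020, 980, 1010, 990, 1005, 995]

normal_day = is_anomalous(row_counts, 1008)   # within normal variation
dropped_day = is_anomalous(row_counts, 600)   # a 40% overnight drop

# Hypothetical daily NULL rates on `amount`.
null_rates = [0.010, 0.012, 0.009, 0.011, 0.010]
null_spike = is_anomalous(null_rates, 0.25)
```

Nobody wrote a rule saying "row count must exceed 900"; the alert comes from the data's own history, which is what distinguishes this from a deterministic test.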

Grounded on https://www.dama.org/
