Data Lineage
The map of where data comes from and where it goes: upstream sources, transformations, downstream consumers. Essential for trust, impact analysis, and compliance.
Lineage is a directed graph: nodes are data artifacts (tables, views, dashboards, ML features, files), edges are producer→consumer relationships carrying transformation metadata. Three standard granularities: **dataset-level**, **column-level**, **row-level** (rarest — needed for forensic privacy investigations).
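To make the graph model concrete, here is a minimal sketch of a dataset-level lineage graph in Python. The class name `LineageGraph`, the artifact names, and the metadata keys (`job`, `kind`) are illustrative assumptions, not part of any particular tool.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class LineageGraph:
    """Directed graph: nodes are artifact names, edges are producer -> consumer."""

    # downstream[producer][consumer] -> transformation metadata for that edge
    downstream: dict = field(default_factory=lambda: defaultdict(dict))
    # upstream[consumer][producer] -> the same metadata, indexed the other way
    upstream: dict = field(default_factory=lambda: defaultdict(dict))

    def add_edge(self, producer: str, consumer: str, **metadata) -> None:
        self.downstream[producer][consumer] = metadata
        self.upstream[consumer][producer] = metadata


# Hypothetical artifacts: a raw table, a staging table, a mart, and a dashboard.
graph = LineageGraph()
graph.add_edge("raw.orders", "staging.orders", job="stg_orders", kind="dataset")
graph.add_edge("staging.orders", "marts.revenue", job="fct_revenue", kind="dataset")
graph.add_edge("marts.revenue", "dashboard.exec_kpis", job="bi_refresh", kind="dataset")

print(sorted(graph.downstream["staging.orders"]))  # direct consumers
print(sorted(graph.upstream["marts.revenue"]))     # direct producers
```

Indexing edges in both directions keeps "who feeds this?" and "who reads this?" equally cheap, which matters because trust questions walk upstream while impact questions walk downstream.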
**OpenLineage** is the emerging open standard: a runtime-emitted JSON event describing a job run (inputs, outputs, facets like schema/columnLineage/quality). Integrations exist for Airflow, Spark, dbt, and Flink. A storage backend (Marquez) or a catalog (DataHub, OpenMetadata) consumes the events.
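As a rough illustration of what such an event looks like, the sketch below builds a run event as a plain Python dict. The top-level field names reflect the OpenLineage specification as commonly documented, but treat the exact shape as an assumption and check it against the spec. The producer URI, namespaces, and dataset names are hypothetical, and required details such as `schemaURL` and the per-facet `_producer`/`_schemaURL` keys are omitted for brevity.

```python
import json
import uuid
from datetime import datetime, timezone

PRODUCER = "https://example.com/my-pipeline"  # hypothetical producer URI


def make_run_event(event_type, job_name, inputs, outputs):
    """Build a simplified OpenLineage-style run event (not spec-complete)."""
    return {
        "eventType": event_type,  # e.g. "START" or "COMPLETE"
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": PRODUCER,
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": "example_namespace", "name": job_name},
        "inputs": [{"namespace": "example_warehouse", "name": n} for n in inputs],
        "outputs": [
            {
                "namespace": "example_warehouse",
                "name": n,
                # Facets attach optional metadata; a schema facet is sketched here.
                "facets": {"schema": {"fields": [{"name": "amount_usd", "type": "DECIMAL"}]}},
            }
            for n in outputs
        ],
    }


event = make_run_event("COMPLETE", "fct_revenue", ["staging.orders"], ["marts.revenue"])
print(json.dumps(event, indent=2))
```

In practice the integrations listed above emit these events for you; hand-building one is mainly useful for custom jobs that no integration covers.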
Automatic capture mechanisms: (a) Engine metadata and query parsing: the SQL or transformation engine records lineage when it compiles or runs a query (Snowflake ACCESS_HISTORY, BigQuery INFORMATION_SCHEMA.JOBS, dbt manifest.json); (b) Runtime interception: wrappers around Spark / Pandas / Airflow that emit events on read/write; (c) Log parsing: post-hoc analysis of query logs, lossy, a last resort.
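For mechanism (a), a dbt manifest is one of the easier sources to mine. The sketch below assumes the manifest's `nodes` mapping, where each node lists its parents under `depends_on.nodes`; the project and model names are made up, and a real manifest carries many more keys.

```python
def edges_from_manifest(manifest: dict) -> list[tuple[str, str]]:
    """Return sorted (producer, consumer) edges from a dbt-style manifest dict."""
    edges = []
    for node_id, node in manifest.get("nodes", {}).items():
        for parent_id in node.get("depends_on", {}).get("nodes", []):
            edges.append((parent_id, node_id))
    return sorted(edges)


# Hypothetical, heavily trimmed manifest. A real one would typically be read with
# json.load() from the project's target/manifest.json.
manifest = {
    "nodes": {
        "model.shop.stg_orders": {"depends_on": {"nodes": ["source.shop.raw.orders"]}},
        "model.shop.fct_revenue": {"depends_on": {"nodes": ["model.shop.stg_orders"]}},
    }
}

for producer, consumer in edges_from_manifest(manifest):
    print(f"{producer} -> {consumer}")
```

This yields dataset-level edges only; column-level detail needs a parser over the compiled SQL.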
Column-level specifics: harder to extract because SQL dialects vary, queries nest (CTEs, subqueries, JSON extractions), and some transformations are opaque (SELECT *, Python UDFs, stored procedures). Reaching roughly 95% column-level coverage typically requires a modern parser (sqllineage, OpenLineage's SQL parser) plus hand-curated mappings for the gap.
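One way to picture the "parser plus curated mappings" approach is sketched below. The parser output is hard-coded here rather than produced by a real parser, and every column name, plus the helpers `merge_column_lineage` and `coverage`, is a hypothetical illustration.

```python
def merge_column_lineage(parsed: dict, curated: dict) -> dict:
    """Overlay hand-curated mappings on parser output; curated entries win."""
    merged = dict(parsed)
    merged.update(curated)
    return merged


def coverage(all_columns: list, lineage: dict) -> float:
    """Fraction of columns that have at least one known upstream column."""
    covered = [c for c in all_columns if lineage.get(c)]
    return len(covered) / len(all_columns)


all_columns = [
    "marts.revenue.order_id",
    "marts.revenue.amount_usd",
    "marts.revenue.region",
    "marts.revenue.risk_score",  # produced by an opaque Python UDF
]

# Stand-in for parser output: target column -> upstream columns.
parsed = {
    "marts.revenue.order_id": ["staging.orders.id"],
    "marts.revenue.amount_usd": ["staging.orders.amount", "staging.fx.rate"],
    "marts.revenue.region": [],  # parser could not resolve a SELECT *
}

# Hand-curated mapping that fills one of the gaps.
curated = {"marts.revenue.region": ["staging.customers.region"]}

merged = merge_column_lineage(parsed, curated)
print(coverage(all_columns, parsed))  # coverage from the parser alone
print(coverage(all_columns, merged))  # coverage after curation
```

Tracking coverage as a number makes the remaining gap visible, so curation effort can go to the columns that matter most.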
Impact analysis workflow: before deprecating or changing a source column, query the graph for all transitively dependent artifacts, classify them by criticality (a dashboard executives see daily vs. an ad-hoc query from six months ago), and notify or coordinate with their owners. Mature orgs block the change in CI if downstream tests break.
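The first two steps of that workflow can be sketched as a breadth-first traversal followed by a grouping step. The adjacency map, the criticality labels, and all artifact names below are hypothetical; a real setup would query them from the lineage store or catalog.

```python
from collections import deque

# Hypothetical downstream adjacency: artifact -> direct consumers.
DOWNSTREAM = {
    "staging.orders.amount": ["marts.revenue.amount_usd"],
    "marts.revenue.amount_usd": ["dashboard.exec_kpis", "adhoc.q3_export"],
    "dashboard.exec_kpis": [],
    "adhoc.q3_export": [],
}

# Hypothetical criticality labels maintained by artifact owners.
CRITICALITY = {"dashboard.exec_kpis": "high", "adhoc.q3_export": "low"}


def transitive_dependents(start: str) -> set:
    """Breadth-first walk collecting every artifact downstream of `start`."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for child in DOWNSTREAM.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def impact_report(column: str) -> dict:
    """Group all transitive dependents of `column` by criticality label."""
    report = {}
    for artifact in transitive_dependents(column):
        level = CRITICALITY.get(artifact, "unclassified")
        report.setdefault(level, []).append(artifact)
    return {level: sorted(names) for level, names in report.items()}


print(impact_report("staging.orders.amount"))
```

Artifacts that land in the "unclassified" bucket are a useful signal in their own right: they have dependents but no owner-assigned priority.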
GDPR use case: for a data-subject access or erasure request, lineage shows every location a person's data has flowed to, so you can fulfil the request completely. Without it, you're guessing, and regulators don't love guessing.
Common pitfalls: (i) manual documentation: rots within weeks; (ii) lineage silos: warehouse lineage in tool A, ML lineage in tool B, dashboard lineage in tool C, and no one joins them; (iii) captured but never consumed: nice graphs nobody queries.
Grounded on https://openlineage.io/