Data Lineage
The map of where data comes from and where it goes: upstream sources, transformations, downstream consumers. Essential for trust, impact analysis, and compliance.
Data lineage is the family tree of your data. For any table, column, or dashboard, you can answer: **where did this come from?** (upstream) and **what depends on it?** (downstream).
Why does it matter? There are three concrete moments when you need it. **Trust**: when someone questions a number, you trace it back to the source. **Impact analysis**: before you change a column in the source, you know which dashboards will break. **Compliance**: for GDPR, you prove that personal data flows stay within approved systems.
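To make the upstream and downstream questions concrete, here is a minimal sketch that treats dataset-level lineage as a directed graph and walks it in both directions. The dataset names and the `DOWNSTREAM` mapping are hypothetical; a real catalog would supply this graph.

```python
from collections import deque

# Hypothetical dataset-level lineage: each key feeds the datasets in its list.
DOWNSTREAM = {
    "crm.customers": ["warehouse.customers"],
    "shop.orders": ["warehouse.orders"],
    "warehouse.customers": ["marts.revenue_by_customer"],
    "warehouse.orders": ["marts.revenue_by_customer"],
    "marts.revenue_by_customer": ["dashboard.revenue"],
}


def invert(edges):
    """Flip producer -> consumers into consumer -> producers."""
    flipped = {}
    for source, targets in edges.items():
        for target in targets:
            flipped.setdefault(target, []).append(source)
    return flipped


def reachable(edges, start):
    """Breadth-first walk returning every node reachable from start."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in edges.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


UPSTREAM = invert(DOWNSTREAM)

# Impact analysis: what could break if shop.orders changes?
impacted = reachable(DOWNSTREAM, "shop.orders")

# Trust: which datasets feed the revenue dashboard?
origins = reachable(UPSTREAM, "dashboard.revenue")

print(sorted(impacted))
print(sorted(origins))
```

The same traversal serves both questions; only the direction of the edges changes.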
Two levels of granularity: **dataset-level** (the customers table comes from the CRM) is enough for high-level navigation; **column-level** (this revenue number is the sum of orders.amount after a currency conversion) is required for root-cause analysis and for GDPR proof.
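As an illustration of column-level granularity, the sketch below records, for each output column, its input columns and the transformation applied, then traces a revenue figure back to its raw sources. All table and column names, and the `COLUMN_LINEAGE` structure itself, are assumptions made for the example.

```python
# Hypothetical column-level lineage: output column -> (input columns, transformation).
COLUMN_LINEAGE = {
    "marts.revenue.revenue_eur": (
        ["staging.orders.amount_eur"],
        "SUM",
    ),
    "staging.orders.amount_eur": (
        ["raw.orders.amount", "raw.fx_rates.rate_to_eur"],
        "amount * rate_to_eur",
    ),
}


def trace(column, depth=0):
    """Return (depth, column, transformation) rows from a column back to its sources."""
    inputs, transformation = COLUMN_LINEAGE.get(column, ([], "source"))
    rows = [(depth, column, transformation)]
    for parent in inputs:
        rows.extend(trace(parent, depth + 1))
    return rows


# Print an indented tree: each level is one step further upstream.
for depth, column, transformation in trace("marts.revenue.revenue_eur"):
    print("  " * depth + f"{column}  <- {transformation}")
```

Dataset-level lineage would only say that `marts.revenue` depends on `raw.orders` and `raw.fx_rates`; the column-level record also shows which fields are involved and how they are combined.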
Lineage is captured either automatically, by parsing queries, dbt manifests, Spark logs, or Airflow DAGs (the pipeline itself tells you), or manually, by documenting it in a data catalog. Manual lineage goes stale as soon as pipelines change, so prefer automatic capture wherever possible.
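A rough sketch of what "parsing queries" means: the function below pulls the target table and the source tables out of a single simple SQL statement with regular expressions. This is a deliberately naive illustration, not how production tools work; real extractors rely on a full SQL parser to handle subqueries, CTEs, and quoting. The query and table names are hypothetical.

```python
import re

# Deliberately naive patterns; real lineage extractors use a full SQL parser.
TARGET = re.compile(r"\b(?:INSERT\s+INTO|CREATE\s+TABLE)\s+([\w.]+)", re.IGNORECASE)
SOURCE = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)


def extract_lineage(sql):
    """Return (target table, sorted source tables) for one simple statement."""
    target = TARGET.search(sql)
    sources = sorted(set(SOURCE.findall(sql)))
    return (target.group(1) if target else None, sources)


query = """
INSERT INTO marts.revenue
SELECT o.customer_id, SUM(o.amount * fx.rate) AS revenue
FROM staging.orders AS o
JOIN staging.fx_rates AS fx ON fx.currency = o.currency
GROUP BY o.customer_id
"""

target, sources = extract_lineage(query)
print(target, "<-", sources)
```

Because the lineage is derived from the query that actually runs, it updates whenever the pipeline does, which is the advantage over hand-maintained documentation.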
Tools: OpenLineage (open standard for emitting lineage events); Marquez, DataHub, Atlan, and Collibra (catalogs that ingest and visualize lineage). Some tools also provide column-level lineage natively inside their own graph.
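To give a feel for what a lineage event looks like, the sketch below builds a payload shaped like an OpenLineage run event using only the standard library. The top-level field names follow the OpenLineage specification as I understand it; the namespace, job name, dataset names, `producer` URL, and the spec version in `schemaURL` are assumptions, so check https://openlineage.io/ for the current schema before relying on it.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(job_name, inputs, outputs, namespace="example"):
    """Build a dict shaped like an OpenLineage run event (illustrative only)."""
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [{"namespace": namespace, "name": name} for name in inputs],
        "outputs": [{"namespace": namespace, "name": name} for name in outputs],
        # Hypothetical producer URI identifying the emitting pipeline.
        "producer": "https://example.com/my-pipeline",
        # Assumed spec version; verify against the published specification.
        "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent",
    }


event = build_run_event(
    job_name="build_revenue_mart",
    inputs=["staging.orders", "staging.fx_rates"],
    outputs=["marts.revenue"],
)
print(json.dumps(event, indent=2))
```

A catalog that ingests such events can connect the `inputs` and `outputs` of each job into the lineage graph without anyone documenting it by hand.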
Grounded on https://openlineage.io/