Data Lineage

The map of where data comes from and where it goes: upstream sources, transformations, downstream consumers. Essential for trust, impact analysis, and compliance.

Data lineage is the family tree of your data. For any table, column, or dashboard, it lets you answer two questions: **where did this come from?** (upstream) and **what depends on it?** (downstream).

Why does it matter? There are three concrete moments when you need it. **Trust**: when someone questions a number, you trace it back to the source. **Impact analysis**: before you change a column in the source, you know which dashboards will break. **Compliance**: for regulations such as GDPR, you prove that personal data flows stay within approved systems.
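Both the trust question and the impact question come down to walking a graph. Here is a minimal sketch, assuming a hypothetical dataset-level graph stored as a plain dictionary; the node names and edges are invented for illustration and a real catalog would supply them.

```python
from collections import deque

# Hypothetical dataset-level lineage: each key feeds the datasets in its list.
# These edges are an assumption made for illustration only.
DOWNSTREAM = {
    "crm_api": ["raw.customers"],
    "raw.customers": ["staging.customers"],
    "staging.customers": ["facts.revenue_by_customer", "dims.customer"],
    "facts.revenue_by_customer": ["revenue_dashboard"],
    "dims.customer": ["revenue_dashboard", "churn_model"],
}


def invert(edges):
    """Flip a downstream map into an upstream map."""
    upstream = {}
    for parent, children in edges.items():
        for child in children:
            upstream.setdefault(child, []).append(parent)
    return upstream


def reachable(edges, start):
    """Breadth-first walk returning every node reachable from start."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def impact_of(dataset):
    """Impact analysis: everything that could break if dataset changes."""
    return reachable(DOWNSTREAM, dataset)


def sources_of(dataset):
    """Trust: everything dataset was ultimately derived from."""
    return reachable(invert(DOWNSTREAM), dataset)


print(sorted(impact_of("raw.customers")))
print(sorted(sources_of("churn_model")))
```

The same traversal serves both directions; only the edge map changes. A compliance check would be a third use: walk downstream from a dataset holding personal data and compare the result against a list of approved systems.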

Two levels of granularity: **dataset-level** (the customers table comes from the CRM) is enough for high-level navigation. **Column-level** (this revenue number is the sum of orders.amount after a currency conversion) is required for root-cause analysis and for GDPR proof.
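To make the difference concrete, here is a small sketch of column-level lineage as a data structure. The table names, column names, and transformation strings are all hypothetical, loosely modelled on the revenue example above; real tools derive this mapping from the SQL itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    table: str
    name: str


# Hypothetical column-level lineage:
# output column -> (input columns, transformation that produced it).
COLUMN_LINEAGE = {
    Column("facts.revenue", "revenue_eur"): (
        [Column("staging.orders", "amount_eur")],
        "SUM(amount_eur)",
    ),
    Column("staging.orders", "amount_eur"): (
        [Column("raw.orders", "amount"), Column("raw.fx_rates", "rate_to_eur")],
        "amount * rate_to_eur",
    ),
}


def trace(column, depth=0):
    """Return (depth, column, transformation) rows from a column back to its raw inputs."""
    inputs, transformation = COLUMN_LINEAGE.get(column, ([], "source column"))
    rows = [(depth, column, transformation)]
    for parent in inputs:
        rows.extend(trace(parent, depth + 1))
    return rows


for depth, col, how in trace(Column("facts.revenue", "revenue_eur")):
    print("  " * depth + f"{col.table}.{col.name}  <-  {how}")
```

Dataset-level lineage would only tell you that facts.revenue depends on staging.orders. The column-level trace also records which columns were involved and how they were combined, which is what a root-cause investigation needs.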

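Lineage is captured either automatically or manually. Automatic capture parses queries, dbt manifests, Spark logs, or Airflow DAGs, so the pipeline itself tells you what feeds what. Manual capture means someone documents the flow in a data catalog, and that documentation goes stale as soon as the pipeline changes. Prefer automatic capture.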
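As a rough illustration of automatic capture, the sketch below pulls lineage edges out of a dbt manifest. It assumes the manifest exposes a top-level `parent_map` key mapping each node ID to its upstream node IDs, as recent dbt versions do; the embedded JSON is a trimmed, hypothetical stand-in for `target/manifest.json`, and the project name `shop` is invented.

```python
import json

# A trimmed, hypothetical stand-in for dbt's target/manifest.json.
MANIFEST_JSON = """
{
  "parent_map": {
    "source.shop.crm.customers": [],
    "model.shop.stg_customers": ["source.shop.crm.customers"],
    "model.shop.dim_customer": ["model.shop.stg_customers"],
    "model.shop.revenue_by_customer": ["model.shop.stg_customers"]
  }
}
"""


def lineage_edges(manifest):
    """Turn a parent_map into sorted (upstream, downstream) edges."""
    edges = []
    for child, parents in manifest.get("parent_map", {}).items():
        for parent in parents:
            edges.append((parent, child))
    return sorted(edges)


edges = lineage_edges(json.loads(MANIFEST_JSON))
for upstream, downstream in edges:
    print(f"{upstream} -> {downstream}")
```

Because the manifest is regenerated on every dbt run, edges extracted this way cannot drift from the code the way a hand-maintained catalog entry can.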

Tools: OpenLineage (open standard for emitting lineage events), Marquez / DataHub / Atlan / Collibra (catalogs that ingest and visualize lineage). Some transformation tools also track column-level lineage natively inside their own graph.
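To give a feel for what "emitting a lineage event" means, here is a sketch that builds an OpenLineage-style run event as plain JSON using only the standard library. The field names follow the OpenLineage run event model (run, job, inputs, outputs); the namespace, producer URI, job name, and the spec version embedded in `SCHEMA_URL` are assumptions for illustration, so check them against the spec version your backend expects.

```python
import json
import uuid
from datetime import datetime, timezone

# All three constants are hypothetical placeholders.
NAMESPACE = "example_warehouse"
PRODUCER = "https://example.com/lineage-sketch"
# The spec version in this URL is an assumption; use the one your client targets.
SCHEMA_URL = "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent"


def build_run_event(job_name, inputs, outputs, event_type="COMPLETE"):
    """Build a dict shaped like an OpenLineage run event for one job run."""
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": NAMESPACE, "name": job_name},
        "inputs": [{"namespace": NAMESPACE, "name": name} for name in inputs],
        "outputs": [{"namespace": NAMESPACE, "name": name} for name in outputs],
        "producer": PRODUCER,
        "schemaURL": SCHEMA_URL,
    }


event = build_run_event(
    job_name="build_staging_customers",
    inputs=["raw.customers"],
    outputs=["staging.customers"],
)
print(json.dumps(event, indent=2))
```

In practice you would rarely build these by hand: an integration for your orchestrator or engine emits them, and a catalog such as Marquez or DataHub receives them over HTTP and stitches the inputs and outputs into a graph.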

Diagram (example lineage, upstream to downstream)

CRM API (Salesforce)
→ raw.customers
→ staging.customers (dedupe + normalize)
→ facts.revenue_by_customer, dims.customer
→ Revenue dashboard (Looker), Churn ML model

Grounded on https://openlineage.io/
