Data Catalog & Metadata

The searchable yellow-pages of your data estate: what datasets exist, what they mean, who owns them, and where they come from.

A data catalog is a metadata repository with discovery, context, and governance capabilities. It ingests metadata from sources (databases, pipelines, BI tools, ML platforms) and indexes it for search, browse, and programmatic access.

Metadata taxonomy: technical (schema, column types, nullability, size, partition scheme, storage location); business (definition, domain, owner, criticality, tags, glossary terms); operational (refresh frequency, SLA, quality scores, last-run timestamp); social (comments, question threads, endorsements, usage statistics); lineage (upstream/downstream graph).
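One way to picture the taxonomy is as a single catalog entry with one nested record per metadata category. The sketch below is a minimal illustration, not any particular catalog's data model: the class names, field names, URN format, and sample values are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Technical:
    columns: dict[str, str]      # column name -> type
    nullable: dict[str, bool]    # column name -> may be NULL?
    size_bytes: int
    partition_scheme: str
    storage_location: str


@dataclass
class Business:
    definition: str
    domain: str
    owner: str
    criticality: str
    tags: list[str] = field(default_factory=list)
    glossary_terms: list[str] = field(default_factory=list)


@dataclass
class Operational:
    refresh_frequency: str
    sla: str
    quality_score: float
    last_run: datetime


@dataclass
class Social:
    comments: list[str] = field(default_factory=list)
    endorsements: int = 0
    queries_last_30d: int = 0    # one possible usage statistic


@dataclass
class Lineage:
    upstream: list[str] = field(default_factory=list)
    downstream: list[str] = field(default_factory=list)


@dataclass
class CatalogEntry:
    urn: str
    technical: Technical
    business: Business
    operational: Operational
    social: Social = field(default_factory=Social)
    lineage: Lineage = field(default_factory=Lineage)


# Hypothetical dataset; every value here is made up for illustration.
orders = CatalogEntry(
    urn="urn:example:warehouse.sales.orders",
    technical=Technical(
        columns={"order_id": "BIGINT", "amount": "DECIMAL(12,2)"},
        nullable={"order_id": False, "amount": True},
        size_bytes=1_200_000,
        partition_scheme="daily, by order_date",
        storage_location="s3://example-bucket/sales/orders/",
    ),
    business=Business(
        definition="One row per confirmed customer order.",
        domain="Sales",
        owner="sales-data-team",
        criticality="high",
        glossary_terms=["Order"],
    ),
    operational=Operational(
        refresh_frequency="daily",
        sla="available by 06:00 UTC",
        quality_score=0.98,
        last_run=datetime(2024, 1, 15, 5, 30, tzinfo=timezone.utc),
    ),
    lineage=Lineage(upstream=["urn:example:raw.order_events"]),
)
```

Keeping the categories as separate records mirrors how they are populated: technical and operational fields can be harvested automatically, while business and social fields come from people.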

Architecture: many modern catalogs combine pull and push. Scheduled crawlers scan sources for structure, while event emitters (OpenLineage, audit logs) push runtime facts: who ran which query and when, and which dashboards read which table.
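The two ingestion paths can be sketched as two write methods on the same store: the crawler replaces the structural snapshot, and pushed events append runtime facts. This is a toy in-memory model under assumed names (`MetadataStore`, `apply_crawl`, `apply_event`, and the event fields are all hypothetical); it does not use the OpenLineage client or any real catalog API.

```python
from datetime import datetime, timezone


class MetadataStore:
    """In-memory stand-in for a catalog's metadata store (hypothetical)."""

    def __init__(self):
        self.assets = {}

    def _asset(self, urn):
        # Create an empty record the first time a URN is seen on either path.
        return self.assets.setdefault(
            urn, {"schema": {}, "last_crawled": None, "runtime_events": []}
        )

    def apply_crawl(self, urn, schema):
        """Pull path: a scheduled crawl overwrites the structural snapshot."""
        asset = self._asset(urn)
        asset["schema"] = dict(schema)
        asset["last_crawled"] = datetime.now(timezone.utc)

    def apply_event(self, event):
        """Push path: a runtime fact is appended as it arrives."""
        self._asset(event["urn"])["runtime_events"].append(event)


def run_crawler(source_schemas, store):
    """Simulate one scheduled crawl over a source's tables."""
    for urn, schema in source_schemas.items():
        store.apply_crawl(urn, schema)


store = MetadataStore()

# Pull: the crawler discovers structure.
run_crawler({"urn:example:sales.orders": {"order_id": "BIGINT"}}, store)

# Push: an emitter reports who queried the table and what consumed it.
store.apply_event({
    "urn": "urn:example:sales.orders",
    "type": "query_executed",
    "actor": "analyst@example.com",
    "consumer": "dashboard:weekly-revenue",
})
```

The split matters because the two kinds of metadata change at different rates: structure is fairly stable and cheap to re-scan on a schedule, while usage facts only exist at run time and are lost if nothing emits them.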

Glossary vs catalog: the glossary is the authoritative source of term definitions ('Annual Recurring Revenue = ...'). The catalog ties physical assets (tables, columns) to glossary terms. A well-run glossary is the 'rosetta stone' when legal, finance, and engineering each say 'customer' and mean three different things.
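The glossary-to-asset relationship can be sketched as a small many-to-many mapping. The three 'customer' definitions and the column URNs below are invented purely for illustration, and `link`, `assets_for`, and `terms_for` are hypothetical helper names.

```python
from collections import defaultdict

# Glossary: term -> authoritative definition (illustrative wording only).
glossary = {
    "Customer (Legal)": "A party with a signed contract.",
    "Customer (Finance)": "An account that has been invoiced at least once.",
    "Customer (Engineering)": "Any registered user account.",
}

# Catalog side: which physical assets implement which glossary term.
term_links = defaultdict(set)  # term -> set of asset URNs


def link(asset_urn, term):
    """Attach a glossary term to a physical asset; reject undefined terms."""
    if term not in glossary:
        raise KeyError(f"unknown glossary term: {term}")
    term_links[term].add(asset_urn)


def assets_for(term):
    """All assets tagged with a term, in a stable order."""
    return sorted(term_links.get(term, set()))


def terms_for(asset_urn):
    """All glossary terms attached to one asset."""
    return sorted(t for t, urns in term_links.items() if asset_urn in urns)


link("urn:example:crm.contracts.party_id", "Customer (Legal)")
link("urn:example:billing.invoices.account_id", "Customer (Finance)")
link("urn:example:app.users.user_id", "Customer (Engineering)")
```

Rejecting links to undefined terms keeps the glossary authoritative: a new meaning has to be defined once, centrally, before any table can claim it.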

Integration points: authentication (SSO through the identity provider, IdP), permissions (IAM governing who can see metadata, which is separate from access to the underlying data), and APIs (search, fetch asset by URN, emit lineage). Catalogs commonly expose a REST API, often alongside GraphQL and a streaming interface (a Kafka-like topic of metadata events).
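To make the 'permissions on metadata, not data' point concrete, here is a minimal in-memory sketch of the two read operations, search and fetch by URN, each gated by a metadata-visibility check. The `Catalog` class, its methods, and the group names are assumptions for the example and do not correspond to any specific product's API.

```python
class MetadataAccessDenied(Exception):
    """Raised when a caller may not view an asset's metadata."""


class Catalog:
    def __init__(self):
        self._assets = {}   # urn -> metadata dict
        self._readers = {}  # urn -> groups allowed to VIEW METADATA

    def register(self, urn, metadata, reader_groups):
        self._assets[urn] = metadata
        self._readers[urn] = set(reader_groups)

    def _can_view(self, urn, groups):
        # This governs metadata visibility only. Reading the table itself
        # is still decided by the source system's own access controls.
        return bool(self._readers[urn] & set(groups))

    def get(self, urn, groups):
        """Fetch one asset's metadata by URN."""
        if urn not in self._assets:
            raise KeyError(urn)
        if not self._can_view(urn, groups):
            raise MetadataAccessDenied(urn)
        return self._assets[urn]

    def search(self, query, groups):
        """Case-insensitive substring search over name and description."""
        q = query.lower()
        return sorted(
            urn
            for urn, meta in self._assets.items()
            if self._can_view(urn, groups)
            and (q in meta["name"].lower() or q in meta["description"].lower())
        )


catalog = Catalog()
catalog.register(
    "urn:example:sales.orders",
    {"name": "orders", "description": "Confirmed customer orders"},
    reader_groups=["all-employees"],
)
catalog.register(
    "urn:example:hr.salaries",
    {"name": "salaries", "description": "Employee compensation orders"},
    reader_groups=["hr"],
)

# An engineer in 'all-employees' finds only what they are allowed to see.
visible = catalog.search("orders", groups=["all-employees"])
```

In a real deployment the caller's groups would come from the IdP after SSO rather than being passed in by hand.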

Key success factors: (a) broad source coverage (warehouse + lakes + BI + ML + streaming + on-prem) — partial coverage breeds distrust; (b) curation loop — technical metadata auto-harvested, business metadata crowd-sourced from stewards, quality enforced via certification; (c) embedded in workflow — integrate with IDEs, BI tools, query runners so the catalog surfaces where engineers already are.

Anti-patterns: (i) 'build it and they will come': the catalog rots if no one curates business metadata; (ii) one-shot import: metadata loaded once (say, in 2023) and never refreshed; (iii) siloed per-team catalogs: these defeat the value of unified search.
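The one-shot import problem is easy to detect if the catalog records when each asset's metadata was last refreshed. The check below is a small sketch: the function name, the 30-day threshold, and the sample timestamps are all assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone


def stale_assets(last_refreshed, now, max_age=timedelta(days=30)):
    """Return URNs whose metadata is older than max_age, in sorted order."""
    return sorted(
        urn for urn, refreshed_at in last_refreshed.items()
        if now - refreshed_at > max_age
    )


# Fixed 'now' so the example is reproducible.
now = datetime(2024, 6, 1, tzinfo=timezone.utc)

last_refreshed = {
    "urn:example:sales.orders": datetime(2024, 5, 30, tzinfo=timezone.utc),
    "urn:example:legacy.customers": datetime(2023, 3, 1, tzinfo=timezone.utc),
}

stale = stale_assets(last_refreshed, now)
```

A report like this could feed the curation loop, for example by notifying the owner of each stale asset or by lowering its ranking in search.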

Grounded on https://datahub.com/
