Data Catalog & Metadata

The searchable yellow-pages of your data estate: what datasets exist, what they mean, who owns them, and where they come from.

A data catalog is Google + Wikipedia for your data. Anyone in the company can search 'revenue', see the 8 tables related to revenue, and read what each one means, who owns it, when it was last updated, and how trustworthy it is.

Without a catalog, new hires spend weeks asking 'where is X?' and analysts build duplicate tables because they don't know someone already built one. Time lost, money lost, trust lost.

A catalog stores three kinds of metadata: technical (schema, column types, sizes, partitions), business (what does this dataset mean? who owns it? is it reliable?), and operational (when was it last refreshed, quality score, usage stats).
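As a rough illustration, the three kinds could be modeled as one record per dataset, with a naive keyword search on top. This is a minimal sketch, not the schema of any particular catalog tool: every class name, field, and sample dataset below is hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class TechnicalMetadata:
    columns: dict[str, str]              # column name -> column type
    size_bytes: int
    partitions: list[str] = field(default_factory=list)


@dataclass
class BusinessMetadata:
    description: str                     # what the dataset means
    owner: str                           # who to ask about it
    certified: bool = False              # has it been reviewed?


@dataclass
class OperationalMetadata:
    last_refreshed: datetime
    quality_score: float                 # hypothetical 0.0 to 1.0 scale
    queries_last_30d: int = 0            # simple usage statistic


@dataclass
class CatalogEntry:
    name: str
    technical: TechnicalMetadata
    business: BusinessMetadata
    operational: OperationalMetadata


def search(entries, term):
    """Return entries whose name, description, or column names mention the term."""
    term = term.lower()
    hits = []
    for entry in entries:
        haystack = [entry.name, entry.business.description, *entry.technical.columns]
        if any(term in text.lower() for text in haystack):
            hits.append(entry)
    return hits


# Hypothetical sample catalog with two datasets.
CATALOG = [
    CatalogEntry(
        name="finance.revenue_daily",
        technical=TechnicalMetadata({"day": "date", "amount_usd": "decimal"}, 52_000_000, ["day"]),
        business=BusinessMetadata("Recognized revenue per day", "finance-data", certified=True),
        operational=OperationalMetadata(datetime(2024, 1, 15, 6, 0), 0.98, 340),
    ),
    CatalogEntry(
        name="hr.headcount",
        technical=TechnicalMetadata({"month": "date", "employees": "int"}, 120_000),
        business=BusinessMetadata("Employee count per month", "people-analytics"),
        operational=OperationalMetadata(datetime(2024, 1, 1, 0, 0), 0.90, 12),
    ),
]

for hit in search(CATALOG, "revenue"):
    print(hit.name, "| owner:", hit.business.owner, "| certified:", hit.business.certified)
```

Real catalogs index far more fields and rank results, but the shape is the same: one entry per dataset, carrying all three kinds of metadata side by side.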

Modern catalogs also have: search, lineage graphs, quality metrics, glossary (shared business terms: 'revenue' means THIS, not that), certification badges ('Gold-level reviewed dataset'), usage analytics (which tables matter).
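To make the "where does this data come from?" question concrete, here is a minimal sketch of a lineage lookup. It assumes lineage is stored as a mapping from each dataset to its direct upstream sources; the mapping, the dataset names, and the `all_upstream` helper are hypothetical and not taken from any specific tool.

```python
from collections import deque

# Hypothetical lineage: dataset -> datasets it is built directly from.
UPSTREAM = {
    "dash.revenue_report": ["finance.revenue_daily"],
    "finance.revenue_daily": ["raw.orders", "raw.refunds"],
    "raw.orders": [],
    "raw.refunds": [],
}


def all_upstream(dataset, upstream=UPSTREAM):
    """Breadth-first walk returning every direct and indirect source of a dataset."""
    seen = set()
    queue = deque(upstream.get(dataset, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited; also protects against cycles
        seen.add(current)
        queue.extend(upstream.get(current, []))
    return seen


print(sorted(all_upstream("dash.revenue_report")))
```

The same walk in the opposite direction (dataset to its consumers) would answer the impact question: what breaks if this table changes.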

Tools: Open source — DataHub (LinkedIn), OpenMetadata, Amundsen (Lyft), Marquez. Commercial — Collibra, Alation, Atlan, Select Star, Informatica CDGC. Cloud-native — Unity Catalog (Databricks), Dataplex (GCP), Purview (Azure), Lake Formation (AWS).

Grounded on https://datahub.com/
