Data Classification

Labeling every dataset by its sensitivity so the right controls (access, encryption, retention) apply automatically.

Classification is the foundation of most downstream controls: access policies, encryption keys, DLP (Data Loss Prevention) rules, data residency enforcement, retention policies, and audit scope all hinge on a data asset's classification label.

Typical 4-tier schema: Public (intended for external distribution), Internal (default for employees, no material damage if leaked), Confidential (business-sensitive, such as financials, strategy, and IP; limited distribution), Restricted / Regulated (personal data under GDPR, PHI under HIPAA, PCI cardholder data, authentication secrets; a breach can trigger legal notification obligations). Some orgs add a 'Secret' tier for crown-jewel IP.
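A minimal sketch of the four tiers as an ordered type, so that code can compare labels ("is this at least Confidential?") instead of matching strings. The names `Classification`, `BASELINE_CONTROLS`, `parse_label`, and `controls_for`, and the control values themselves, are illustrative assumptions and not a prescribed policy.

```python
from enum import IntEnum


class Classification(IntEnum):
    """Four-tier schema; a higher value means more sensitive."""
    PUBLIC = 0        # intended for external distribution
    INTERNAL = 1      # default for employees
    CONFIDENTIAL = 2  # business-sensitive, limited distribution
    RESTRICTED = 3    # regulated data: GDPR personal data, PHI, PCI, secrets


# Hypothetical baseline controls per tier (illustrative values only).
BASELINE_CONTROLS = {
    Classification.PUBLIC: {"access": "anyone", "masking": False},
    Classification.INTERNAL: {"access": "employees", "masking": False},
    Classification.CONFIDENTIAL: {"access": "named_groups", "masking": False},
    Classification.RESTRICTED: {"access": "approved_individuals", "masking": True},
}


def parse_label(text: str) -> Classification:
    """Turn a catalog string such as 'restricted' into a tier.

    Raises KeyError for unknown labels so that typos fail loudly.
    """
    return Classification[text.strip().upper()]


def controls_for(label: Classification) -> dict:
    """Look up the baseline controls that apply to a tier."""
    return BASELINE_CONTROLS[label]
```

Because the tiers are ordered integers, checks such as `label >= Classification.CONFIDENTIAL` and `max(a, b)` work directly, which is convenient when labels need to be combined.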

Classification has two flavors: (a) Static classification, which follows from the dataset's nature (salaries = confidential, customer PII = restricted); (b) Contextual classification, where the same content carries a different sensitivity depending on what it is combined with (an employee first name alone = low; first name + SSN = restricted because of the combination). Tooling needs to detect both.
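The difference between the two flavors can be sketched as a static per-field lookup plus a set of combination rules that only fire when certain fields appear together. The field names, the per-field levels (including rating a bare SSN as Confidential), and the `classify_columns` helper are assumptions for illustration.

```python
from enum import IntEnum


class Classification(IntEnum):
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


# Static classification: sensitivity of each field on its own (assumed values).
FIELD_CLASS = {
    "first_name": Classification.INTERNAL,
    "salary": Classification.CONFIDENTIAL,
    "ssn": Classification.CONFIDENTIAL,  # assumption; many orgs rate this higher
}

# Contextual classification: field combinations that raise sensitivity.
COMBINATION_RULES = [
    (frozenset({"first_name", "ssn"}), Classification.RESTRICTED),
]

# Fields missing from the static table fall back to this level.
DEFAULT = Classification.INTERNAL


def classify_columns(columns) -> Classification:
    """Return the highest level implied by static and contextual rules."""
    present = set(columns)
    static = max((FIELD_CLASS.get(c, DEFAULT) for c in present), default=DEFAULT)
    contextual = max(
        (level for combo, level in COMBINATION_RULES if combo <= present),
        default=Classification.PUBLIC,
    )
    return max(static, contextual)
```

With these assumed rules, a table holding only `first_name` stays Internal, while a table holding both `first_name` and `ssn` is raised to Restricted by the combination rule rather than by either field alone.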

Implementation patterns: tag data at the catalog level (Collibra, DataHub, Atlan) with a required 'classification' field; propagate labels via lineage (derived tables inherit the maximum class of their sources); enforce in access systems (Snowflake masking policies, BigQuery policy tags, Lake Formation LF-Tags) that read the classification and apply dynamic masking.
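The "inherit the max class of the sources" rule can be sketched in a few lines. This is a toy model and not any catalog's API: `lineage` maps each derived table to its direct sources, `tagged` holds the explicit labels, and both structures, the table names, and the `propagate` function are hypothetical. The sketch assumes the lineage graph is acyclic.

```python
from enum import IntEnum


class Classification(IntEnum):
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


def propagate(lineage: dict, tagged: dict) -> dict:
    """Resolve an effective label for every table.

    A table's effective label is the maximum of its own explicit tag (if any)
    and the effective labels of all of its sources. Source tables without a
    tag are rejected, mirroring a required 'classification' field.
    """
    resolved = {}

    def resolve(table):
        if table in resolved:
            return resolved[table]
        sources = lineage.get(table, [])
        if not sources and table not in tagged:
            raise ValueError(f"{table!r} has no classification and no sources")
        inherited = max((resolve(s) for s in sources), default=Classification.PUBLIC)
        own = tagged.get(table, Classification.PUBLIC)
        resolved[table] = max(own, inherited)
        return resolved[table]

    for table in set(lineage) | set(tagged):
        resolve(table)
    return resolved


# Hypothetical lineage: sales_daily <- sales_enriched <- (orders, customers)
lineage = {
    "sales_enriched": ["orders", "customers"],
    "sales_daily": ["sales_enriched"],
}
tagged = {
    "orders": Classification.INTERNAL,
    "customers": Classification.RESTRICTED,
}
effective = propagate(lineage, tagged)
```

Here `sales_daily` ends up Restricted even though nobody tagged it, because one of its upstream sources is Restricted. Taking the maximum is deliberately conservative; a real deployment would likely also need a reviewed way to downgrade a derived table, for example after aggregation or anonymization.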

Automated discovery: scanners (BigID, Varonis, native cloud scanners) crawl datasets, run regex + ML classifiers, and propose classifications for review. Critical for estates with thousands of tables where manual tagging is unrealistic. Treat auto-classifications as proposals requiring steward validation — not as truth.
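As a rough illustration of the regex half of such a scanner, the sketch below samples values from a column, counts pattern hits, and emits proposals that start in a pending state for a steward to confirm. It does not reflect how BigID, Varonis, or any cloud scanner works internally; the patterns, the 0.5 threshold, and the `Proposal` and `scan_column` names are assumptions, and real detectors add validation and ML-based classifiers on top.

```python
import re
from dataclasses import dataclass

# Simplified detectors (assumed patterns; production scanners validate further).
PATTERNS = {
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


@dataclass
class Proposal:
    column: str
    detector: str
    proposed_label: str
    match_ratio: float
    status: str = "pending_review"  # a steward must confirm or reject


def scan_column(column: str, samples: list, threshold: float = 0.5) -> list:
    """Propose a label when enough sampled values match a detector."""
    values = [s for s in samples if s]  # ignore empty values
    if not values:
        return []
    proposals = []
    for name, pattern in PATTERNS.items():
        hits = sum(1 for value in values if pattern.search(value))
        ratio = hits / len(values)
        if ratio >= threshold:
            proposals.append(Proposal(column, name, "Restricted", ratio))
    return proposals


def approve(proposal: Proposal, steward: str) -> Proposal:
    """Record the steward's validation; only then is the label trusted."""
    proposal.status = f"approved_by:{steward}"
    return proposal
```

Keeping the `status` field separate from the proposed label makes the "proposal, not truth" rule explicit: downstream enforcement can ignore anything still marked `pending_review`.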

Common pitfalls: (i) over-classifying (everything becomes Confidential, so controls become noise); (ii) no escalation path when a dataset's sensitivity changes (e.g., a previously internal field now contains PII); (iii) classification stuck in a document nobody reads instead of embedded in tooling.

Grounded on https://www.isaca.org/resources/news-and-trends/industry-news/2020/data-classification
