Data Contracts¶
Executive Summary¶
- Data contracts are the interface layer between data producers and consumers. They define what a data product delivers, how it behaves, and what guarantees it makes.
- A contract specifies schema, quality expectations, SLAs, ownership, and evolution rules -- everything a consumer needs to build against a data product without knowing the producer's internals.
- Without contracts, every downstream consumer is coupled to the producer's internal implementation. One column rename, one type change, one shifted schedule -- and things break silently.
- Contracts are the mechanism that makes EDP and operational platform coexistence work. They formalize the handoff points described in EDP vs Operational.
- They are not optional. They are the difference between "we have data products" and "we have shared tables that break."
What a Data Contract Contains¶
A data contract is a structured, machine-readable specification. It covers six areas.
Schema Definition¶
The exact shape of the data being delivered:
- Column names and data types
- Nullability constraints
- Column descriptions (business meaning, not just technical names)
- Primary key and uniqueness declarations
- Nested structure definitions for complex types
Quality Expectations¶
What "good data" means for this product:
- Freshness SLA -- maximum acceptable age of the most recent record
- Completeness -- minimum percentage of non-null values for critical columns
- Uniqueness -- columns or column combinations that must be unique
- Value constraints -- valid ranges, allowed enum values, referential integrity checks
- Volume expectations -- expected row count ranges per refresh (catches silent upstream failures)
Ownership¶
Who is accountable when something goes wrong:
- Producing team name and org unit
- Primary contact (team channel, not an individual)
- Escalation path for SLA breaches
- On-call rotation reference (if applicable)
SLA¶
The operational guarantees the producer commits to:
- Refresh frequency (hourly, daily, event-driven)
- Maximum staleness window
- Availability target (e.g., 99.5% for analytical products, 99.9% for serving endpoints)
- Latency bounds for query/serving endpoints
Evolution Rules¶
How the contract changes over time:
- Additive changes (new columns) are safe and do not require consumer coordination
- Breaking changes (type modifications, column removals, semantic shifts) require a deprecation window
- Minimum deprecation notice period (e.g., 30 days)
- Communication channel for change announcements
- Versioning strategy (v1 remains available during v2 migration)
Lineage Metadata¶
Where the data comes from:
- Source systems feeding this data product
- Transformation logic summary (not the full DAG -- a human-readable description)
- Refresh dependency chain (what must complete before this product refreshes)
- Data classification and sensitivity labels
Contract Example¶
A concrete contract for a customer analytics data product:
contract:
name: customer_360
version: 2.1.0
owner:
team: customer-domain
contact: customer-data@example.com
escalation: head-of-customer-data@example.com
description: Integrated customer view across all product lines
schema:
fields:
- name: customer_id
type: STRING
nullable: false
description: Global customer identifier (MDM-issued)
- name: full_name
type: STRING
nullable: false
pii: true
- name: lifetime_revenue
type: DECIMAL(18,2)
nullable: true
description: Total revenue across all products, lifetime
- name: risk_score
type: FLOAT
nullable: true
description: Composite risk score (0-1), updated daily
- name: segment
type: STRING
nullable: false
description: Customer segment (premium/standard/basic)
- name: last_activity_date
type: DATE
nullable: true
- name: active_products
type: INTEGER
nullable: false
quality:
freshness_sla: "daily by 06:00 UTC"
completeness: ">= 99.5%"
uniqueness: "customer_id is unique"
validity: "risk_score between 0 and 1"
sla:
availability: "99.9%"
refresh_frequency: "daily"
support_hours: "business hours (UTC+0)"
evolution:
policy: "additive-only for minor versions, breaking changes require major version bump"
deprecation_window: "90 days"
notification: "Slack #data-contracts, email to registered consumers"
lineage:
sources:
- "crm.customers (CDC via Debezium)"
- "core_banking.accounts (daily batch)"
- "payments.transactions (streaming via Kafka)"
transformations: "Deduplicated on MDM customer_id, revenue aggregated from transactions, risk score from ML model output"
Who Owns What¶
The contract is a boundary object. Three teams interact with it, but they own different pieces.
graph TB
subgraph Producer["Producer Team"]
PS[Schema definition]
PQ[Quality rules]
PR[Refresh schedule]
PL[Lineage metadata]
end
subgraph Contract["Data Contract"]
C[Contract Specification]
end
subgraph Platform["Platform Team"]
PE[Enforcement tooling]
PC[Contract registry]
PCI[CI/CD validation]
PM[Monitoring and alerting]
end
subgraph Consumer["Consumer Team"]
CS[SLA requirements]
CU[Usage patterns]
CF[Feedback on quality]
end
PS --> C
PQ --> C
PR --> C
PL --> C
C --> PE
C --> PC
C --> PCI
C --> PM
C --> CS
C --> CU
CF --> C Producer team writes the contract and is accountable for meeting it. They define the schema, set quality thresholds, commit to refresh schedules, and document lineage.
Platform team builds the infrastructure that enforces contracts. They run the contract registry, wire up CI/CD validation, and operate the monitoring that detects breaches. They do not write the contracts -- they make contracts enforceable.
Consumer team declares their requirements and consumes against the contract, not against the underlying implementation. They raise issues when the contract does not meet their needs. They do not reach past the contract to query raw tables.
Contract Enforcement Patterns¶
A contract that exists only in a YAML file is a suggestion. Enforcement is what makes it a contract.
Schema Validation on Write¶
Validate incoming data against the contract schema before it lands in the consumption layer. If the data does not conform -- wrong types, unexpected nulls, missing required columns -- reject it. The producer's pipeline fails, not the consumer's dashboard.
Quality Gates in Pipeline¶
Quality checks run as pipeline steps after data lands but before consumers can access it. If freshness, completeness, uniqueness, or volume thresholds breach, the pipeline fails and the previous good version remains active. Consumers see stale-but-correct data rather than fresh-but-broken data.
Contract CI/CD¶
Schema changes are validated before they merge. The CI pipeline diffs the proposed contract against the current version, flags breaking changes, and blocks the merge if breaking changes lack a deprecation plan. This is the same principle as API versioning -- you do not ship a breaking API change without a migration path.
Consumer Notification¶
When a contract changes -- even non-breaking changes -- consumers get notified automatically. New columns, updated descriptions, adjusted quality thresholds. This is not a courtesy. It is how consumers stay aware of what they are building against.
Evolution and Breaking Changes¶
Data contracts must evolve. The question is not whether they change, but how they change without breaking downstream systems.
Safe Changes (Non-Breaking)¶
- Adding columns -- existing consumers ignore columns they do not use
- Relaxing nullability -- a column that was required becoming optional does not break consumers
- Widening types -- int32 to int64, for example (consumer code handles the wider type)
- Adding new allowed values -- expanding an enum set
Breaking Changes¶
- Removing columns -- consumers referencing deleted columns break immediately
- Changing types -- string to integer, decimal precision changes, timestamp format changes
- Tightening nullability -- a previously optional column becoming required can break producers
- Renaming columns -- semantically identical, technically a drop-and-add
- Changing semantic meaning -- same column name, different business definition (the worst kind)
Versioning Strategy¶
Use semantic versioning for data products:
- v1 and v2 coexist during migration. Consumers are given a deprecation window (minimum 30 days, typically 90) to migrate.
- The producer maintains both versions until the deprecation window closes.
- After the window, the old version is removed. Any consumer who did not migrate breaks -- and that is on them. The deprecation window is the contract.
Anti-Patterns¶
The Contract Nobody Reads¶
What it looks like: Contracts exist in a Confluence page or a docs folder. They were written during project kickoff and never updated. No pipeline validates them. No alert fires when they are violated. They are technically correct and practically useless.
Why it fails: A contract that is not enforced in the pipeline is documentation, not a contract. Documentation drifts from reality. Within six months, the contract says one thing and the data does another.
Fix: Contracts must be machine-readable and enforced in CI/CD and pipeline execution. If the contract is not checked on every run, it does not exist.
The Contract That Blocks Everything¶
What it looks like: Contracts are so strict that any change -- adding a column, adjusting a threshold, updating a description -- requires a formal review process with multiple approvals. Teams stop evolving their data products because the overhead is not worth it.
Why it fails: Overly rigid governance creates shadow systems. Producers route around the contract by publishing "unofficial" datasets that are not governed at all. You end up with less governance, not more.
Fix: Separate additive changes (auto-approved, notify consumers) from breaking changes (require review and deprecation plan). Most contract changes should be frictionless.
The Verbal Contract¶
What it looks like: The producer and consumer teams had a meeting. They agreed on schema, refresh timing, and quality expectations. Everyone nodded. Nobody wrote it down. Six months later, people rotate off the team and the agreement evaporates.
Why it fails: Verbal agreements do not survive team changes, reorgs, or the passage of time. When something breaks, there is no reference point for what was actually promised.
Fix: If it is not in the contract spec, it was not agreed. Every agreement about data delivery gets codified in the contract YAML and enforced in the pipeline. No exceptions.