Platform Reliability Model¶

Executive Summary¶

Reliability for a data platform is not uptime alone. It is data freshness, pipeline success, query availability, and recovery speed -- measured against explicit SLOs.
Every data product must have a defined SLA. The platform must have SLOs that ensure those SLAs are achievable.
Incident classification must distinguish between platform-wide failures, critical data product issues, and non-critical operational noise.
Recovery is deterministic when source data is immutable. Bronze-layer preservation makes reprocessing a pattern, not a prayer.
Post-incident process is not optional. Every P1 and P2 incident gets a blameless post-mortem with root cause classification, action items, and runbook updates.

graph TD
    DET[Detection<br/>Monitoring Alert / Consumer Report] --> CLS{Classify}
    CLS --> |P1: Platform-wide| P1[Immediate Response<br/>15 min SLA]
    CLS --> |P2: Critical product| P2[Urgent Response<br/>30 min SLA]
    CLS --> |P3: Non-critical| P3[Standard Response<br/>2 hour SLA]
    CLS --> |P4: Minor| P4[Planned Response<br/>Next sprint]

    P1 --> RES[Resolve]
    P2 --> RES
    P3 --> RES
    P4 --> RES

    RES --> |P1 and P2| PM[Post-Mortem<br/>within 48 hours]
    PM --> RB[Update Runbooks]
    RB --> DET

Platform SLOs¶

SLOs define what "reliable" means. Without them, reliability is a feeling, not a measurement.

SLO	Target	Measurement
Pipeline success rate	> 99.5%	Successful runs / total scheduled runs per day
Data freshness	Within SLA per product	Time since last successful refresh vs product-level SLA
Query availability	> 99.5%	Successful queries / total queries per hour
Recovery time (P1)	< 2 hours	Time from detection to resolution for platform-wide incidents
Recovery time (P2)	< 4 hours	Time from detection to resolution for critical product incidents

Key principle: SLOs are not aspirations. They are commitments with error budgets. When the error budget is consumed, the team stops feature work and focuses on reliability. If leadership does not enforce this, SLOs are decorative.
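
The error-budget arithmetic can be sketched as follows, using the 99.5% pipeline success target. This is a minimal illustration, not a prescribed implementation: the function names and the run counts are hypothetical, and the rule that feature work stops once the budget reaches zero is encoded as an assumption.

```python
def error_budget_remaining(total_runs: int, failed_runs: int,
                           slo_target: float = 0.995) -> float:
    """Fraction of the error budget still unspent; negative means overspent."""
    if total_runs <= 0:
        raise ValueError("total_runs must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    # A 99.5% target over 2,000 runs allows roughly 10 failed runs.
    allowed_failures = total_runs * (1.0 - slo_target)
    return 1.0 - failed_runs / allowed_failures


def feature_work_allowed(total_runs: int, failed_runs: int,
                         slo_target: float = 0.995) -> bool:
    """Assumed policy: feature work continues only while budget remains."""
    return error_budget_remaining(total_runs, failed_runs, slo_target) > 0.0


# Hypothetical period: 2,000 scheduled runs.
print(f"{error_budget_remaining(2000, 4):.0%} of the budget left after 4 failures")
print("feature work allowed after 12 failures:", feature_work_allowed(2000, 12))
```

Expressing the budget as a fraction keeps the same check usable across SLOs with different denominators (runs per day, queries per hour).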

SLO Measurement Rules¶

Pipeline success rate excludes intentionally disabled pipelines. It includes retries, with every attempt counted as its own run -- a pipeline that fails twice and succeeds on the third attempt counts as two failures and one success.
Data freshness is measured continuously, not at a point in time. A product that is within SLA for 23 hours and stale for 1 hour has a freshness breach.
Query availability counts only platform-caused failures. A malformed user query that returns an error is not an availability failure.
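
The three measurement rules can be sketched in code. This is an illustration under stated assumptions: the `RunAttempt` type, the outcome labels, and the function names are hypothetical; every attempt (retries included) is counted as its own run; user-caused query errors are dropped from both numerator and denominator; and a window with no refresh on record at its start is treated as stale.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Sequence


@dataclass(frozen=True)
class RunAttempt:
    pipeline: str
    succeeded: bool
    intentionally_disabled: bool = False


def pipeline_success_rate(attempts: Iterable[RunAttempt]) -> float:
    """Successful attempts / all attempts, skipping disabled pipelines."""
    counted = [a for a in attempts if not a.intentionally_disabled]
    if not counted:
        return 1.0  # assumption: nothing scheduled means nothing failed
    return sum(a.succeeded for a in counted) / len(counted)


def freshness_breached(refreshes: Sequence[datetime], window_start: datetime,
                       window_end: datetime, sla: timedelta) -> bool:
    """True if data was older than `sla` at any instant inside the window."""
    known = sorted(t for t in refreshes if t <= window_end)
    if not known or known[0] > window_start:
        return True  # assumption: no refresh on record at window start is stale
    for previous, current in zip(known, known[1:]):
        # Staleness peaks just before each refresh lands.
        if current > window_start and current - previous > sla:
            return True
    return window_end - known[-1] > sla


def query_availability(outcomes: Iterable[str]) -> float:
    """Outcomes are 'ok', 'platform_error' or 'user_error' (hypothetical labels)."""
    counted = [o for o in outcomes if o != "user_error"]
    if not counted:
        return 1.0
    return counted.count("ok") / len(counted)
```

Checking the gap before every refresh, rather than sampling staleness once per day, is what makes the freshness measurement continuous: a single long gap inside an otherwise healthy day still registers as a breach.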

Incident Classification¶

Priority	Definition	Example	Response Time	Resolution Time
P1	Platform-wide failure or regulatory data unavailable	All pipelines down, audit data inaccessible, query engine offline	15 minutes	2 hours
P2	Critical data product stale or quality breach	Customer 360 more than 4 hours stale, regulatory report data quality below threshold	30 minutes	4 hours
P3	Non-critical pipeline failure	One domain's daily refresh failed, single non-critical product stale	2 hours	Next business day
P4	Minor issue, workaround available	Slow query performance, non-blocking metadata sync delay	4 hours	Planned sprint

Escalation Rules¶

P1 incidents trigger an immediate page to the on-call platform engineer and the engineering manager. If there is no acknowledgment within 15 minutes, escalate to the platform lead.
P2 incidents trigger a page to the on-call platform engineer. If there is no acknowledgment within 30 minutes, escalate to the engineering manager.
P3 and P4 incidents are handled during business hours via the standard ticket queue.
Any incident that is not resolved within its resolution window is automatically escalated one priority level.
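
The escalation rules above can be sketched as two small checks. The function and constant names are hypothetical, the resolution deadline is passed in by the caller because P3 and P4 windows are calendar-based rather than fixed durations, and leaving a P1 at P1 when it overruns is an assumption (there is no higher level to move to).

```python
from datetime import datetime, timedelta
from typing import Optional

# Acknowledgment windows and escalation targets from the rules above.
ACK_WINDOW = {"P1": timedelta(minutes=15), "P2": timedelta(minutes=30)}
ACK_ESCALATION = {"P1": "platform lead", "P2": "engineering manager"}
PRIORITY_ORDER = ["P4", "P3", "P2", "P1"]  # lowest to highest


def unacknowledged_escalation(priority: str, paged_at: datetime,
                              acknowledged: bool, now: datetime) -> Optional[str]:
    """Return who to escalate to, or None if no escalation is due."""
    window = ACK_WINDOW.get(priority)
    if window is None or acknowledged:
        return None  # P3/P4 go through the ticket queue, not paging
    return ACK_ESCALATION[priority] if now - paged_at >= window else None


def escalate_if_overdue(priority: str, resolution_deadline: datetime,
                        now: datetime) -> str:
    """Raise an unresolved incident one priority level once its window passes."""
    if now <= resolution_deadline or priority == "P1":
        return priority  # assumption: P1 has no higher level to move to
    return PRIORITY_ORDER[PRIORITY_ORDER.index(priority) + 1]
```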

Recovery Patterns¶

Recovery is not improvised. Each failure mode maps to a known recovery pattern.

Reprocessing¶

Replay from bronze. Source data in the landing zone is immutable, so recovery is deterministic -- the same input, run through the same transformation logic, produces the same output. This is why bronze-layer immutability is a non-negotiable architectural principle.

Consideration	Detail
When to use	Transformation logic was incorrect, silver/gold data is corrupted, pipeline produced wrong output
Prerequisite	Bronze data is intact and immutable
Impact	Downstream products are temporarily stale during reprocessing
Validation	Compare reprocessed output against known-good state or business rules
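
A minimal sketch of the replay-and-validate pattern, assuming a pure (side-effect-free) transformation. The record layout, `transform`, and `mismatched_ids` are hypothetical stand-ins for real silver-layer logic and validation tooling.

```python
from typing import Dict, List, Tuple

# Immutable bronze records (a tuple of tuples cannot be edited in place).
BRONZE: Tuple[Tuple[int, str], ...] = ((1, " de"), (2, "FR "), (3, "us"))


def transform(record: Tuple[int, str]) -> Dict[str, object]:
    """Hypothetical silver-layer logic; pure, so replays are repeatable."""
    customer_id, country = record
    return {"customer_id": customer_id, "country": country.strip().upper()}


def replay(bronze: Tuple[Tuple[int, str], ...]) -> List[Dict[str, object]]:
    """Rebuild the silver rows from bronze from scratch."""
    return [transform(r) for r in bronze]


def mismatched_ids(reprocessed: List[Dict[str, object]],
                   known_good: List[Dict[str, object]]) -> List[object]:
    """IDs whose reprocessed row differs from the known-good row."""
    expected = {row["customer_id"]: row for row in known_good}
    return [row["customer_id"] for row in reprocessed
            if expected.get(row["customer_id"]) != row]


# Replaying twice yields identical output because the input never changes.
print(replay(BRONZE) == replay(BRONZE))
```

If the transformation reads the clock, random numbers, or mutable lookup tables, the replay stops being deterministic even though bronze is immutable, so those inputs need to be pinned or versioned as well.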

Backfill¶

Re-ingest from source for a specific time window. Used when bronze data itself is missing or corrupted, or when a new source field must be historically populated.

Consideration	Detail
When to use	Bronze data is missing, source schema changed and history must be re-extracted
Prerequisite	Source system supports historical extraction for the required time window
Impact	Source system load increases during backfill; coordinate with source team
Validation	Row count reconciliation and checksum comparison against source
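
Row count reconciliation and checksum comparison might look like the sketch below. It assumes both sides can be read as lists of JSON-serializable rows; `table_checksum` and `reconcile` are hypothetical names, and a real source system may offer its own checksum facility that should be preferred.

```python
import hashlib
import json
from typing import Dict, Iterable, List


def table_checksum(rows: Iterable[dict]) -> str:
    """Order-independent SHA-256 over canonical JSON renderings of the rows."""
    row_digests = sorted(
        hashlib.sha256(json.dumps(row, sort_keys=True).encode("utf-8")).hexdigest()
        for row in rows
    )
    return hashlib.sha256("".join(row_digests).encode("utf-8")).hexdigest()


def reconcile(source_rows: List[dict], backfilled_rows: List[dict]) -> Dict[str, bool]:
    """Compare a backfilled window against the same window read from source."""
    return {
        "row_count_match": len(source_rows) == len(backfilled_rows),
        "checksum_match": table_checksum(source_rows) == table_checksum(backfilled_rows),
    }
```

Sorting the per-row digests before hashing means extraction order does not affect the result, which matters when the source and the platform return rows in different orders.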

Rollback¶

Revert to a previous version of a data product -- both schema and data. This is table-level time travel, not pipeline rollback.

Consideration	Detail
When to use	A bad deployment corrupted a data product, consumers need immediate restoration
Prerequisite	Table format supports time travel (Delta Lake, Iceberg) with sufficient retention
Impact	Consumers see previous version immediately; reprocessing can happen in parallel
Validation	Confirm rolled-back version matches expected state, notify consumers
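
The mechanics of table-level rollback can be illustrated with a toy model. `VersionedTable` below is a hypothetical in-memory stand-in, not the Delta Lake or Iceberg API; it only shows why retention is a prerequisite and how a restore can be recorded as a new version instead of rewriting history.

```python
from typing import Dict, List


class VersionedTable:
    """Toy stand-in for a table format with time travel; not a real library API."""

    def __init__(self, retained_versions: int) -> None:
        self.retained_versions = retained_versions
        self._snapshots: Dict[int, List[dict]] = {}
        self._latest = -1

    def commit(self, rows: List[dict]) -> int:
        """Store a new snapshot and return its version number."""
        self._latest += 1
        self._snapshots[self._latest] = [dict(r) for r in rows]
        # Drop snapshots that fall outside the retention window.
        for version in [v for v in self._snapshots
                        if v <= self._latest - self.retained_versions]:
            del self._snapshots[version]
        return self._latest

    def read(self) -> List[dict]:
        """Return a copy of the current version's rows."""
        return [dict(r) for r in self._snapshots[self._latest]]

    def restore(self, version: int) -> int:
        """Roll back by committing an old snapshot as the newest version."""
        if version not in self._snapshots:
            raise LookupError(f"version {version} is outside retention")
        return self.commit(self._snapshots[version])


table = VersionedTable(retained_versions=3)
good = table.commit([{"id": 1, "status": "active"}])
table.commit([{"id": 1, "status": None}])  # bad deployment
table.restore(good)
print(table.read())
```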

Failover¶

Switch to a disaster recovery region or replica. Used for infrastructure-level failures, not data quality issues.

Consideration	Detail
When to use	Primary region is unavailable, infrastructure failure, cloud provider incident
Prerequisite	DR region is provisioned, data replication is current, DNS/routing can be switched
Impact	RPO bounds how much recent data can be lost; RTO bounds how long consumers are without service
Validation	Confirm DR environment serves current data, test consumer connectivity
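
As a small worked example of the RPO and RTO terms, the sketch below compares an incident's actual data-loss window and downtime against the objectives. The names and the timestamps a caller would pass are hypothetical; the objectives themselves must come from the platform's DR plan.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class FailoverAssessment(NamedTuple):
    data_loss_window: timedelta  # writes after the last replicated point are lost
    within_rpo: bool
    downtime: timedelta          # time consumers could not be served
    within_rto: bool


def assess_failover(last_replicated_at: datetime, failed_at: datetime,
                    restored_at: datetime, rpo: timedelta,
                    rto: timedelta) -> FailoverAssessment:
    """Compare what a failover actually cost against the stated objectives."""
    data_loss_window = failed_at - last_replicated_at
    downtime = restored_at - failed_at
    return FailoverAssessment(
        data_loss_window=data_loss_window,
        within_rpo=data_loss_window <= rpo,
        downtime=downtime,
        within_rto=downtime <= rto,
    )
```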

Dependency Management¶

A data platform does not exist in isolation. It depends on upstream source systems and serves downstream consumers. Both directions must be mapped and managed.

Upstream Dependencies¶

Dependency	What to Track	Failure Behavior
Source systems (ERP, CRM, core banking)	Availability, schema version, data freshness	Retry with exponential backoff, alert after N failures
Event backbone (Kafka, Pub/Sub)	Consumer lag, partition health, throughput	Buffer locally if possible, alert on lag threshold breach
Third-party data vendors	Delivery schedule, file format, data quality	Hold processing, alert data steward, serve stale data with flag
Identity and access management	Authentication availability	Cache tokens, fail open for reads with audit, fail closed for writes
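
The source-system failure behavior in the table above (retry with exponential backoff, alert after N failures) can be sketched as follows. `fetch_with_backoff` and its parameters are hypothetical; `sleep` and `alert` are injected so the function can be exercised without waiting or paging anyone, and only `ConnectionError` is treated as retryable here.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def fetch_with_backoff(fetch: Callable[[], T], max_attempts: int = 5,
                       base_delay: float = 1.0, max_delay: float = 60.0,
                       sleep: Callable[[float], None] = time.sleep,
                       alert: Callable[[str], None] = print) -> T:
    """Call `fetch`, doubling the wait after each failure; alert after N failures."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except ConnectionError as exc:
            if attempt == max_attempts:
                alert(f"source unavailable after {attempt} attempts: {exc}")
                raise
            # Delays grow base, 2x base, 4x base, ... capped at max_delay.
            sleep(min(max_delay, base_delay * 2 ** (attempt - 1)))
    raise ValueError("max_attempts must be at least 1")
```

Production retry loops commonly add random jitter to the delay so that many pipelines recovering at once do not hit the source in lockstep; it is left out here to keep the sketch short.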

Downstream Dependencies¶

Dependency	What to Track	Failure Behavior
Data products and consumers	Consumer count, query patterns, SLA commitments	Notify consumers of staleness, serve stale data with metadata flag
ML model training pipelines	Feature freshness, training schedule	Delay training, do not serve stale features without explicit acknowledgment
Regulatory reporting	Report deadlines, data quality thresholds	Escalate immediately to P1 if regulatory deadline is at risk
Operational serving stores	Replication lag, consistency	Alert on lag, switch to direct source if replication fails

Circuit Breaker Pattern¶

Stop processing when upstream quality drops below an acceptable threshold. This prevents bad data from propagating through the platform.

Trigger: Upstream data fails quality checks (null rate spike, volume anomaly, schema drift) beyond a configured threshold.
Action: Halt downstream processing for that source, serve last-known-good data, alert data steward and source team.
Reset: Manual or automatic after upstream quality is restored and validated. Never auto-reset without quality validation.
Key principle: It is better to serve stale data with a freshness warning than to serve wrong data silently. The circuit breaker enforces this.
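
A minimal sketch of the trigger, action, and reset steps above. The class names and threshold values are hypothetical; serving last-known-good data and alerting the data steward are left to the caller, and the breaker only tracks whether processing may proceed.

```python
from dataclasses import dataclass
from enum import Enum


class BreakerState(Enum):
    CLOSED = "closed"  # processing runs normally
    OPEN = "open"      # processing halted; last-known-good data is served


@dataclass(frozen=True)
class BatchQuality:
    null_rate: float         # share of nulls in required columns
    volume_deviation: float  # relative difference from expected row count
    schema_matches: bool     # False when schema drift is detected


@dataclass
class QualityCircuitBreaker:
    max_null_rate: float = 0.05         # hypothetical thresholds
    max_volume_deviation: float = 0.30
    state: BreakerState = BreakerState.CLOSED

    def passes(self, q: BatchQuality) -> bool:
        return (q.schema_matches
                and q.null_rate <= self.max_null_rate
                and abs(q.volume_deviation) <= self.max_volume_deviation)

    def allow_processing(self, q: BatchQuality) -> bool:
        """Trip on a failing batch; stay open until reset() validates quality."""
        if self.state is BreakerState.CLOSED and not self.passes(q):
            self.state = BreakerState.OPEN  # caller alerts steward and source team
        return self.state is BreakerState.CLOSED

    def reset(self, validation_batch: BatchQuality) -> bool:
        """Close the breaker only if upstream quality has been re-validated."""
        if self.passes(validation_batch):
            self.state = BreakerState.CLOSED
        return self.state is BreakerState.CLOSED
```

Note that a good batch arriving while the breaker is open does not close it: `allow_processing` keeps returning `False` until `reset` is called with a batch that passes validation, which is how the sketch enforces "never auto-reset without quality validation".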

Post-Incident Process¶

Every P1 and P2 incident triggers a structured post-incident process. This is not optional, and it is not blame-assignment. It is organizational learning.

Blameless Post-Mortem¶

The post-mortem is held within 48 hours of incident resolution. Attendees include the responders, the platform lead, and representatives of the affected consumers.

Section	Content
Timeline	Minute-by-minute account: detection, response, diagnosis, resolution, verification
Root cause	The actual technical cause, not "human error." If a human made a mistake, ask what system allowed that mistake to have impact
Contributing factors	What made detection slow, response difficult, or resolution complex
Impact	Data products affected, consumers impacted, duration of impact, regulatory implications

Root Cause Classification¶

Every incident root cause is classified into one of four categories. This enables trend analysis across incidents.

Category	Example
Source	Upstream schema change without notice, source system outage, data quality degradation at source
Platform	Infrastructure failure, capacity exhaustion, configuration drift, deployment error
Transformation	Logic bug in pipeline, incorrect join, failed schema evolution handling
Consumer	Consumer query overloading the platform, consumer not respecting rate limits
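
Trend analysis over the four categories can be as simple as a count per category. The sketch below assumes incidents are available as (identifier, category) pairs; the incident IDs are made up for illustration.

```python
from collections import Counter
from enum import Enum
from typing import Iterable, Tuple


class RootCause(Enum):
    SOURCE = "source"
    PLATFORM = "platform"
    TRANSFORMATION = "transformation"
    CONSUMER = "consumer"


def root_cause_trend(incidents: Iterable[Tuple[str, RootCause]]) -> Counter:
    """Count incidents per root cause category."""
    return Counter(cause for _incident_id, cause in incidents)


# Hypothetical quarter of incidents.
quarter = [
    ("INC-101", RootCause.SOURCE),
    ("INC-102", RootCause.TRANSFORMATION),
    ("INC-103", RootCause.SOURCE),
]
for cause, count in root_cause_trend(quarter).most_common():
    print(cause.value, count)
```

Using an enum instead of free text keeps the classification to exactly one of the four categories, which is what makes counts comparable across quarters.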

Action Items¶

Every post-mortem produces action items. Every action item has an owner, a deadline, and a verification method.

Runbook update: If this failure mode was not in the runbook, add it. If the runbook was wrong, fix it.
Monitoring gap: If detection was slow, add the missing alert or dashboard.
Architectural fix: If the failure was structural, schedule the fix with a deadline -- do not leave it as tech debt without a timeline.
Process change: If the failure was procedural, update the process and communicate the change.
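
The requirement that every action item has an owner, a deadline, and a verification method can be enforced at the point where the item is recorded. A minimal sketch, with hypothetical type and field names:

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class ActionType(Enum):
    RUNBOOK_UPDATE = "runbook update"
    MONITORING_GAP = "monitoring gap"
    ARCHITECTURAL_FIX = "architectural fix"
    PROCESS_CHANGE = "process change"


@dataclass(frozen=True)
class ActionItem:
    action_type: ActionType
    description: str
    owner: str
    deadline: date       # required field: an item cannot be created without one
    verification: str    # how completion will be confirmed

    def __post_init__(self) -> None:
        if not self.owner.strip():
            raise ValueError("action item needs an owner")
        if not self.verification.strip():
            raise ValueError("action item needs a verification method")

    def is_overdue(self, today: date, done: bool = False) -> bool:
        """An open item is overdue once its deadline has passed."""
        return not done and today > self.deadline
```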

Key principle: A post-mortem without action items is a storytelling session. Action items without deadlines are wishes. Deadlines without owners are fiction.