# Platform Reliability Model

## Executive Summary
- Reliability for a data platform is not uptime. It is data freshness, pipeline success, query availability, and recovery speed -- measured against explicit SLOs.
- Every data product must have a defined SLA. The platform must have SLOs that ensure those SLAs are achievable.
- Incident classification must distinguish between platform-wide failures, critical data product issues, and non-critical operational noise.
- Recovery is deterministic when source data is immutable. Bronze-layer preservation makes reprocessing a pattern, not a prayer.
- Post-incident process is not optional. Every P1 and P2 incident gets a blameless post-mortem with root cause classification, action items, and runbook updates.
```mermaid
graph TD
    DET[Detection<br/>Monitoring Alert / Consumer Report] --> CLS{Classify}
    CLS --> |P1: Platform-wide| P1[Immediate Response<br/>15 min SLA]
    CLS --> |P2: Critical product| P2[Urgent Response<br/>30 min SLA]
    CLS --> |P3: Non-critical| P3[Standard Response<br/>2 hour SLA]
    CLS --> |P4: Minor| P4[Planned Response<br/>Next sprint]
    P1 --> RES[Resolve]
    P2 --> RES
    P3 --> RES
    P4 --> RES
    RES --> PM[Post-Mortem<br/>within 48 hours]
    PM --> RB[Update Runbooks]
    RB --> DET
```

## Platform SLOs
SLOs define what "reliable" means. Without them, reliability is a feeling, not a measurement.
| SLO | Target | Measurement |
|---|---|---|
| Pipeline success rate | > 99.5% | Successful runs / total scheduled runs per day |
| Data freshness | Within SLA per product | Time since last successful refresh vs product-level SLA |
| Query availability | > 99.5% | Successful queries / total queries per hour |
| Recovery time (P1) | < 2 hours | Time from detection to resolution for platform-wide incidents |
| Recovery time (P2) | < 4 hours | Time from detection to resolution for critical product incidents |
Key principle: SLOs are not aspirations. They are commitments with error budgets. When the error budget is consumed, the team stops feature work and focuses on reliability. If leadership does not enforce this, SLOs are decorative.
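As a rough illustration, the sketch below shows how an error budget derived from the pipeline success SLO could be tracked; the 30-day window, run volume, and function names are assumptions for the example, not platform specifics.

```python
SLO_TARGET = 0.995  # pipeline success rate target from the SLO table

def error_budget_remaining(total_runs: int, failed_runs: int, target: float = SLO_TARGET) -> int:
    """How many more failures the current window can absorb before the SLO is breached."""
    allowed_failures = int(total_runs * (1 - target))
    return allowed_failures - failed_runs

# Illustrative example: 200 scheduled runs per day over a 30-day window, 18 failures so far.
remaining = error_budget_remaining(total_runs=200 * 30, failed_runs=18)
print(f"Failures remaining in budget: {remaining}")  # 30 allowed, 18 used -> 12 left
if remaining <= 0:
    print("Error budget consumed: stop feature work, focus on reliability.")
```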
### SLO Measurement Rules
- Pipeline success rate excludes intentionally disabled pipelines. Retries are counted but do not erase failures: a run that fails twice and then succeeds on the third attempt counts as one failure and one success, because the scheduled run still failed before it recovered.
- Data freshness is measured continuously, not at a point in time. A product that is within SLA for 23 hours and stale for 1 hour has a freshness breach.
- Query availability counts only platform-caused failures. A malformed user query that returns an error is not an availability failure.
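A minimal sketch of these measurement rules, assuming an illustrative run-record shape (pipeline status, disabled flag, per-attempt outcomes) rather than any actual platform schema:

```python
from datetime import datetime, timedelta

def pipeline_success_rate(runs: list[dict]) -> float:
    """Successful runs / total counted runs, excluding intentionally disabled pipelines.
    Retries do not erase failures: a run whose attempts are
    ["failed", "failed", "success"] contributes one failure and one success."""
    successes = failures = 0
    for run in runs:
        if run["disabled"]:
            continue  # intentionally disabled pipelines are excluded
        attempts = run["attempt_statuses"]
        if "failed" in attempts:
            failures += 1
        if attempts and attempts[-1] == "success":
            successes += 1
    total = successes + failures
    return successes / total if total else 1.0

def freshness_breached(last_refresh: datetime, freshness_sla: timedelta, now: datetime) -> bool:
    """Freshness is evaluated continuously; any moment past the SLA is a breach."""
    return (now - last_refresh) > freshness_sla
```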
## Incident Classification
| Priority | Definition | Example | Response Time | Resolution Time |
|---|---|---|---|---|
| P1 | Platform-wide failure or regulatory data unavailable | All pipelines down, audit data inaccessible, query engine offline | 15 minutes | 2 hours |
| P2 | Critical data product stale or quality breach | Customer 360 more than 4 hours stale, regulatory report data quality below threshold | 30 minutes | 4 hours |
| P3 | Non-critical pipeline failure | One domain's daily refresh failed, single non-critical product stale | 2 hours | Next business day |
| P4 | Minor issue, workaround available | Slow query performance, non-blocking metadata sync delay | 4 hours | Planned sprint |
### Escalation Rules
- P1 incidents trigger immediate page to on-call platform engineer and engineering manager. If no acknowledgment in 15 minutes, escalate to platform lead.
- P2 incidents trigger page to on-call platform engineer. If no acknowledgment in 30 minutes, escalate to engineering manager.
- P3 and P4 incidents are handled during business hours via the standard ticket queue.
- Any incident that is not resolved within its resolution window is automatically escalated one priority level.
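A minimal sketch of the automatic escalation rule, using the resolution windows from the classification table; the incident fields and the business-day approximation are illustrative assumptions:

```python
from datetime import datetime, timedelta

RESOLUTION_WINDOWS = {
    "P1": timedelta(hours=2),
    "P2": timedelta(hours=4),
    "P3": timedelta(days=1),   # "next business day", approximated as 24 hours
    "P4": None,                # planned work, no automatic escalation window
}

def escalate_if_overdue(priority: str, opened_at: datetime, resolved: bool, now: datetime) -> str:
    """Return the (possibly escalated) priority for an incident past its resolution window."""
    window = RESOLUTION_WINDOWS[priority]
    if resolved or window is None or now - opened_at <= window:
        return priority
    order = ["P4", "P3", "P2", "P1"]
    idx = order.index(priority)
    return order[min(idx + 1, len(order) - 1)]  # P1 cannot escalate further
```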
## Recovery Patterns
Recovery is not improvised. Each failure mode maps to a known recovery pattern.
### Reprocessing
Replay from bronze. Source data in the landing zone is immutable, so recovery is deterministic -- the same input produces the same output. This is why bronze-layer immutability is a non-negotiable architectural principle.
| Consideration | Detail |
|---|---|
| When to use | Transformation logic was incorrect, silver/gold data is corrupted, pipeline produced wrong output |
| Prerequisite | Bronze data is intact and immutable |
| Impact | Downstream products are temporarily stale during reprocessing |
| Validation | Compare reprocessed output against known-good state or business rules |
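A minimal sketch of the pattern, assuming a Delta Lake bronze/silver layout in PySpark; the paths, partition column, dates, and the transformation itself are illustrative assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("reprocess-orders").getOrCreate()

# 1. Read the affected window from the immutable bronze layer (the deterministic input).
bronze = (
    spark.read.format("delta")
    .load("/lake/bronze/orders")
    .where("ingest_date BETWEEN '2024-06-01' AND '2024-06-07'")
)

# 2. Re-run the corrected transformation logic.
silver = bronze.dropDuplicates(["order_id"]).withColumnRenamed("amt", "amount")

# 3. Validate against business rules before publishing.
assert silver.where("amount < 0").count() == 0, "reprocessed output still violates business rules"

# 4. Overwrite only the affected partitions of the silver table.
(
    silver.write.format("delta")
    .mode("overwrite")
    .option("replaceWhere", "ingest_date BETWEEN '2024-06-01' AND '2024-06-07'")
    .save("/lake/silver/orders")
)
```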
### Backfill
Re-ingest from source for a specific time window. Used when bronze data itself is missing or corrupted, or when a new source field must be historically populated.
| Consideration | Detail |
|---|---|
| When to use | Bronze data is missing, source schema changed and history must be re-extracted |
| Prerequisite | Source system supports historical extraction for the required time window |
| Impact | Source system load increases during backfill; coordinate with source team |
| Validation | Row count reconciliation and checksum comparison against source |
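A minimal sketch of a backfill, assuming the source exposes a JDBC endpoint and the window is currently missing from bronze; connection details, table names, and dates are illustrative assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("backfill-payments").getOrCreate()
window = "payment_date BETWEEN '2024-05-01' AND '2024-05-31'"

# 1. Historical extraction from the source system for the missing window only.
source_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://source-db:5432/payments")
    .option("query", f"SELECT * FROM payments WHERE {window}")
    .load()
)

# 2. Land the window into bronze as append-only, immutable data.
source_df.write.format("delta").mode("append").save("/lake/bronze/payments")

# 3. Validate: row count reconciliation against the source for the same window.
bronze_count = spark.read.format("delta").load("/lake/bronze/payments").where(window).count()
assert bronze_count == source_df.count(), "row counts do not reconcile for the backfilled window"
```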
### Rollback
Revert to a previous version of a data product -- both schema and data. This is table-level time travel, not pipeline rollback.
| Consideration | Detail |
|---|---|
| When to use | A bad deployment corrupted a data product, consumers need immediate restoration |
| Prerequisite | Table format supports time travel (Delta Lake, Iceberg) with sufficient retention |
| Impact | Consumers see previous version immediately; reprocessing can happen in parallel |
| Validation | Confirm rolled-back version matches expected state, notify consumers |
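A minimal sketch assuming Delta Lake time travel (Iceberg offers an equivalent snapshot rollback); the table path and target version are illustrative assumptions:

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rollback-customer360").getOrCreate()
table = DeltaTable.forPath(spark, "/lake/gold/customer_360")

# 1. Inspect recent versions to find the last known-good state before the bad deployment.
table.history(10).select("version", "timestamp", "operation").show()

# 2. Restore the table -- schema and data -- to that version. Consumers see it immediately.
table.restoreToVersion(42)

# 3. Validate the restored state and notify consumers; reprocessing can proceed in parallel.
restored = spark.read.format("delta").load("/lake/gold/customer_360")
print(f"Restored row count: {restored.count()}")
```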
### Failover
Switch to a disaster recovery region or replica. Used for infrastructure-level failures, not data quality issues.
| Consideration | Detail |
|---|---|
| When to use | Primary region is unavailable, infrastructure failure, cloud provider incident |
| Prerequisite | DR region is provisioned, data replication is current, DNS/routing can be switched |
| Impact | RPO (recovery point objective) determines data loss; RTO (recovery time objective) determines downtime |
| Validation | Confirm DR environment serves current data, test consumer connectivity |
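A minimal sketch of the failover decision, assuming illustrative RPO/RTO values and a replication timestamp supplied by monitoring:

```python
from datetime import datetime, timedelta, timezone

RPO = timedelta(minutes=15)   # maximum tolerable data loss (illustrative)
RTO = timedelta(hours=1)      # maximum tolerable downtime (illustrative)

def dr_failover_allowed(last_replicated_at: datetime, now: datetime) -> bool:
    """True if switching to the DR replica would lose no more data than the RPO allows."""
    return (now - last_replicated_at) <= RPO

now = datetime.now(timezone.utc)
if dr_failover_allowed(last_replicated_at=now - timedelta(minutes=7), now=now):
    print("Replication lag within RPO: switch DNS/routing to the DR region.")
else:
    print("Replication lag exceeds RPO: failover would lose data beyond tolerance; escalate.")
```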
## Dependency Management
A data platform does not exist in isolation. It depends on upstream source systems and serves downstream consumers. Both directions must be mapped and managed.
### Upstream Dependencies
| Dependency | What to Track | Failure Behavior |
|---|---|---|
| Source systems (ERP, CRM, core banking) | Availability, schema version, data freshness | Retry with exponential backoff, alert after N failures |
| Event backbone (Kafka, Pub/Sub) | Consumer lag, partition health, throughput | Buffer locally if possible, alert on lag threshold breach |
| Third-party data vendors | Delivery schedule, file format, data quality | Hold processing, alert data steward, serve stale data with flag |
| Identity and access management | Authentication availability | Cache tokens, fail open for reads with audit, fail closed for writes |
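A minimal sketch of the retry-with-exponential-backoff behavior for source extraction; the attempt limits, delay values, and alerting mechanism are illustrative assumptions:

```python
import random
import time

def extract_with_backoff(extract, max_attempts: int = 5, base_delay: float = 2.0):
    """Call an extraction function, retrying transient failures with exponential backoff
    and alerting after N consecutive failures."""
    for attempt in range(1, max_attempts + 1):
        try:
            return extract()
        except ConnectionError as exc:
            if attempt == max_attempts:
                print(f"ALERT: source unavailable after {max_attempts} attempts: {exc}")
                raise
            # Exponential backoff with jitter: 2s, 4s, 8s, ... plus up to 1s of noise.
            time.sleep(base_delay * 2 ** (attempt - 1) + random.random())
```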
### Downstream Dependencies
| Dependency | What to Track | Failure Behavior |
|---|---|---|
| Data products and consumers | Consumer count, query patterns, SLA commitments | Notify consumers of staleness, serve stale data with metadata flag |
| ML model training pipelines | Feature freshness, training schedule | Delay training, do not serve stale features without explicit acknowledgment |
| Regulatory reporting | Report deadlines, data quality thresholds | Escalate immediately to P1 if regulatory deadline is at risk |
| Operational serving stores | Replication lag, consistency | Alert on lag, switch to direct source if replication fails |
### Circuit Breaker Pattern
Stop processing when upstream quality drops below an acceptable threshold. This prevents bad data from propagating through the platform.
- Trigger: Upstream data fails quality checks (null rate spike, volume anomaly, schema drift) beyond a configured threshold.
- Action: Halt downstream processing for that source, serve last-known-good data, alert data steward and source team.
- Reset: Manual or automatic after upstream quality is restored and validated. Never auto-reset without quality validation.
- Key principle: It is better to serve stale data with a freshness warning than to serve wrong data silently. The circuit breaker enforces this.
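A minimal sketch of such a circuit breaker; the thresholds and check names are illustrative assumptions, not platform defaults:

```python
class SourceCircuitBreaker:
    """Halt processing for a source when upstream quality drops below thresholds."""

    def __init__(self, null_rate_threshold: float = 0.05, volume_drop_threshold: float = 0.5):
        self.null_rate_threshold = null_rate_threshold
        self.volume_drop_threshold = volume_drop_threshold
        self.open = False  # open breaker = downstream processing halted

    def evaluate(self, null_rate: float, volume_ratio: float, schema_drifted: bool) -> None:
        """Trip the breaker on a null-rate spike, volume anomaly, or schema drift."""
        if (null_rate > self.null_rate_threshold
                or volume_ratio < self.volume_drop_threshold
                or schema_drifted):
            self.open = True
            print("ALERT: circuit open -- halt processing, serve last-known-good data, page steward.")

    def reset(self, quality_validated: bool) -> None:
        """Never auto-reset without validated upstream quality."""
        if quality_validated:
            self.open = False

breaker = SourceCircuitBreaker()
breaker.evaluate(null_rate=0.12, volume_ratio=0.9, schema_drifted=False)  # trips on the null-rate spike
if not breaker.open:
    print("process new data")  # skipped while open; stale data is served with a freshness flag
```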
## Post-Incident Process
Every P1 and P2 incident triggers a structured post-incident process. This is not optional, and it is not blame-assignment. It is organizational learning.
### Blameless Post-Mortem
Conducted within 48 hours of incident resolution. Attendance includes the responders, the platform lead, and affected consumers.
| Section | Content |
|---|---|
| Timeline | Minute-by-minute account: detection, response, diagnosis, resolution, verification |
| Root cause | The actual technical cause, not "human error." If a human made a mistake, ask what system allowed that mistake to have impact |
| Contributing factors | What made detection slow, response difficult, or resolution complex |
| Impact | Data products affected, consumers impacted, duration of impact, regulatory implications |
### Root Cause Classification
Every incident root cause is classified into one of four categories. This enables trend analysis across incidents.
| Category | Example |
|---|---|
| Source | Upstream schema change without notice, source system outage, data quality degradation at source |
| Platform | Infrastructure failure, capacity exhaustion, configuration drift, deployment error |
| Transformation | Logic bug in pipeline, incorrect join, failed schema evolution handling |
| Consumer | Consumer query overloading the platform, consumer not respecting rate limits |
### Action Items
Every post-mortem produces action items. Every action item has an owner, a deadline, and a verification method.
- Runbook update: If this failure mode was not in the runbook, add it. If the runbook was wrong, fix it.
- Monitoring gap: If detection was slow, add the missing alert or dashboard.
- Architectural fix: If the failure was structural, schedule the fix with a deadline -- do not leave it as tech debt without a timeline.
- Process change: If the failure was procedural, update the process and communicate the change.
Key principle: A post-mortem without action items is a storytelling session. Action items without deadlines are wishes. Deadlines without owners are fiction.