Data Vault 2.0 in Banking: Architecture for the Audit That Hasn't Happened Yet
In banking, the question is never whether you'll be audited. It's when. And when that audit arrives — regulatory examination, internal model validation, compliance review — the data platform needs to answer a specific set of questions: What data did we have? When did we have it? Where did it come from? Has it been altered?
Most data warehouse architectures can't answer these questions cleanly. They weren't designed to. Kimball star schemas optimize for query performance. Data lakes optimize for flexibility. Neither was designed for the immutable, auditable, source-faithful record that regulated environments demand.
Data Vault 2.0 was.
The Core Idea
Data Vault separates the data warehouse into three concerns, each with distinct rules:
Hubs store business keys — the unique identifiers that define business entities. A customer ID. An account number. A transaction reference. Hubs are insert-only. Once a business key is recorded, it's never updated or deleted.
Links store relationships between business keys. Customer-to-account. Transaction-to-account. Links are also insert-only. A relationship, once recorded, is a permanent fact.
Satellites store descriptive attributes and their change history. Every time a customer's address changes, a new satellite record is inserted and the old record remains untouched. Change detection uses hash differentials: if the hash of the incoming attributes differs from the hash stored on the most recent satellite record for the same key, a new record is created.
Every record in every table carries two mandatory metadata fields: a load timestamp and a record source. This means any record in the warehouse can be traced back to exactly when it arrived and which source system it came from.
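To make the three structures concrete, here is a minimal sketch in Python. The class and field names (HubRow, hash_key, load_ts, and so on) are illustrative assumptions, not a standard naming scheme; only the presence of a load timestamp and a record source on every row comes from the pattern itself.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timezone


# frozen=True mirrors the insert-only rule: a row cannot be changed once created.
@dataclass(frozen=True)
class HubRow:
    hash_key: str        # surrogate key derived from the business key
    business_key: str    # e.g. a customer ID or account number
    load_ts: datetime    # when the row arrived in the warehouse
    record_source: str   # which source system delivered it


@dataclass(frozen=True)
class LinkRow:
    hash_key: str                    # key for the relationship itself
    hub_hash_keys: tuple[str, ...]   # the hubs being related
    load_ts: datetime
    record_source: str


@dataclass(frozen=True)
class SatelliteRow:
    parent_hash_key: str                     # hub (or link) this row describes
    hashdiff: str                            # hash over the descriptive attributes
    attributes: tuple[tuple[str, str], ...]  # (column name, value) pairs
    load_ts: datetime
    record_source: str


# Hypothetical example row; the key and source name are placeholders.
customer = HubRow(
    hash_key="placeholder-hash",
    business_key="CUST-001",
    load_ts=datetime(2024, 1, 10, tzinfo=timezone.utc),
    record_source="CORE_BANKING",
)
```

Attempting to assign to a field on one of these rows raises an error, which is a small-scale version of the guarantee the warehouse tables are meant to give.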
Why This Matters in Banking
BCBS 239: Data Aggregation and Reporting
BCBS 239 requires banks to aggregate risk data accurately, completely, and in a timely manner. The principles demand full traceability of data from source to report. Data Vault's insert-only pattern and mandatory source tracking directly support Principle 3 (accuracy and integrity), Principle 4 (completeness), and Principle 6 (adaptability).
When a regulator asks "show me the data that backed this risk report on March 15th," a Data Vault platform can reconstruct the exact state of every entity as it existed at that point in time. A traditional warehouse that updates in place cannot.
Point-in-Time Reconstruction
Because satellites are insert-only with load timestamps, reconstructing the state of any entity at any historical moment is a straightforward temporal query. This isn't just useful for audits — it's essential for model validation, where regulators may ask to see the exact data a model was trained on, or the exact inputs it received at a specific decision point.
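The temporal query can be sketched in a few lines of Python. This is an illustration over in-memory rows, not warehouse SQL; the dictionary keys (parent_hash_key, load_ts, attributes) and the sample values are assumptions made for the example.

```python
from datetime import datetime


def state_as_of(satellite_rows, parent_hash_key, as_of):
    """Return the attributes of the latest row loaded at or before `as_of`.

    Returns None when the warehouse held nothing for that key at that time.
    """
    candidates = [
        row for row in satellite_rows
        if row["parent_hash_key"] == parent_hash_key and row["load_ts"] <= as_of
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda row: row["load_ts"])["attributes"]


# Hypothetical satellite history for one customer.
customer_sat = [
    {"parent_hash_key": "cust-1", "load_ts": datetime(2024, 1, 10),
     "attributes": {"city": "Leeds"}},
    {"parent_hash_key": "cust-1", "load_ts": datetime(2024, 4, 2),
     "attributes": {"city": "York"}},
]

march_view = state_as_of(customer_sat, "cust-1", datetime(2024, 3, 15))
before_first_load = state_as_of(customer_sat, "cust-1", datetime(2024, 1, 1))
```

Here march_view is the Leeds record and before_first_load is None. Note that the load timestamp records when the warehouse received the data, so this answers "what did we know on that date," which is the question an examiner is usually asking.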
Source System Independence
Banks operate dozens of source systems, many of which are legacy, poorly documented, or scheduled for migration. Data Vault's separation of business keys (hubs) from descriptive attributes (satellites) means source system changes don't break the core model. A new source system for customer data typically lands in its own satellite attached to the existing hub, so the hub and its history remain untouched.
Implementation Patterns
Hash Keys Over Sequences
Data Vault 2.0 uses hash-based surrogate keys instead of database sequences. The hash is computed from the business key using a deterministic function — typically SHA-256 of the uppercased, null-handled, concatenated business key columns.
This has a practical advantage: because the hash is a pure function of the business key, it can be computed in the staging layer and comes out identical regardless of load order or parallelism. Two independent loads of the same customer produce the same hash key. This enables parallel loading without coordination, a critical requirement for enterprise-scale platforms.
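A minimal sketch of such a hash key function, using Python's standard hashlib. The delimiter, the null token, and the whitespace trimming are assumptions chosen for the example; what matters is that every loader applies the same rules.

```python
import hashlib


def hash_key(*business_key_parts, delimiter="||", null_token="^^"):
    """Deterministic SHA-256 hash key over one or more business key columns."""
    normalised = [
        # Nulls get an explicit token so they never collide with empty strings.
        null_token if part is None else str(part).strip().upper()
        for part in business_key_parts
    ]
    # The delimiter keeps ("AB", "C") distinct from ("A", "BC").
    joined = delimiter.join(normalised)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# Two loaders that see the same customer with different casing and padding
# arrive at the same key without talking to each other.
key_from_loader_a = hash_key("cust-001")
key_from_loader_b = hash_key("  CUST-001 ")
```

The delimiter and null token look like minor details, but they are what prevents two different composite keys from concatenating to the same string and therefore hashing to the same value.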
Hashdiff for Change Detection
Satellite change detection uses a hash differential — a hash computed over all descriptive attributes. On each load, the current hashdiff is compared to the most recent existing record. If they differ, a new record is inserted. If they match, the record is skipped. This is efficient, deterministic, and produces a clean change history without requiring column-by-column comparison.
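The load logic can be sketched as follows. This is an in-memory illustration under stated assumptions: the function names, the dictionary layout, and the choice to sort columns by name before hashing are all mine, not part of any particular tool.

```python
import hashlib
from datetime import datetime


def hashdiff(attributes, delimiter="||", null_token="^^"):
    """Hash over all descriptive attributes, in a fixed column order."""
    parts = [
        null_token if attributes[name] is None else str(attributes[name]).strip()
        for name in sorted(attributes)  # sorted so dict ordering cannot change the hash
    ]
    return hashlib.sha256(delimiter.join(parts).encode("utf-8")).hexdigest()


def load_satellite(satellite, parent_hash_key, attributes, load_ts, record_source):
    """Append a row only if the attributes differ from the latest row for this key."""
    new_diff = hashdiff(attributes)
    existing = [r for r in satellite if r["parent_hash_key"] == parent_hash_key]
    latest = max(existing, key=lambda r: r["load_ts"], default=None)
    if latest is not None and latest["hashdiff"] == new_diff:
        return False  # unchanged since the last load: skip
    satellite.append({
        "parent_hash_key": parent_hash_key,
        "hashdiff": new_diff,
        "attributes": dict(attributes),
        "load_ts": load_ts,
        "record_source": record_source,
    })
    return True


sat = []
first = load_satellite(sat, "cust-1", {"city": "Leeds"}, datetime(2024, 1, 10), "CRM")
repeat = load_satellite(sat, "cust-1", {"city": "Leeds"}, datetime(2024, 1, 11), "CRM")
moved = load_satellite(sat, "cust-1", {"city": "York"}, datetime(2024, 4, 2), "CRM")
```

After these three loads the satellite holds two rows: the repeat delivery was skipped. Comparing against the latest row only, rather than any earlier row, means a customer who moves to York and then back to Leeds is correctly recorded as three states.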
Ghost Records
With parallel, source-independent loading, a query can reach for a parent or satellite row that has not arrived yet, or that never will because the source delivered a null business key. A ghost record is a placeholder row for those cases. It uses a fixed, deterministic key (conventions vary: many implementations use an all-zero value of the hash key's width rather than a computed hash) and is recognizable by its record source of "SYSTEM." Ghost records keep equi-joins, particularly those from PIT tables to satellites, from silently dropping rows, while maintaining the principle that every relationship is recorded as received.
The dbt Implementation
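dbt makes Data Vault implementation pragmatic. Staging models handle cleansing, typing, and hash key generation. Raw vault models use incremental materialization with a merge strategy, written so that each run inserts new records and skips ones already loaded. Business vault models (PIT tables, bridges) use table materialization for performance. Mart models flatten the vault structure into business-consumable views.
The key insight is that dbt's incremental model maps naturally to Data Vault's insert-only pattern, with one caveat: dbt's merge strategy updates matched rows by default, so the model's SQL should filter out keys that already exist in the target (or use an append strategy with an anti-join) rather than rely on merge alone. For hubs and links the uniqueness grain is the hash key; for satellites it is the hash key plus the load timestamp. Done this way, loads are idempotent, a critical property for reliability in production.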
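The idempotency property is easy to demonstrate outside dbt. The sketch below is a plain-Python stand-in for an insert-only hub load, not dbt code; the function name and row layout are assumptions for illustration.

```python
from datetime import datetime


def load_hub(hub, staged_rows):
    """Insert staged rows whose hash key is not already present; never update."""
    existing = {row["hash_key"] for row in hub}
    inserted = 0
    for row in staged_rows:
        if row["hash_key"] not in existing:
            hub.append(row)
            existing.add(row["hash_key"])  # also de-duplicates within the batch
            inserted += 1
    return inserted


hub_customer = []
batch = [
    {"hash_key": "k1", "business_key": "CUST-001",
     "load_ts": datetime(2024, 1, 10), "record_source": "CRM"},
    {"hash_key": "k2", "business_key": "CUST-002",
     "load_ts": datetime(2024, 1, 10), "record_source": "CRM"},
    {"hash_key": "k1", "business_key": "CUST-001",
     "load_ts": datetime(2024, 1, 10), "record_source": "CRM"},
]

first_run = load_hub(hub_customer, batch)   # inserts the two distinct keys
second_run = load_hub(hub_customer, batch)  # re-running the same batch inserts nothing
```

Re-running a failed or duplicated job leaves the hub exactly as it was, and the original load timestamp of each key is preserved because existing rows are never touched.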
I've open-sourced a reference implementation: dbt-data-vault-starter covers hubs, links, satellites, PIT tables, bridges, and marts with a banking domain example.
When Data Vault Is Wrong
Data Vault adds modeling complexity. For small teams, non-regulated environments, or analytics-only use cases where historical auditability isn't a requirement, a simpler approach — even a well-structured star schema — may be more appropriate.
The question isn't whether Data Vault is the best pattern in general. It's whether your environment requires full history, source traceability, and audit defensibility by default. In banking, the answer is almost always yes.
For a production-grade implementation including Terraform, dbt, and compliance patterns mapped to BCBS 239 and DORA, see reference-data-platform-gcp.