Building Data Foundations While Everyone Chased Models
The industry conversation right now is dominated by models. GPT-3 is almost two years old and every enterprise wants to know how to use large language models. Startups are raising rounds on the premise of fine-tuning foundation models for vertical use cases. The discourse is about parameters, benchmarks, and which API to call.
Meanwhile, I am at a global bank, building a data platform. Not a model. Not an ML pipeline. A data platform. The thing that ingests data from source systems, stores it with full lineage, transforms it into something trustworthy, and makes it available to anyone who needs it — including, eventually, models.
This is not glamorous work. Nobody writes Twitter threads about data vault modeling or BCBS 239 compliance. But after a year of doing this, I am increasingly convinced that the data platform is the single most important prerequisite for enterprise AI — and the one most organizations skip.
The Model Is Not the Hard Part
Here is a pattern I have seen repeatedly: a data science team builds a promising model. It works in the notebook. The accuracy metrics are good. Leadership is excited. Then someone asks where the training data came from, and the answer is "I exported it from a dashboard" or "someone in operations sent me a CSV." The model is built on sand.
In a regulated enterprise, this is not just technically fragile — it is a compliance failure. When a regulator asks you to demonstrate the provenance of data that fed a model used in a business decision, "I got a CSV" is not an acceptable answer. You need to trace data from source system to model input, with every transformation documented, every quality check recorded, and every version preserved.
This is what a data platform provides. Without it, every AI initiative is an isolated experiment that cannot be trusted, audited, or scaled.
Data Vault 2.0: Why It Matters in Banking
The data modeling pattern we chose is Data Vault 2.0. It is not new — Dan Linstedt introduced Data Vault in the early 2000s and formalized Data Vault 2.0 around 2013 — but it is unusually well suited to the constraints of banking.
The core idea is a separation of concerns into three entity types:
Hubs store business keys — the unique identifiers for business entities. A customer ID. An account number. A trade reference. Once a business key is inserted, it is never updated or deleted.
Links store relationships between hubs. Customer-to-account. Account-to-transaction. Like hubs, links are insert-only. A relationship, once recorded, is permanent.
Satellites store descriptive attributes and their full change history. When a customer's address changes, a new satellite record is inserted — the old record stays. Change detection uses a hash differential: if the hash of the current payload differs from the previous load, a new record is created. If it matches, nothing happens.
Every record in every table carries a load timestamp and a record source. This means any fact in the warehouse can be traced to exactly when it arrived and which upstream system produced it.
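The satellite pattern is easier to see in miniature. Here is an illustrative Python sketch of the hash-differential load described above — in practice this runs as SQL in the warehouse, and the function and field names here are hypothetical:

```python
import hashlib
from datetime import datetime, timezone

def hashdiff(payload: dict) -> str:
    """Hash of the descriptive payload, used to detect change."""
    canonical = "|".join(str(payload[k]).strip().upper() for k in sorted(payload))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def load_satellite(satellite: list, hub_key: str, payload: dict, record_source: str) -> bool:
    """Insert-only satellite load: append a new record only when the payload
    hash differs from the most recent record for this hub key."""
    current = [r for r in satellite if r["hub_key"] == hub_key]
    latest = max(current, key=lambda r: r["load_ts"], default=None)
    new_hash = hashdiff(payload)
    if latest is not None and latest["hashdiff"] == new_hash:
        return False  # hash matches the previous load: nothing happens
    satellite.append({
        "hub_key": hub_key,
        "hashdiff": new_hash,
        "load_ts": datetime.now(timezone.utc),  # when the fact arrived
        "record_source": record_source,         # which system produced it
        **payload,
    })
    return True
```

Nothing is ever updated in place: an address change produces a second record, and the first one stays, stamped with its own load timestamp and record source.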
This sounds academic until you are sitting across from a regulator who asks: "Show me the customer data that backed this risk calculation on February 3rd." With Data Vault, that is a temporal query. With a traditional warehouse that updates in place, it is a prayer.
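The temporal query is conceptually simple: for each hub key, take the latest satellite record loaded on or before the as-of date. A minimal Python sketch of the idea — field names hypothetical, and the real thing is a SQL query over satellite tables:

```python
from datetime import date

def as_of(satellite, hub_key, as_of_date):
    """Return the satellite record that was current for hub_key on as_of_date:
    the latest record loaded on or before that date."""
    candidates = [r for r in satellite
                  if r["hub_key"] == hub_key and r["load_date"] <= as_of_date]
    return max(candidates, key=lambda r: r["load_date"], default=None)

# Illustrative insert-only history for one customer
history = [
    {"hub_key": "CUST-1", "load_date": date(2021, 11, 5), "address": "1 Main St"},
    {"hub_key": "CUST-1", "load_date": date(2022, 3, 10), "address": "2 Oak Ave"},
]

# What address backed a calculation on 2022-02-03? The 2021-11-05 record.
record = as_of(history, "CUST-1", date(2022, 2, 3))
```

Because old records are never overwritten, the reconstruction is deterministic: the answer for February 3rd is the same no matter when you ask.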
BCBS 239: The Regulation Nobody Talks About
BCBS 239 is a Basel Committee framework for risk data aggregation and reporting. It was published in 2013 as a response to the 2008 financial crisis, where banks discovered they could not aggregate their risk exposures quickly or accurately enough to understand what was happening to them.
The framework defines 14 principles, but four are particularly relevant to data platform architecture:
Accuracy and integrity (Principle 3). Data used in risk reporting must be accurate, reliable, and sourced from authoritative systems. This means your data platform needs to track provenance — where data came from, how it was transformed, and whether it passed quality checks.
Completeness (Principle 4). Risk data must cover all material exposures across the enterprise. This means your platform cannot be a departmental silo. It needs to integrate data from multiple source systems, multiple business lines, and multiple geographies.
Timeliness (Principle 5). Data must be available within agreed timeframes. Not "when the batch job finishes" but within committed SLAs. This means your platform needs operational reliability — monitoring, alerting, and incident management for data pipelines.
Adaptability (Principle 6). The data infrastructure must be flexible enough to accommodate new reporting requirements without major rework. This is where Data Vault shines — adding a new source system means adding new satellites to existing hubs, not restructuring the model.
Most data teams I talk to have never heard of BCBS 239. But if you are building a data platform in a bank, it is the specification your architecture needs to satisfy. It is not optional. Regulators check compliance.
The dbt + BigQuery Stack
For the transformation layer, we use dbt running on BigQuery. The choice was deliberate.
dbt forces a discipline that most data teams lack: every transformation is a SQL file in version control. Every model has tests — not_null, unique, accepted_values, relationships for referential integrity. Every dependency is explicit. Every run produces an audit log of what was executed, what passed, and what failed.
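dbt expresses these tests declaratively in YAML, but what they assert is plain set logic. As a rough illustration in Python (hypothetical names; a dbt test passes when the failing-rows query returns nothing):

```python
def not_null(rows, column):
    """dbt-style not_null test: return the failing rows.
    A passing test returns an empty list."""
    return [r for r in rows if r.get(column) is None]

def unique(rows, column):
    """dbt-style unique test: return values that appear more than once."""
    seen, failures = set(), set()
    for r in rows:
        value = r.get(column)
        if value is not None and value in seen:
            failures.add(value)
        seen.add(value)
    return sorted(failures)
```

The important property is that a test is a query over the data itself, run on every load — not a one-off check someone performed during development.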
In a Data Vault implementation, this translates directly:
- Staging models handle cleansing, typing, and hash key generation. A staging model takes raw source data and produces a clean, typed, hash-keyed record ready for vault loading.
- Raw vault models implement the insert-only pattern using dbt's incremental materialization. The merge strategy with a unique key on the hash ensures idempotent loads — run it twice, get the same result.
- Business vault models compute derived business logic — point-in-time (PIT) tables and bridge tables that make the vault queryable without complex temporal joins.
- Mart models flatten the vault into business-consumable views. This is the layer that analysts and downstream applications see.
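The idempotency claim for raw vault loads is worth making concrete. An illustrative Python sketch of an insert-only hub load keyed on a hash of the normalized business key — the real implementation is a dbt incremental model, and these names are hypothetical:

```python
import hashlib

def hash_key(business_key: str) -> str:
    """Deterministic surrogate key: SHA-256 of the normalized business key,
    so the same key hashes identically in every environment and every run."""
    return hashlib.sha256(business_key.strip().upper().encode("utf-8")).hexdigest()

def load_hub(hub: dict, business_keys, record_source: str) -> int:
    """Insert-only hub load. Hash keys already present are skipped, so
    re-running the same batch is a no-op: run it twice, get the same result."""
    inserted = 0
    for bk in business_keys:
        hk = hash_key(bk)
        if hk not in hub:
            hub[hk] = {"business_key": bk.strip().upper(),
                       "record_source": record_source}
            inserted += 1
    return inserted
```

Because the key is computed from the data rather than assigned by a sequence, parallel loaders and replayed batches converge on the same vault without coordination.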
BigQuery handles the compute. Its columnar storage, and the separation of storage from compute, means we can keep the full insert-only history — which grows fast — at a cost that stays manageable: storage is priced separately and cheaply, and the bulk of the spend follows the queries we run, not the history we retain.
Why "Good Data" Is the Hardest Problem
I used to think the hard problems in data engineering were technical — schema evolution, exactly-once delivery, query optimization. Those are real problems, but they are solvable problems with known solutions.
The hard problems are organizational:
Source system ownership is ambiguous. The system that produces the data often has no clear owner. Or the owner is a team that views the data as a byproduct of an operational process, not as an asset that downstream consumers depend on. When that system changes its schema without notice — and it will — there is no one to call.
Data quality is everyone's problem and nobody's job. Everyone agrees data quality matters. Nobody wants to own the alerting, triage, and remediation process. Data quality degrades silently until something breaks visibly — usually a report to a regulator.
Business definitions are not agreed upon. Ask three people in a bank what "customer" means and you will get four answers. Is it the legal entity? The relationship? The individual? The account holder? Until these definitions are codified in a data catalog and enforced in transformation logic, every downstream use case is built on ambiguity.
Data access is a political negotiation. In a regulated enterprise, data access is governed by classification, jurisdiction, and purpose. Getting access to the data you need for a legitimate use case can take weeks. Not because the process is wrong — data protection matters — but because the process is often manual, undocumented, and dependent on knowing the right person to ask.
None of these problems are solved by better models. They are solved by better data platforms, better governance, and better organizational clarity about who owns what.
Practical Lessons from the First Year
A year into building a data platform in a regulated environment, here is what I would tell someone starting the same work:
Start with the audit, not the dashboard. Design your platform as if the first consumer is a regulator, not a business analyst. If the platform can satisfy a regulatory audit — full lineage, point-in-time reconstruction, quality evidence — it can satisfy any downstream use case. The reverse is not true.
Hash keys are worth the upfront cost. Computing SHA-256 hashes in the staging layer adds complexity. But it enables parallel loading without coordination, deterministic key assignment across environments, and change detection without column-by-column comparison. The payoff compounds with every source system you add.
Test data, not just code. dbt's testing framework should be used aggressively. Not just schema tests (not_null, unique) but custom tests for business rules. If a regulation says customer records must have a valid LEI, write a test that checks for it. Make the test suite a living expression of your data contract.
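The LEI check is a good example of a business-rule test, because an LEI has real structure to validate: per ISO 17442 it is 20 uppercase alphanumerics ending in two check digits, verified with the ISO 7064 MOD 97-10 scheme (letters mapped A=10 through Z=35, the whole string as a number must equal 1 mod 97). A sketch of the failing-rows predicate in Python — in dbt this would be a custom test in SQL, and the helper names here are mine:

```python
import re

def valid_lei(lei: str) -> bool:
    """Structural check for an ISO 17442 LEI: 20 uppercase alphanumerics
    whose MOD 97-10 checksum (letters mapped A=10..Z=35) equals 1."""
    if not re.fullmatch(r"[A-Z0-9]{18}[0-9]{2}", lei):
        return False
    digits = "".join(str(int(c, 36)) for c in lei)  # expand letters to numbers
    return int(digits) % 97 == 1

def with_check(base18: str) -> str:
    """Append the two MOD 97-10 check digits to an 18-character base.
    Used here only to construct syntactically valid examples."""
    digits = "".join(str(int(c, 36)) for c in base18 + "00")
    return base18 + f"{98 - int(digits) % 97:02d}"
```

A structural check like this will not tell you the LEI is registered to the right entity — that needs a lookup against the GLEIF registry — but it catches truncation, transcription errors, and placeholder values before they reach a report.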
Invest in the staging layer. Most data platform problems originate in the staging layer — the boundary between the source system and the warehouse. Get the typing, cleansing, null handling, and hash generation right in staging, and the rest of the pipeline is mechanical. Get staging wrong, and errors propagate through every downstream model.
Document as you build. dbt's documentation generation is not a nice-to-have. It is the artifact that proves your platform is governable. When someone asks "what does this column mean?" the answer should be in the dbt docs, not in someone's head.
Accept that this is slow work. Building a data platform in a regulated environment is slow. Not because the technology is slow, but because the organizational coordination is slow. Source system onboarding requires negotiation. Data classification requires review. Schema changes require impact analysis. This is the nature of the work. If you are impatient with it, you will cut corners that create audit findings later.
The Connection to AI
Everything I have described — Data Vault, BCBS 239, dbt, lineage, quality testing — is foundational infrastructure. It is not AI. But it is what makes AI possible in a regulated enterprise.
A model needs training data with known provenance. The data platform provides that. A model needs feature inputs with documented quality. The data platform provides that. A model in production needs monitoring data to detect drift. The data platform provides that. A model under regulatory review needs full lineage from source to prediction. The data platform provides that.
The organizations that will succeed with enterprise AI are not the ones that move fastest to adopt the latest model architecture. They are the ones that build the data foundations first — accurately, completely, with full traceability — and then deploy models on top of infrastructure they can trust and defend.
That is the work. It is not exciting. But it is the work that matters.
Related
- Data Vault 2.0 in Banking: Architecture for the Audit That Hasn't Happened Yet
- Why We Chose dbt Over BigQuery Dataform
- Why Most Enterprise AI Programs Fail Before They Start
The industry will keep chasing models. Bigger models, faster models, models that write code and generate images. That is fine. Someone needs to build the part underneath — the part that makes enterprise data trustworthy enough for any of those models to be useful. That is the part I am building.