The Year LLMs Met Compliance — And Compliance Wasn't Ready

October 2023

In March, OpenAI released GPT-4. Within weeks, every business unit at the bank had the same question: how do we use this? By April, there were pilot proposals from wealth management, operations, compliance, legal, and technology. By June, I had lost count. The demand was genuine — GPT-4 is genuinely good — but the governance infrastructure designed to manage model risk was built for a different category of system entirely.

This is the year that large language models met enterprise compliance. And compliance was not ready.

GPT-4 as the Tipping Point

GPT-3 was impressive. ChatGPT was viral. But GPT-4 is the model that made enterprise adoption unavoidable.

The capabilities are real. GPT-4 passes the bar exam in the 90th percentile. It can read and reason about complex documents — contracts, regulatory filings, research reports — with a fluency that was simply not possible twelve months ago. It follows nuanced instructions. It handles multi-step reasoning. When you give it a compliance question with relevant context, the answer is often better than what a junior analyst would produce in twice the time.

This is not hype. I have tested it extensively on real banking use cases — summarizing credit memos, extracting terms from ISDA documents, drafting regulatory responses, classifying customer complaints. The quality is high enough that people who see the output immediately want to deploy it. And that is where the problems start.

Simultaneously, the open-source ecosystem has exploded. Meta released LLaMA in February, and the community fine-tuned it into dozens of variants within weeks. The Technology Innovation Institute released Falcon. Mistral AI, founded by former DeepMind and Meta researchers, released Mistral 7B in September — a model that outperforms LLaMA 2 13B on most benchmarks despite being half the size. For the first time, enterprises have a credible path to running capable language models on their own infrastructure.

The combination — a frontier model that demonstrates what is possible, and open-source models that make on-premise deployment feasible — has created more pressure to adopt LLMs than any technology I have seen in enterprise AI.

The Governance Gap

Every large financial institution has a model risk management framework. Most follow SR 11-7, the Federal Reserve's supervisory guidance on model risk management from 2011. It defines models, sets expectations for development, validation, and governance, and requires a risk-proportionate approach. It is well-designed for the models it was written for.

It was written for logistic regressions, gradient-boosted trees, and neural networks that produce numerical predictions from structured inputs. It was not written for systems that generate free text from natural language prompts, that have been trained on a significant fraction of the public internet, and whose internal representations are not interpretable by any existing method.

SR 11-7 was written for systems on the left. Every enterprise is now deploying systems on the right.

The governance gap is not theoretical. I have watched it play out in real time. A team builds a document summarization prototype using GPT-4's API. It works well in demos. They want to move to production. The model risk team asks: what is the model? They are told it is GPT-4. They ask: can we review the training data? No. Can we inspect the model weights? No. Can we run independent validation on the model itself? No — only on the application built on top of it. Can we guarantee that the same input produces the same output? No. Can we explain why a specific output was produced? Not with any precision.

The model risk team is not being obstructionist. They are applying their framework, and the framework does not have answers for these questions.

Why LLMs Break Traditional Model Validation

There are four specific properties of LLMs that break the assumptions embedded in traditional model risk frameworks:

Non-determinism. Classical models are deterministic: given the same input and model version, they produce the same output. This is fundamental to validation — you test the model on a holdout set, record the results, and those results are reproducible. LLMs sample from a probability distribution. Even with temperature set to zero, floating-point variations across hardware can produce different outputs. This means traditional backtesting and regression testing approaches do not work the same way. You cannot validate an LLM the way you validate a credit scoring model.

Emergent capabilities. LLMs exhibit behaviors that were not explicitly trained and cannot be predicted from the training objective. GPT-4 can do chain-of-thought reasoning, write code in languages that represent a tiny fraction of its training data, and solve logic puzzles. These capabilities emerge at scale and are not well-understood. For model validation, this is a problem: how do you validate a system whose capabilities are not fully catalogued? How do you test for failure modes in behaviors that were never specified?

Training data opacity. SR 11-7 expects that model developers can describe the data used to train their models. For LLMs trained on internet-scale corpora, this is not feasible. OpenAI has not published the training data for GPT-4. Even for open-source models like LLaMA, the training data (a mix of CommonCrawl, Wikipedia, GitHub, ArXiv, and other sources) is described at a high level but cannot be audited at the record level. If the training data contains biased, incorrect, or copyrighted content, the model may reproduce it — and neither the developer nor the validator can trace the source.

The hallucination problem. LLMs produce fluent, confident text that is sometimes factually wrong. This is not a bug that will be fixed with more data or better training. It is an inherent property of how autoregressive language models work — they predict the most likely next token, not the most truthful one. In regulated contexts, hallucination is not an inconvenience. A system that confidently cites a regulation that does not exist, or summarizes a contract clause that was never in the document, is a compliance event. Traditional model risk metrics — accuracy, precision, recall — are necessary but not sufficient. You need a separate evaluation dimension for faithfulness: does the output stay grounded in the provided context, or does the model add information that is not there?

Open-Source LLMs and the Enterprise Path

The open-source LLM ecosystem has changed the conversation about enterprise adoption in ways that were not obvious six months ago.

When the only option was the OpenAI API, enterprise adoption faced a hard wall: you cannot send customer data, financial records, or proprietary documents to a third-party API in a regulated environment. The data residency problem alone killed most use cases.

LLaMA changed this. Meta released the weights in February, and within weeks the community had fine-tuned versions running on consumer hardware. Falcon 40B from the Technology Innovation Institute offered a permissively licensed alternative. Then in September, Mistral 7B demonstrated that a well-trained small model could compete with models several times its size.

For enterprise, this matters for three reasons:

On-premise deployment. You can run Mistral 7B or LLaMA 2 13B on a single GPU within your own data center. No data leaves your network. No third-party API. The data residency problem is solved. This alone unblocks the majority of use cases that were previously impossible.

Fine-tuning on domain data. With open weights, you can fine-tune on your own data. A LLaMA 2 model fine-tuned on internal credit memos produces better summaries of credit memos than GPT-4 with few-shot prompting — because it has learned the specific vocabulary, structure, and conventions of your institution. Fine-tuning also reduces hallucination for domain-specific tasks, because the model learns to stay within the distribution of your data.

Inspectable, if not fully explainable. Open weights do not make LLMs interpretable in the way a logistic regression is interpretable. But they do allow your model risk team to do things that are impossible with a proprietary API: run the model locally, probe its behavior systematically, analyze attention patterns, and ensure the model version does not change without your knowledge. This is not full transparency, but it is a significant improvement over a black-box API endpoint.

What a GenAI Governance Framework Needs

The answer is not to abandon existing model risk frameworks. SR 11-7's principles — risk-proportionate governance, independent validation, ongoing monitoring — are correct. But the framework needs extensions that address the specific properties of LLMs.

Based on what I have seen this year trying to govern LLM deployments in practice, here is what is missing:

A new risk classification dimension for LLMs. Your existing risk matrix probably scores models on data sensitivity, decision autonomy, and financial impact. It needs additional dimensions: hallucination severity (what happens if the output is fabricated?), prompt injection susceptibility (can users manipulate the system's behavior?), and training data opacity (can you audit the data the model was trained on?). These three dimensions change the risk profile of an LLM application significantly compared to a traditional ML model.

A hallucination policy. Your framework has accuracy thresholds. It does not have hallucination thresholds — because logistic regressions do not hallucinate. A hallucination policy defines: what hallucination rate is acceptable for each risk tier, how you measure it (faithfulness evaluation against source context, not just general accuracy), what mitigations are mandatory (grounding instructions, source attribution, the ability to say "I don't know"), and when human review is required before output reaches a consumer.

Validation methods that work for non-deterministic systems. You cannot validate an LLM the way you validate a credit model. You need evaluation suites — large sets of domain-specific test cases with expected outputs — and you need to run them repeatedly to characterize the distribution of outputs, not just a single point estimate. You need adversarial testing: prompt injection attacks, boundary probes, questions designed to elicit hallucination. And you need human evaluation, because automated metrics for open-ended text generation are still unreliable.

Prompt governance. Prompts are the new code. The system prompt determines the behavior of an LLM application as much as the model weights do. If prompts are written ad-hoc, stored in application code, and changed without review, you have ungoverned model configuration. Prompts should be version-controlled, peer-reviewed, tested against evaluation suites, and tracked in an audit trail.

Third-party model risk management. If you use GPT-4 through an API, OpenAI controls the model. They can update it, change its behavior, or deprecate the version you validated against — and you may not be notified. Your framework needs continuous evaluation (run your test suite against the API regularly to detect behavioral changes), contractual protections (DPA, data retention terms, model change notification), and a validated fallback model from a different provider for critical use cases.

Audit trails for prompts and responses. Every interaction with an LLM in a regulated context should be logged: the system prompt, the user input, the retrieved context (for RAG systems), the model version, the raw model output, any guardrail actions, and the final output delivered. When a regulator asks what your AI system told a customer on a specific date, you need to reconstruct the full interaction from logs.

What I Have Seen in Practice

Seven months of watching an enterprise try to adopt LLMs has taught me several things that are not in any vendor whitepaper:

The governance bottleneck is real and immediate. Every team that builds an LLM prototype wants to move to production. The model risk team, which is staffed and tooled for classical ML validation, is now asked to validate systems they do not have frameworks for. The queue grows. Frustration builds on both sides. The risk is not that governance blocks adoption — it is that teams route around governance because it is too slow, deploying LLM applications as "tools" or "utilities" that technically fall outside the model risk framework.

RAG is the most viable enterprise pattern right now. Retrieval-augmented generation — where the LLM answers questions based on retrieved documents rather than its training data — is the architecture that most effectively addresses both the hallucination problem and the data privacy problem. The LLM becomes a reasoning engine over your documents, not a knowledge base. Hallucination rates drop significantly when the model is instructed to only use the provided context. And because the documents are your own, you control what information the model has access to.

The people problem is bigger than the technology problem. Most compliance teams have never interacted with a language model. Most model risk analysts have backgrounds in statistics and quantitative finance, not NLP. Most business users do not understand the difference between a model that knows things and a model that generates plausible text. Training everyone — risk teams, compliance officers, business sponsors, end users — on what LLMs actually are and are not is the prerequisite for everything else.

Start with internal use cases. The lowest-risk, highest-value path is LLMs for internal productivity: summarizing internal documents, drafting first versions of routine reports, answering questions about internal policies. The data sensitivity is lower. The hallucination tolerance is higher (because the user is expected to review). And the regulatory exposure is minimal compared to customer-facing applications. Build governance maturity on internal use cases before attempting customer-facing deployment.

2023 is the year that LLMs became capable enough to be useful in enterprise and available enough — through open-source models — to be deployable. But the governance infrastructure lags behind the technology by at least a year. The organizations that will succeed with enterprise LLMs are not the ones that adopt fastest. They are the ones that extend their risk frameworks now, build evaluation capabilities, train their risk teams, and create the governance structures that allow adoption to be durable rather than reckless.

The capabilities are here. The compliance frameworks are not. That gap is the most important problem in enterprise AI right now.