RAG in Production: What Breaks When You Move Past the Tutorial
Every RAG tutorial follows the same arc. Load some PDFs. Split them into chunks. Embed the chunks with OpenAI. Store them in a vector database. At query time, retrieve the top-k similar chunks, stuff them into a prompt, and call the LLM. It works. The demo is impressive. Stakeholders are excited.
Then you try to put it into production. And everything that was easy in the tutorial becomes hard in ways the tutorial never mentioned.
I've been building RAG systems in a banking environment for the past year. Not prototypes — production systems that handle regulatory documents, internal policies, and client-facing knowledge bases where a wrong answer has real consequences. What follows is what I've learned about where RAG breaks and what actually works.
The Gap Between Demo and Production
Tutorial RAG works because the conditions are controlled. The corpus is small and clean. The questions are the kind that have obvious answers sitting in a single paragraph. There's no ambiguity, no multi-document reasoning, no adversarial inputs, and nobody checking whether the answer is actually correct.
Production RAG breaks these assumptions in every dimension:
- The corpus is messy. Documents have headers, footers, tables, nested lists, cross-references, and version histories. PDFs are the worst — the same visual layout can produce wildly different text depending on the extraction tool.
- The questions are ambiguous. Users don't ask "What is the maximum LTV ratio for commercial real estate?" — they ask "what's the limit for CRE?" and expect the system to resolve the acronym, find the right policy, and handle the fact that the limit differs by jurisdiction.
- Wrong answers have consequences. In a regulated environment, a hallucinated policy citation isn't a demo glitch. It's a compliance incident.
- Scale changes everything. A vector store with 10,000 chunks behaves differently from one with 10 million. Retrieval quality degrades. Latency increases. The naive "embed and search" approach stops being sufficient.
Tutorial RAG stops at generation. Production RAG adds hybrid search, re-ranking, guardrails, and evaluation.
Chunking Matters More Than Your Embedding Model
Teams spend weeks evaluating embedding models — OpenAI text-embedding-ada-002 vs. Cohere embed-v3 vs. open-source alternatives like e5-large or bge-large. This matters, but not as much as they think. The bigger lever is chunking.
Fixed-size chunking (split every 512 tokens with 50-token overlap) is the tutorial default. It's fast and predictable. It's also terrible for documents with structure. A fixed-size chunk can split a regulation mid-sentence, separate a table from its header, or merge the conclusion of one section with the introduction of the next. The embedding captures the semantics of this jumbled text faithfully — the problem is the text itself is semantically incoherent.
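For reference, the tutorial default really is just a few lines. A sketch, approximating tokens with whitespace-separated words (a real implementation would use the tokenizer of your embedding model):

```python
def fixed_size_chunks(text, size=512, overlap=50):
    """Split text into fixed-size windows with overlap.
    Tokens are approximated by whitespace words in this sketch;
    assumes size > overlap."""
    tokens = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + size]
        if window:
            chunks.append(" ".join(window))
        if start + size >= len(tokens):
            break  # avoid a trailing chunk that is pure overlap
    return chunks
```

Nothing in this code knows about sentences, sections, or tables, which is exactly the problem described above.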
What works better:
- Structure-aware chunking. Use document structure — headings, sections, paragraphs — as chunk boundaries. A section about "Capital Requirements for Commercial Real Estate" should be one chunk, not three fragments of it mixed with adjacent sections. This requires parsing the document structure, which is straightforward for HTML and Markdown but painful for PDF.
- Semantic chunking. Use embedding similarity between adjacent sentences to find natural break points. When the embedding similarity between consecutive sentences drops below a threshold, insert a chunk boundary. This produces chunks that are semantically coherent without requiring structural markup.
- Parent-child chunking. Embed small chunks for retrieval precision, but store them with a reference to the larger parent chunk. At retrieval time, return the parent. This gives you the precision of small chunks with the context of large ones. LlamaIndex calls this "sentence window retrieval." The idea is sound regardless of the framework.
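Of the three, semantic chunking is the easiest to sketch. Assuming you already have sentence embeddings from any model (the vectors below are supplied by the caller), break wherever adjacent similarity drops:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(sentences, vectors, threshold=0.5):
    """Group consecutive sentences into chunks; start a new chunk when
    the similarity between adjacent sentence embeddings drops below the
    threshold. vectors[i] is the embedding of sentences[i] (any model;
    the threshold value is illustrative and needs tuning per corpus)."""
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine(vectors[i - 1], vectors[i]) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```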
In our regulatory document system, switching from fixed-size to structure-aware chunking improved retrieval precision by roughly 25% — without changing the embedding model, the vector database, or anything else in the pipeline. Chunking is the foundation. Get it wrong and everything downstream is working with degraded inputs.
Retrieval Quality Is the Bottleneck
The single most important insight from building production RAG: if the right information isn't in the retrieved chunks, no amount of prompt engineering will fix the output. The LLM can only work with what you give it.
Most RAG debugging sessions that start with "the model is hallucinating" end with "the retriever didn't return the relevant document." Generation quality is rarely the bottleneck. Retrieval quality almost always is.
This reframes where you invest your engineering effort. Instead of fine-tuning the LLM or crafting elaborate system prompts, focus on:
Hybrid search. Pure semantic search (vector similarity) fails on exact terms. If a user asks about "BCBS 239 Principle 3," a semantic search might return documents about data governance in general rather than the specific principle. Pure keyword search (BM25) fails on paraphrases. Hybrid search — combining BM25 and vector similarity with reciprocal rank fusion — handles both cases. Every production system I've seen that works well uses hybrid search. It's the practical default.
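Reciprocal rank fusion itself is only a few lines. A sketch that merges ranked document-ID lists from BM25 and vector search (k=60 is the commonly used default from the original RRF paper, not a tuned value):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse multiple ranked lists of document IDs.
    score(d) = sum over lists of 1 / (k + rank of d in that list)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document that appears high in both lists outranks one that appears high in only one, which is the behavior you want from a hybrid retriever.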
Metadata filtering. Not every chunk is relevant to every query. If a user asks about the current policy, you don't want the system retrieving superseded versions. Metadata filtering — document type, effective date, jurisdiction, department — reduces the search space before semantic search even runs. This is simple to implement and dramatically reduces noise in the results.
Re-ranking. Embedding similarity is a rough sort. A cross-encoder re-ranker (Cohere Rerank, a fine-tuned BERT model, or ColBERT) takes the initial candidate set and re-scores them with a more expensive but more accurate model that looks at the query and each candidate together. Re-ranking typically lifts the top-1 accuracy by 10-20% over vector search alone. It's the single highest-leverage addition to a naive RAG pipeline.
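The re-ranking step itself is simple once the scorer exists. A sketch with the cross-encoder abstracted as any callable that scores a (query, candidate) pair together — the toy word-overlap scorer in the usage example stands in for a real model:

```python
def rerank(query, candidates, score_fn, top_n=5):
    """Re-score candidate chunks with a cross-encoder-style scorer that
    sees query and candidate jointly. score_fn stands in for a real
    cross-encoder; here it is any (query, text) -> float callable."""
    scored = [(score_fn(query, c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_n]]
```

In production the callable would wrap an actual cross-encoder; the surrounding logic stays the same.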
The retrieval pipeline is a funnel: hybrid search casts a wide net, metadata filtering prunes it, and re-ranking narrows what remains to the most relevant candidates.
Evaluation Is the Hardest Part
How do you know if your RAG system is working? This is the question most teams avoid. They eyeball a few examples, declare it "pretty good," and move on. This works until it doesn't — and in a regulated environment, "pretty good" is not an acceptable standard.
RAG evaluation has two distinct problems:
Retrieval evaluation. Did the retriever return the right chunks? This requires a ground truth dataset — a set of questions paired with the documents that contain the correct answer. Building this dataset is labor-intensive. There's no shortcut. You need domain experts to curate a representative set of questions, identify the correct source passages, and validate that the passages actually answer the questions.
Once you have ground truth, standard retrieval metrics apply: recall@k (did the right document appear in the top k results?), MRR (where did the right document rank?), and precision@k (what fraction of the top k results were relevant?). These metrics are well-understood, cheap to compute, and directly actionable.
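These metrics are short enough to write by hand. Per-query versions, which in practice you average over the evaluation set:

```python
def recall_at_k(relevant_ids, retrieved_ids, k):
    """1.0 if any relevant document appears in the top k, else 0.0."""
    return float(any(d in relevant_ids for d in retrieved_ids[:k]))

def reciprocal_rank(relevant_ids, retrieved_ids):
    """1 / rank of the first relevant document; 0.0 if none appears."""
    for rank, d in enumerate(retrieved_ids, start=1):
        if d in relevant_ids:
            return 1.0 / rank
    return 0.0

def precision_at_k(relevant_ids, retrieved_ids, k):
    """Fraction of the top k results that are relevant."""
    return sum(d in relevant_ids for d in retrieved_ids[:k]) / k
```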
Generation evaluation. Did the LLM produce a correct, grounded, well-formed answer? This is harder. Automated metrics exist — RAGAS provides faithfulness (is the answer supported by the retrieved context?), answer relevancy (does the answer address the question?), and context recall (does the retrieved context contain the necessary information?). These are useful directional signals. They are not sufficient for compliance sign-off.
In our systems, we run RAGAS metrics on every evaluation cycle to catch regressions, but the final quality gate is human review. A sample of production queries is reviewed weekly by subject matter experts who check factual accuracy, source attribution, and completeness. This is expensive. It's also the only way to build the level of confidence that a regulated environment requires.
The practical framework:
- Pre-production: Build a curated evaluation set of 200-500 question-answer-source triples. Measure retrieval recall and RAGAS metrics. Set thresholds. Block deployment if thresholds aren't met.
- Production: Log every query, the retrieved chunks, and the generated answer. Run automated metrics on a rolling basis. Flag anomalies for human review. Maintain a growing golden dataset from validated production queries.
- Feedback loops: When human reviewers identify failures, trace whether the failure was retrieval (wrong chunks) or generation (right chunks, wrong answer). This distinction drives where you invest improvement effort.
Guardrails in Regulated Contexts
In banking, a RAG system that produces a plausible but incorrect answer about capital requirements, sanctions policy, or client eligibility isn't just a bad user experience. It's a potential regulatory violation. Guardrails aren't a nice-to-have. They're the cost of admission.
What we've found works:
Source citation as a hard requirement. Every answer must cite the specific document and section it's drawing from. If the system can't identify a source, it says so. "I don't have enough information to answer that" is a better output than a confident hallucination. Implementing this means the LLM must be instructed to ground every claim in retrieved context and the system must verify that the cited passages actually exist in the retrieval results.
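The verification half of this can be a set-membership check. A sketch — the chunk and citation shapes (doc_id, section) are assumptions for illustration, not a prescribed schema:

```python
def verify_citations(answer_citations, retrieved_chunks):
    """Check that every (doc_id, section) pair the answer cites
    actually appears among the retrieved chunks. Returns (ok, missing)
    so the caller can reject or regenerate answers with phantom cites."""
    available = {(c["doc_id"], c["section"]) for c in retrieved_chunks}
    missing = [cit for cit in answer_citations if cit not in available]
    return len(missing) == 0, missing
```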
Confidence-based routing. When retrieval scores are low — below a calibrated threshold — the system escalates to a human rather than generating a speculative answer. This requires tuning the threshold against your evaluation set: too high and the system escalates everything, too low and you let through unreliable answers. We found that retrieval score alone isn't sufficient; we also check whether the top-ranked chunk actually contains an answer to the question using a lightweight NLI (natural language inference) model.
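A sketch of the routing decision. The NLI check is abstracted as an injected callable, and the threshold value here is illustrative — the whole point of the paragraph above is that it must be calibrated against your own evaluation set:

```python
def route(query, retrieval_results, score_threshold=0.6, entails_fn=None):
    """Decide whether to answer or escalate to a human.
    retrieval_results: list of (score, chunk_text), best first.
    entails_fn: stand-in for a lightweight NLI model — any callable
    (chunk, query) -> bool saying whether the chunk answers the query."""
    if not retrieval_results:
        return "escalate"
    top_score, top_chunk = retrieval_results[0]
    if top_score < score_threshold:
        return "escalate"  # retrieval confidence too low
    if entails_fn is not None and not entails_fn(top_chunk, query):
        return "escalate"  # high score, but chunk doesn't answer the question
    return "answer"
```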
Topic boundaries. The system should only answer questions within its domain. A RAG system trained on credit policy documents should not answer questions about employment law, even if some tangentially relevant text exists in the corpus. Topic classification on the input query, using either a fine-tuned classifier or a few-shot LLM prompt, prevents the system from overreaching.
Output validation. Before returning an answer to the user, run a check: does the answer contradict any of the retrieved context? Does it contain claims not supported by the context? This is essentially a faithfulness check at inference time. It adds latency — typically 200-500ms for a lightweight model — but it catches the most dangerous failure mode: a fluent, confident, wrong answer.
Architecture Patterns That Work
After iterating through several approaches, these are the patterns I'd recommend for a production RAG system in an enterprise context:
Separate ingestion from serving. Document processing, chunking, and embedding should be a batch pipeline that runs asynchronously. The query-time path should hit an already-populated index. Mixing ingestion and serving creates latency and reliability issues.
Version your index. When you re-chunk or re-embed your corpus (and you will, as you iterate on chunking strategy), maintain the previous version alongside the new one. This lets you A/B test retrieval quality and roll back if the new approach regresses. Treat your vector index like a deployment artifact, not a mutable database.
Use a query understanding layer. Before embedding the user's query, process it. Expand acronyms. Resolve ambiguity. For complex questions, decompose them into sub-queries. This can be as simple as a prompt that rewrites the query for retrieval, or as sophisticated as a multi-step planning agent. The key insight is that user queries are optimized for human communication, not for vector similarity search. A translation layer helps.
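The simplest version of this layer is a glossary-based rewrite that expands acronyms before embedding, so the retrieval query matches the vocabulary of the policy documents. The glossary entries below are illustrative:

```python
# Illustrative glossary — real entries come from your domain's terminology.
ACRONYMS = {
    "CRE": "commercial real estate",
    "LTV": "loan-to-value",
}

def rewrite_query(query, glossary=ACRONYMS):
    """Append expansions for known acronyms so both the acronym and its
    expansion are present in the retrieval query."""
    words = []
    for w in query.split():
        bare = w.strip("?.,").upper()
        if bare in glossary:
            words.append(f"{w} ({glossary[bare]})")
        else:
            words.append(w)
    return " ".join(words)
```

An LLM-based rewriter subsumes this, but a deterministic glossary pass is cheap, auditable, and a good first layer in a regulated setting.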
Cache aggressively. Many RAG queries are variations of the same question. Semantic caching — mapping similar queries to cached responses — reduces latency and cost. In our system, roughly 30% of queries are close enough to a previously answered question that we can serve a cached response with high confidence.
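A minimal semantic cache, assuming an injected embedding function; the similarity threshold is illustrative and should be calibrated so cache hits really are the same question:

```python
import math

class SemanticCache:
    """Serve cached answers for queries whose embeddings are close
    enough to a previously answered query."""

    def __init__(self, embed_fn, threshold=0.95):
        self.embed_fn = embed_fn      # any query -> vector callable
        self.threshold = threshold    # illustrative; calibrate in practice
        self.entries = []             # list of (embedding, answer)

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    def get(self, query):
        """Return a cached answer if a close-enough query exists, else None."""
        v = self.embed_fn(query)
        best = max(self.entries, key=lambda e: self._cosine(e[0], v), default=None)
        if best is not None and self._cosine(best[0], v) >= self.threshold:
            return best[1]
        return None

    def put(self, query, answer):
        self.entries.append((self.embed_fn(query), answer))
```

A production version would also expire entries when the underlying documents change — a stale cached answer is its own compliance problem.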
Monitor retrieval quality in production. Track the distribution of retrieval scores over time. If average scores drop, something changed — new documents that aren't chunked well, a shift in query patterns, or index degradation. Retrieval score monitoring is the RAG equivalent of model drift monitoring in ML systems.
What I'd Do Differently
Looking back on a year of building these systems:
I'd invest in the evaluation dataset earlier. We spent too long iterating on architecture without a rigorous way to measure improvement. Subjective assessment is unreliable. Build the eval set first, even if it's small. You can always expand it.
I'd start with hybrid search from day one. We began with pure vector search and bolted on BM25 later when we hit keyword-specific failures. Hybrid search should be the starting point, not an optimization.
I'd spend more time on document parsing and less on embedding model selection. The difference between a well-parsed, cleanly chunked document and a poorly parsed one dwarfs the difference between embedding models. If your PDF extraction is producing garbage, it doesn't matter how good your embeddings are.
And I'd set up guardrails before the first user touches the system, not after the first incident. In a regulated environment, the cost of a hallucinated answer in production is orders of magnitude higher than the cost of building validation into the pipeline from the start.
RAG is the most practical pattern for bringing LLMs into enterprise environments — it lets you ground generation in your own data without fine-tuning. But the gap between a working demo and a production system is significant, and most of it has nothing to do with the LLM. It's parsing, chunking, retrieval, evaluation, and guardrails. The unglamorous infrastructure that makes the difference between a system that impresses in a demo and one that holds up under real-world use.