RAG in 2026: From Pipeline to Agent
In 2024, enterprise RAG was a pipeline: load, chunk, embed, retrieve, generate, in a straight line. In 2026 it is a loop. The model decides what to retrieve, reads what comes back, decides whether it has enough, and searches again if it does not. Retrieval stopped being a stage you run once and became a tool the model calls as many times as the question needs.
I wrote two years ago about what breaks when you move RAG from a tutorial to production: parsing, chunking, retrieval quality, evaluation, and the guardrails that matter when a wrong answer is a compliance incident, not a demo glitch. Rereading it now, the surprising part is how little I would retract. The things that broke then still break now. What changed is the shape of the system built around them.
That one change touches everything. Here is what moved, what held, and what to do about it.
The Pipeline Became a Loop
The 2024 architecture was a funnel. Hybrid search cast a wide net, re-ranking narrowed it, the top chunks went into a prompt, the model generated an answer. One pass. Retrieval happened before the model ever engaged with the question, which meant the quality of the whole system rested on a single guess at what to fetch.
The 2026 architecture puts the model inside the loop. It reads the question, decides what to search for, looks at the results, and makes a second decision: is this enough to answer, or do I search again with different terms? A question about one policy might resolve in a single retrieval. A question that spans three documents and a definitional edge case might take five. The system spends effort in proportion to the question instead of spending the same fixed budget on all of them.
This is what people mean by agentic retrieval, and the cleanest way I have heard it put is that what died was naive RAG, not retrieval. Single-shot vector search dumped into a prompt was always a compromise. We did it because it was simple, not because it was right. Letting the model run the retrieval loop is slower and costs more per query, and for a large class of questions it is worth it, because the failure mode of the old design was silent. If the first guess missed, you got a confident answer built on the wrong context and no signal that anything had gone wrong.
In 2024 retrieval ran once on a fixed budget. In 2026 the model runs the loop and decides when it has enough.
The cost is real and worth naming. A loop is harder to reason about than a line. It can run long, retrieve too much, or talk itself into a bad search. You are no longer debugging one retrieval. You are debugging a trajectory.
Long Context Did Not Kill RAG, It Moved the Boundary
The obvious objection in 2026 is that none of this matters because context windows got enormous. Gemini and Claude both take a million tokens, and some models advertise several times that. Why retrieve at all when you can put the whole corpus in the prompt?
For some problems, you should. If your corpus is small, clean, and fits comfortably in the window, retrieval is overhead you do not need. Load the document and ask. The honest version of the "is RAG dead" argument is right about this narrow case, and plenty of 2024-era RAG systems were solving a problem they no longer have.
But two things are true that the headline misses. First, the usable window is smaller than the advertised one. A model that accepts a million tokens does not attend to all of them evenly. Its ability to recall a fact from inside the context degrades well before you reach the limit, and the degradation is quiet, which is the dangerous kind. A model that silently ignores the middle of a long document is harder to catch than a retriever that returns nothing. Second, and this is the part that matters in a regulated environment, the window does not solve the problems retrieval was actually solving. It does not scale to a corpus that does not fit. It does not handle data that changes every week. It does not give different answers to users with different permissions. And it does not tell you which sentence the answer came from.
That last one is the whole game in banking. Take a question about whether a counterparty breaches a concentration limit. The answer has to name the policy, the clause, and the version in force, because a reviewer will check it. An answer without a citation is not an answer I can ship. Long context can hold every policy document at once. Ask it to cite, and the model is self-reporting which clause it used, which is not the same as an audit trail. Retrieval gives you an evidence boundary instead: this answer came from these clauses, in these versions, fetched under these permissions. So the boundary moved. Long context took the small, static, low-stakes end of the spectrum. Retrieval kept the large, changing, auditable end. Most regulated enterprise problems live on that second end.
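One way to make that evidence boundary concrete is to store it as data next to every answer. This is a minimal sketch, not a schema from any library: the class names, the field names (`policy_id`, `clause`, `version`, `user_role`), and the sample values are all hypothetical placeholders.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Evidence:
    """One retrieved clause that an answer relies on (hypothetical schema)."""
    policy_id: str
    clause: str
    version: str
    text: str


@dataclass
class AnswerRecord:
    """An answer plus the evidence boundary it was generated inside."""
    question: str
    answer: str
    user_role: str  # the permissions the retrieval ran under
    evidence: list[Evidence] = field(default_factory=list)
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def citations(self) -> list[str]:
        # Built from what was actually fetched, not from what the model says it used.
        return [f"{e.policy_id} {e.clause} (v{e.version})" for e in self.evidence]


# Placeholder values for illustration only.
record = AnswerRecord(
    question="Does counterparty X breach the concentration limit?",
    answer="Placeholder answer text.",
    user_role="credit_risk_analyst",
    evidence=[Evidence("CR-POL-7", "4.2", "2026-01", "Placeholder clause text.")],
)
print(record.citations())
```

The design choice that matters is that `citations()` reads from the retrieval record, so the citation cannot drift from what was fetched.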
Retrieval Is Still the Bottleneck, but the Retriever Got Smarter
I wrote in 2024 that retrieval quality, not generation quality, was the bottleneck. If the right information is not in the retrieved context, no amount of prompt engineering saves you. That is more true now, not less, because the model does more of its own reasoning and is even more dependent on being handed the right material to reason over.
What changed is where the intelligence sits. In 2024 the smart part of retrieval was a fixed query-understanding layer you built by hand: expand the acronyms, rewrite the query for search, maybe decompose it into parts. In 2026 that work moved into the model. It decomposes the question itself, issues several searches, and reformulates when the first ones come back thin. The query-understanding layer I once described as a separate component is now just something the agent does in the loop.
The substrate underneath did not change. Hybrid search, combining keyword and vector retrieval, is still the practical default, and for the same reason as before: semantic search misses exact terms and keyword search misses paraphrases, and you need both. Re-ranking still earns its place at the narrow end of the funnel. And the embedding-model question that teams agonized over in 2024 matters even less now than it did then, which was already less than people thought. The leverage was never the embedding model. It was giving the retriever good text and good structure to work with, then letting something more discerning than cosine similarity make the final call.
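One common way to combine the keyword list and the vector list is reciprocal rank fusion, which needs only the ranks, not comparable scores. The sketch below is one option among several, not a claim about any particular search engine; the document ids are made up, and `k=60` is the conventional default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document ids into one list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


keyword_hits = ["reg-cc-229", "policy-12", "memo-3"]  # e.g. BM25 order
vector_hits = ["policy-12", "faq-9", "reg-cc-229"]    # e.g. embedding order
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

A document that both retrievers rank well rises above one that only a single retriever found, which is the behavior you want before a re-ranker sees the candidates.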
The PDF Problem Got a Real Answer
The loudest complaint in the 2024 piece was parsing. PDFs are the worst. The same page produces different text depending on the extraction tool, and a table flattened into a run of numbers loses the meaning the layout carried. Teams spent real budget turning documents into clean text, and the cleanup was where a lot of quality quietly leaked away.
The most interesting thing to happen to enterprise retrieval since then is the idea that you can skip that step. Vision-based retrieval, the approach ColPali introduced in 2024, embeds the image of the page directly instead of the extracted text. It uses a vision language model to turn each page into a set of embeddings and a late-interaction step to match the query against the visual content, the same family of technique as the ColBERT re-ranking I mentioned in the original piece, applied to pixels instead of tokens. A table stays a table. A chart no parser would have read correctly becomes retrievable. The layout, which carried meaning all along, is preserved because you never threw it away.
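The late-interaction step is easier to see in code than in prose. As I understand the ColBERT-style scoring, each query vector is matched against its best counterpart on the page and the maxima are summed. This is a toy sketch in plain Python with hand-written 2-d vectors; real systems get these vectors from the model, one per query token and one per image patch.

```python
def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: list[list[float]], page_vecs: list[list[float]]) -> float:
    """Late-interaction score: for each query vector take its best match
    among the page's vectors, then sum those maxima."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


# Toy vectors, invented for illustration.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
page_b = [[0.1, 0.1], [0.3, 0.2]]

score_a = maxsim_score(query, page_a)
score_b = maxsim_score(query, page_b)
print(score_a > score_b)
```

Because every query vector finds its own best patch, a page can match on a table cell and a column header at once, which a single pooled page vector cannot express.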
The technique began as a 2024 research result and has since moved into vector-database tooling, which is the usual sign something crossed from paper to practice. It does not make parsing obsolete everywhere. Text extraction is still cheaper at scale and fine for documents that are mostly prose. But for the dense, structured, layout-heavy documents that fill financial services, the filings and term sheets and forms where the structure is the content, being able to retrieve the page as it actually looks is the first answer to the parsing problem that feels like progress rather than a better workaround.
Evaluation Got Harder, Because Agents Have Trajectories
I called evaluation the hardest part in 2024, and I would say it louder now. The reason is the loop. When RAG was one pass, evaluation had two questions: did the retriever return the right chunks, and did the model write a faithful answer. Both were hard, but both were static. You could build a set of question, answer, and source triples and measure against it.
An agent does not give you one retrieval to grade. It gives you a path. Did it search for the right things, in a sensible order? Did it know when it had enough and stop, instead of retrieving five more times and burying the answer in noise, or stopping early and missing the document that mattered? A trajectory can reach a correct answer through a process you would never sign off on, and it can fail a question it should have handled because it went down a bad branch on the first search. Grading the answer alone no longer tells you whether the system works.
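A rough sketch of what path-level checks can look like, assuming you log each search and the ids it returned. The three metrics here (`found_all_gold`, `wasted_steps`, `repeated_queries`) are my own illustrative choices, not a standard, and the sample run is invented.

```python
from dataclasses import dataclass


@dataclass
class Step:
    query: str
    returned_ids: list[str]


def grade_trajectory(steps: list[Step], gold_ids: set[str]) -> dict:
    """Path-level checks for one agent run (hypothetical metrics)."""
    seen: set[str] = set()
    covered_at = None  # 1-based step at which every gold document had been seen
    for i, step in enumerate(steps, start=1):
        seen.update(step.returned_ids)
        if covered_at is None and gold_ids <= seen:
            covered_at = i
    queries = [s.query.strip().lower() for s in steps]
    return {
        # False means the agent stopped before reaching the documents that mattered.
        "found_all_gold": covered_at is not None,
        # Searches issued after the needed documents were already in hand.
        "wasted_steps": len(steps) - covered_at if covered_at else 0,
        "repeated_queries": len(queries) - len(set(queries)),
    }


run = [
    Step("concentration limit policy", ["p1"]),
    Step("large exposure definition", ["p2", "p3"]),
    Step("concentration limit policy", ["p1"]),
]
report = grade_trajectory(run, gold_ids={"p1", "p2"})
print(report)
```

None of these replace a human reading the path, but they flag which runs are worth reading first.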
The tooling improved. Using a strong model as a judge to score faithfulness and relevance is more practical than it was, and it is genuinely useful for catching regressions across a loop. But it brought its own problem, which is that you are now trusting one model to grade another, and the judge has failure modes of its own. In the systems I care about, automated metrics still buy you regression detection and nothing more. The quality gate that lets something ship into a regulated workflow is still a human reviewing a sample, tracing not just whether the answer was right but whether the path to it was sound. That is more expensive than it was in 2024, because there is more to look at. It is also more necessary, for the same reason.
Guardrails Moved From Checking the Output to Controlling the Actions
In 2024 the guardrail was a check on one answer. Did it cite a source. Did it contradict the retrieved context. Did it stay inside the system's domain. The dangerous failure was a fluent, confident, wrong sentence, and you validated against it at the end.
Agents widened the surface. A system that runs a retrieval loop and calls tools is not making one decision you can inspect at the end. It is taking a series of actions, each of which can retrieve something it should not, pull in a document the user has no right to see, or call a tool with real consequences. The guardrail can no longer live only at the output. It has to live at every action the agent takes.
This is where the connective tissue of the 2026 stack, the Model Context Protocol that now standardizes how models reach tools and data, becomes a governance question as much as an engineering one. Standardizing the plumbing standardized the attack surface with it. Every tool an agent can call is a place where access control, logging, and policy have to hold, and "the model decided to" is not a sentence that survives an audit. The hard half of agentic retrieval in a regulated environment is no longer making the answer faithful. It is constraining what the system is allowed to do in the first place, and proving afterward that it stayed inside the line.
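A guardrail at the action level can be as plain as a wrapper that every tool call has to pass through. The sketch below is deliberately independent of any MCP SDK: the `ALLOWED` table, the role names, and the tool names are hypothetical, and a real deployment would load policy from wherever your entitlements live.

```python
import logging
from typing import Any, Callable

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("tool_gate")

# Hypothetical policy table: which roles may call which tools.
ALLOWED: dict[str, set[str]] = {
    "search_policies": {"analyst", "reviewer"},
    "export_report": {"reviewer"},
}


class ToolDenied(Exception):
    """Raised when a role is not permitted to call a tool."""


def gated_call(user_role: str, tool_name: str, tool: Callable[..., Any], **kwargs: Any) -> Any:
    """Check policy, log the decision either way, then run the tool."""
    allowed = user_role in ALLOWED.get(tool_name, set())
    log.info("tool=%s role=%s args=%s allowed=%s", tool_name, user_role, kwargs, allowed)
    if not allowed:
        raise ToolDenied(f"{user_role} may not call {tool_name}")
    return tool(**kwargs)


# A stand-in tool; a real one would hit the retrieval backend.
hits = gated_call("analyst", "search_policies", lambda query: [query.upper()], query="limits")
print(hits)
```

Denied calls are logged before the exception is raised, so the audit trail records what the agent tried to do, not only what it was allowed to do.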
What to Actually Do
The argument is only useful if it changes what you build. Here is the checklist I would hand a team building enterprise retrieval in 2026, grouped by where the work lands. Most of it is the 2024 advice, still standing. The agent loop adds the rest.
Retrieval and ingestion
- Start with hybrid search, not as an optimization you add later. Vector similarity plus keyword (BM25) with rank fusion. Vector alone misses exact terms like a regulation code; keyword alone misses paraphrases. You need both from day one.
- Spend on parsing and chunking before you shop for embedding models. A clean parse and structure-aware chunks beat a better embedding model every time. The embedding choice is the smallest lever in the chain, and teams still over-invest in it.
- Add a re-ranker early. A cross-encoder over the top candidates is the single highest-leverage addition to a naive pipeline. It is cheap to bolt on and it moves the number that matters: the rank of the right passage.
- Filter on metadata before the search runs. Effective date, jurisdiction, document type. It keeps superseded versions out of the candidate set, which is where a lot of confident-but-wrong answers are born.
- For layout-heavy documents, try vision retrieval before you pour budget into PDF parsing. If the structure is the content, retrieve the page as an image instead of fighting an extractor for it.
- Version your index. Treat it as a deployment artifact so you can A/B a chunking change and roll back a regression, instead of mutating it in place and hoping.
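The metadata filter from the list above can be sketched as a plain pre-filter that runs before any search. The field names (`effective_from`, `effective_to`, `jurisdiction`, `doc_type`) and the sample documents are assumptions for illustration; most vector stores expose an equivalent filter in their own query syntax.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class DocMeta:
    doc_id: str
    doc_type: str
    jurisdiction: str
    effective_from: date
    effective_to: Optional[date] = None  # None means still in force


def eligible(docs: list[DocMeta], *, as_of: date, jurisdiction: str, doc_type: str) -> list[str]:
    """Return ids of documents in force on as_of that match the filters."""
    return [
        d.doc_id
        for d in docs
        if d.jurisdiction == jurisdiction
        and d.doc_type == doc_type
        and d.effective_from <= as_of
        and (d.effective_to is None or as_of < d.effective_to)
    ]


# Invented documents: v1 was superseded by v2 on 2025-07-01.
docs = [
    DocMeta("limits-v1", "policy", "UK", date(2023, 1, 1), date(2025, 7, 1)),
    DocMeta("limits-v2", "policy", "UK", date(2025, 7, 1)),
    DocMeta("limits-us", "policy", "US", date(2024, 1, 1)),
]
current = eligible(docs, as_of=date(2026, 3, 1), jurisdiction="UK", doc_type="policy")
print(current)
```

Passing `as_of` explicitly also lets you answer "what did the policy say on that date", which comes up in reviews more often than you would expect.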
Long context vs retrieval
- Decide per use case, not once for the whole company. Small, clean, static, low stakes goes in the window. Large, changing, permissioned, auditable gets retrieved. Most teams run both.
- Do not trust the full advertised window. Assume usable context is smaller than the headline number, and test recall at different positions before you rely on it.
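Testing recall at different positions does not need much machinery. Here is a minimal harness, assuming you supply `ask`, a function that sends a prompt to your model and returns its reply. The `fake_model` below is a stand-in that only "sees" the start and end of the prompt, so the harness has something to catch; the filler text and the approval code are invented.

```python
from typing import Callable


def build_prompt(filler: list[str], needle: str, depth: float) -> str:
    """Insert the needle sentence at a relative depth (0.0 = start, 1.0 = end)."""
    idx = round(depth * len(filler))
    return " ".join(filler[:idx] + [needle] + filler[idx:])


def recall_by_depth(
    ask: Callable[[str], str],
    filler: list[str],
    needle: str,
    question: str,
    expected: str,
    depths: list[float],
) -> dict[float, bool]:
    """For each depth, check whether the reply contains the expected fact."""
    results = {}
    for depth in depths:
        prompt = build_prompt(filler, needle, depth) + "\n\nQuestion: " + question
        results[depth] = expected in ask(prompt)
    return results


def fake_model(prompt: str) -> str:
    # Stand-in for a real model call: it ignores the middle of the prompt.
    visible = prompt[:300] + prompt[-300:]
    return "ZX-417" if "ZX-417" in visible else "unknown"


filler = [f"Filler sentence number {i}." for i in range(200)]
results = recall_by_depth(
    fake_model, filler, "The approval code is ZX-417.",
    "What is the approval code?", "ZX-417", [0.0, 0.5, 1.0],
)
print(results)
```

With a real model you would repeat this across several context lengths and needles, since a single needle at a single length says little.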
The agent loop
- Add the loop only when questions are genuinely multi-step. A single-policy lookup does not need one. A loop you do not need is latency, cost, and surface area bought for nothing.
- Cap it. Set a maximum number of retrieval steps and a token budget so one bad run cannot spiral into fifty searches.
- Make the stopping decision observable. Log why the agent judged it had enough. That is the line a reviewer will question first.
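The cap and the observable stopping decision fit in a few lines. This is a sketch under stated assumptions: `next_query` stands in for the model's decision (it returns the next search string, or `None` when it judges the context sufficient), `search` stands in for your retriever, and a character count stands in for a real token budget. The toy corpus and planner are invented.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class LoopResult:
    passages: list[str] = field(default_factory=list)
    queries: list[str] = field(default_factory=list)
    stop_reason: str = ""


def retrieval_loop(
    question: str,
    search: Callable[[str], list[str]],
    next_query: Callable[[str, list[str]], Optional[str]],
    max_steps: int = 5,
    max_chars: int = 8000,
) -> LoopResult:
    """Run searches until the planner says stop or a budget is hit."""
    result = LoopResult()
    for _ in range(max_steps):
        query = next_query(question, result.passages)
        if query is None:
            result.stop_reason = "planner judged context sufficient"
            return result
        result.queries.append(query)
        result.passages.extend(search(query))
        if sum(len(p) for p in result.passages) > max_chars:
            result.stop_reason = "context budget exceeded"
            return result
    result.stop_reason = "max_steps reached"
    return result


# Toy retriever and planner, for illustration only.
corpus = {
    "concentration limit": ["Placeholder text of the limit clause."],
    "large exposure definition": ["Placeholder text of the definition."],
}
plan = ["concentration limit", "large exposure definition"]


def toy_planner(question: str, passages: list[str]) -> Optional[str]:
    return plan[len(passages)] if len(passages) < len(plan) else None


outcome = retrieval_loop("Does X breach the limit?", lambda q: corpus.get(q, []), toy_planner)
print(outcome.queries, "|", outcome.stop_reason)
```

Every exit path sets `stop_reason`, so a reviewer can tell a run that finished from a run that was cut off.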
Evaluation
- Build the eval set first, even a small one. A few hundred question, answer, and source triples beat subjective assessment, and you can grow the set from validated production queries.
- Measure retrieval and generation separately. Recall and rank for the retriever, faithfulness for the answer. They fail for different reasons and the fixes are different.
- Grade the trajectory, not just the answer. For an agent, log what it searched, in what order, and when it stopped. A right answer reached through a bad path will not stay right.
- Use a model judge for regressions, a human for the ship gate. Automated faithfulness scoring catches drift cheaply. A sampled human review is still what clears a regulated workflow.
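For the retriever half of that split, recall and rank are cheap to compute once you have the triples. A minimal sketch with invented document ids; it covers only retrieval, and faithfulness of the generated answer needs a separate judge or reviewer.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant ids that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was returned."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


# Each case: (question, ids the retriever returned, ids a reviewer marked as the source).
cases = [
    ("q1", ["d3", "d1", "d7"], {"d1"}),
    ("q2", ["d2", "d9", "d4"], {"d4", "d5"}),
]
mean_recall = sum(recall_at_k(r, gold, k=3) for _, r, gold in cases) / len(cases)
mrr = sum(reciprocal_rank(r, gold) for _, r, gold in cases) / len(cases)
print(f"recall@3={mean_recall:.2f} MRR={mrr:.2f}")
```

Tracking both matters because a re-ranker can improve the rank of the right passage without changing recall at all.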
Guardrails
- Make citation a hard requirement. Every claim grounded in a retrieved passage, and "I do not have enough to answer that" allowed as an output. It beats a confident hallucination every time.
- Escalate on low confidence. When retrieval scores fall below a calibrated threshold, route to a human instead of generating a speculative answer.
- Put controls at every tool call, not just the final output. Access control, logging, and policy on each action the agent can take. Tool access is the new boundary.
- Log enough to reconstruct any answer. In a regulated setting you have to show what was retrieved and why, after the fact. Build that in before the first user, not after the first incident.
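The first two guardrails above can be expressed as one gate in front of the response. This is a sketch, not a calibrated policy: `min_score=0.5` is a placeholder you would have to calibrate against your own retriever, and the `Draft` shape is a hypothetical stand-in for whatever your generation step returns.

```python
from dataclasses import dataclass

REFUSAL = "I do not have enough to answer that."


@dataclass
class Draft:
    answer: str
    cited_ids: list[str]  # passage ids the draft claims to rely on


def gate(draft: Draft, retrieved_scores: dict[str, float], min_score: float = 0.5) -> tuple[str, str]:
    """Return (action, text) where action is 'answer', 'refuse', or 'escalate'."""
    # Low retrieval confidence: route to a human instead of speculating.
    if not retrieved_scores or max(retrieved_scores.values()) < min_score:
        return "escalate", "Routed to human review: retrieval confidence too low."
    # No citations, or a citation to something that was never retrieved: refuse.
    if not draft.cited_ids or any(c not in retrieved_scores for c in draft.cited_ids):
        return "refuse", REFUSAL
    return "answer", draft.answer


decision = gate(Draft("Placeholder answer.", ["p1"]), {"p1": 0.8, "p2": 0.6})
print(decision[0])
```

Checking that cited ids were actually retrieved catches the case where the model invents a plausible-looking citation, which a "has a citation" check alone would pass.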
Long context for the small and static. Retrieval for the auditable. The loop only when the question is genuinely multi-step.
Related
- RAG in Production: What Breaks When You Move Past the Tutorial (the 2024 piece this one follows).
- Why Agent Strategy Becomes an Architecture Problem at Scale
- Your ML Risk Framework Wasn't Built for GenAI. Here's What's Missing.
RAG is still the most practical way to bring an LLM into an enterprise, grounding generation in your own data without fine-tuning. What changed in two years is that the grounding is no longer a line you lay down in advance. It is a loop the model drives. The work that decides whether the system holds up is the same as it ever was, parsing, retrieval, evaluation, and guardrails, and it still has almost nothing to do with the model. The system got smarter. The discipline did not get easier. If anything it got harder, because now you have to account for what the system decided to do, not only what it said.