The reader has a decision in front of them. Vendors now ship million-token context windows with near-perfect needle-in-a-haystack recall, and the implication is loud: retire the retriever, paste the documents, ship the answer. Fabio Akita put the strongest version of the case in April 2026, in “Is RAG Dead? Long Context, Grep, and the End of the Mandatory Vector DB”: the model is now “big, relatively fast, and cheap,” the best coding agent on the market does not use a vector database, and in most cases “a well-aimed grep plus a generous context window beats a full RAG stack.” Akita is also honest enough to include a section, titled in his own words, that lists where the thesis does not hold: massive corpora, vocabulary mismatch in customer support, non-textual modalities, critical latency, and compliance auditing.
This piece is the case for reading both halves of Akita and adding one more. The launch posts that prove the long-context capability publish one kind of benchmark. The architectural decision turns on a second kind, found mostly in the open literature.
Long context made one half of retrieval cheap. The other half (selection, provenance, citations, access control, freshness) is where the work still is.
The rule, upfront
Use long context when the answer is inside a bounded object the model can read in one pass. Use retrieval when the system must choose what to read, cite where it came from, filter for what is current, or govern who can ask. The first is a model capability. The second is a control plane.
The rest of this piece is the case for why that distinction holds at the benchmark level and what it changes about production architecture.
The two surfaces
A long-context workload runs on one of two surfaces, or a mixture of both.
The first surface is lookup. A single fact lives somewhere in a long document or a long set of documents. The model finds it. The classic test for this surface is needle-in-a-haystack (NIAH): embed a sentence at depth X in a corpus of Y tokens, ask the model the question the sentence answers, score recall. Multi-needle variants extend this: embed several facts, ask several questions, score all of them. MRCR v2, which Anthropic reports in its launch posts, is one such variant. MRCR as used in Google’s Gemini reports is another. These are real benchmarks. They are honest about what they measure. They measure lookup.
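The NIAH procedure described above can be sketched in a few lines. This is a minimal illustration, not any vendor's harness: `build_haystack`, `niah_recall`, and the `ask_model` callable are hypothetical names, the substring match is a deliberately crude scorer, and `toy_model` is a stand-in that exists only so the sketch runs without an API.

```python
from typing import Callable, List


def build_haystack(filler: List[str], needle: str, depth: float) -> str:
    """Insert `needle` at a relative depth (0.0 = start, 1.0 = end) of the filler."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = round(depth * len(filler))
    return " ".join(filler[:position] + [needle] + filler[position:])


def niah_recall(
    ask_model: Callable[[str, str], str],  # hypothetical: (context, question) -> answer
    filler: List[str],
    needle: str,
    question: str,
    expected: str,
    depths: List[float],
) -> float:
    """Fraction of depths at which the answer contains the expected string."""
    hits = 0
    for depth in depths:
        context = build_haystack(filler, needle, depth)
        answer = ask_model(context, question)
        # Assumption: case-insensitive substring match stands in for a real scorer.
        hits += expected.lower() in answer.lower()
    return hits / len(depths)


def toy_model(context: str, question: str) -> str:
    """Stand-in for a real model call: returns the first sentence mentioning 'code'."""
    for sentence in context.split(". "):
        if "code" in sentence:
            return sentence
    return ""


filler = [f"Filler sentence number {i}." for i in range(1000)]
needle = "The vault code is 4417."
score = niah_recall(
    toy_model, filler, needle, "What is the vault code?", "4417",
    depths=[0.0, 0.25, 0.5, 0.75, 1.0],
)
```

Swapping `toy_model` for a real client call is the only change needed to run the same sweep against a model. Note what the sketch does not do: it never asks the model to join two facts, which is the point of the next paragraph.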
The second surface is synthesis. Facts live across many documents. The model has to join them. It has to resolve contradictions, weight sources, decide which spans support which claims, and return a grounded answer with citations the reader can verify. The class of workloads here includes earnings analysis across multiple filings, legal review across contract families, incident post-mortems across logs and tickets, and most enterprise question-answering that is not pure fact retrieval. Synthesis includes lookup as a sub-step but does not reduce to it.
Most production workloads are mixtures. A query that starts as lookup (“what is our customer’s contracted SLA”) often unfolds into synthesis (“and does that SLA hold given last week’s incident pattern and the regional change in the September amendment”). The model does the lookup well. The synthesis is the open question.
NIAH and MRCR measure the lookup surface. The decision about whether to retire a retriever depends on the synthesis surface. Reading the launch post tells you about the first. It does not tell you enough about the second.
Where the vendor launches stop
The vendor evidence for the lookup surface is honest and strong. Read it on the vendor’s own pages.
Anthropic’s Claude Opus 4.6 launch post reports that “on the 8-needle 1M variant of MRCR v2 […] Opus 4.6 scores 76%, whereas Sonnet 4.5 scores just 18.5%.” The post is precise: the benchmark name, the variant, the context length, the score, the comparison. This is good documentation.
The same post contains two charts. One is labelled “Long-context retrieval.” The other is labelled “Long-context reasoning.” The retrieval chart is anchored by the named benchmark and the number above. The reasoning chart is presented without a named benchmark or a quoted score in the page text. The visual is there. The auditable number for reasoning is not.
The Claude Sonnet 4.6 launch post goes further in the same direction. The post advertises that “Sonnet 4.6’s 1M token context window is enough to hold entire codebases, lengthy contracts, or dozens of research papers in a single request.” It then offers a qualitative claim: “More importantly, Sonnet 4.6 reasons effectively across all that context.” No NIAH number. No MRCR number. No synthesis number. The reader is asked to trust the prose.
Google’s Gemini 1.5 technical report was the document that opened the era. It reports “99.7% recall up to 1M tokens” and “99.2% accuracy up to 10M tokens” on the multi-modal NIAH evaluation. Both numbers are real. Both measure lookup.
This is not deception. Vendors are reporting what they have measured cleanly. The reader’s job is to notice what is on the page and what is not. NIAH and its multi-needle variants are the dominant surface on the launch page. The synthesis surface is usually less auditable from the launch page itself.
What the literature measures instead
Where the vendor pages stop, the open literature continues. Three results sit squarely on the synthesis surface.
NoLiMa (Modarressi et al., ICML 2025) paraphrases the needle so the model cannot rely on lexical overlap. At 32K, eleven of the tested models drop below 50% of their short-length baseline. GPT-4o falls from 99.3% to 69.7%. The same models score near-perfect on lexical NIAH at the lengths they advertise.
NoCha (Karpinska et al., EMNLP 2024) tests claim verification over full novels. No open-weight model exceeds random chance. GPT-4o reaches 55.8%. The benchmark asks one true-or-false question per claim. The best closed model is barely above a coin flip on a question a careful reader of the book would answer in seconds.
BABILong (Kuratov et al., NeurIPS 2024) measures multi-step reasoning embedded in long context. The headline: “popular LLMs effectively utilize only 10-20% of the context.” The window is a million tokens. The effective working set is far smaller.
Two further results in the same direction are worth knowing. RULER reports that “only half” of models advertising 32K-or-greater windows maintain satisfactory performance at 32K. Lost in the Middle remains the canonical reference for position-dependent degradation: middle-of-context information is recovered less reliably than start-or-end information, and the architectures have moved without removing the effect.
These results are not a takedown of long context. They are the second benchmark sitting alongside the vendor NIAH numbers, not against them. The literature keeps showing the same failure mode in different forms: models retrieve needles better than they synthesize evidence.
Routing in practice
The earlier upfront rule turns concrete when you put queries to it. The table below sits next to the diagram on purpose. The table is “given a query, which architecture.” The diagram is “what each architecture actually looks like.”
| Query type | Better default | Why |
|---|---|---|
| “Find clause X in contract Y” | Long context | Bounded object, direct lookup |
| “Compare clauses across 200 contracts” | Retriever + long context | Selection plus synthesis |
| “Answer from weekly changing policies” | RAG | Freshness, citations, governance |
| “Search codebase for usage pattern” | grep / BM25 + long context | Lexical structure matters |
| “Explain customer exposure across systems” | RAG + tools | Access control and provenance |
Three cases, decided by the workload, not by the headline.
Long context comfortably replaces RAG when: the workload is genuine single-document lookup or single-corpus retrieval; the corpus fits inside one window with headroom for the question; the query rate is low enough that per-request input cost pencils out against the retrieval infrastructure it would replace; latency budgets accommodate large-prompt processing; and an internal eval shows the model resolves the workload correctly on your data at the context length you actually run. Akita is largely right for this case. The coding-agent example is the cleanest one.
Long context complements RAG when: the workload is multi-hop synthesis, the corpus exceeds one window, queries are concurrent, citation grounding is required for downstream review, or the eval scores whether the supporting spans were retrieved rather than whether the answer merely appears. The retriever finds the relevant documents. The long window holds them while the model joins them. The two are not competing; they are layered.
RAG stays load-bearing when: documents update faster than re-prompting can amortise; the model needs to cite chunks with verifiable provenance; the team needs the retrieval layer for governance, audit, or access control; or the corpus is large enough that the per-request input bill at long-context tier exceeds the all-in cost of running a retriever. Most enterprise governance regimes live here. The retrieval layer is not just a cost optimisation; it is a control surface.
Note that “the workload” is plural in most teams. The same product can have lookup queries that route to long context and synthesis queries that route to RAG-augmented context. The architecture decision is per query class, not per company.
It is also worth keeping a distinction Akita makes implicitly: “retire the vector database” is not the same claim as “retire the retriever.” A grep-based retriever, a BM25 retriever, a hybrid retriever, and a vector retriever are different products with different cost shapes. Akita’s piece is best read as a case against mandatory vector databases. The case for retrieval as a control surface (citations, access control, governance) survives whichever index you choose.
A short routing example makes the rule concrete. “Pull the SLA from contract X” routes cleanly to long context if the contract fits the window. “Compare the SLA across X, Y, and Z and flag drift from the master template” routes to a retriever-first path with citation grounding. “Answer a customer support question from a policy library that updates weekly” stays on RAG. Same product. Three different paths.
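The three paths can be written down as a small per-query-class router. This is a sketch under stated assumptions: `QueryProfile`, `Route`, and `route` are hypothetical names, the 1M window and the 0.8 headroom factor are placeholders to tune, and the token counts in the usage lines are invented. A real router would derive the flags from query classification and policy metadata rather than take them as inputs.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    LONG_CONTEXT = "long_context"
    RETRIEVER_PLUS_LONG_CONTEXT = "retriever_plus_long_context"
    RAG = "rag"


@dataclass
class QueryProfile:
    corpus_tokens: int                   # size of the material the answer could live in
    needs_selection: bool = False        # the system must choose what to read
    needs_citations: bool = False        # answers must cite verifiable spans
    needs_freshness: bool = False        # sources change faster than prompts are rebuilt
    needs_access_control: bool = False   # who may ask depends on the caller


def route(q: QueryProfile, window_tokens: int = 1_000_000, headroom: float = 0.8) -> Route:
    """Pick a path per query class. Thresholds are illustrative assumptions."""
    # Freshness and access control are control-plane concerns: keep the retriever.
    if q.needs_freshness or q.needs_access_control:
        return Route.RAG
    fits = q.corpus_tokens <= window_tokens * headroom
    # Selection, citation grounding, or an oversized corpus: retrieve first, then read.
    if q.needs_selection or q.needs_citations or not fits:
        return Route.RETRIEVER_PLUS_LONG_CONTEXT
    # Bounded object that fits in one pass.
    return Route.LONG_CONTEXT


pull_sla = route(QueryProfile(corpus_tokens=60_000))
compare_slas = route(QueryProfile(corpus_tokens=180_000, needs_selection=True, needs_citations=True))
support_answer = route(QueryProfile(corpus_tokens=5_000_000, needs_freshness=True))
```

With these inputs the three queries land on `LONG_CONTEXT`, `RETRIEVER_PLUS_LONG_CONTEXT`, and `RAG` respectively. The ordering of the checks is the design choice: governance concerns win before window size is even considered.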
What to measure on Monday
Four concrete moves before the next architecture review.
Run a small NoLiMa-style or RULER-style probe on your own corpus before retiring the retriever. The repos are public (NoLiMa, RULER). Plan for a small pilot, not a quarter-long programme: a representative sample of your corpus, a paraphrased-needle harness, and a scoring rubric your team already trusts. The output is a number on your data at your context length. That number does the decision work the launch post cannot.
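One piece of a paraphrased-needle harness is easy to sketch: a check that the question and the needle really do share little wording, so the probe measures more than string matching. The code below is an assumption-laden illustration, not the NoLiMa codebase. `lexical_overlap` and the stopword list are hypothetical, and the example pair is only in the spirit of the benchmark (answering it requires knowing that the Semper Opera House is in Dresden).

```python
import re
from typing import Set

# Assumption: a tiny hand-picked stopword list is enough for a first pass.
STOPWORDS = {
    "the", "a", "an", "is", "are", "was", "were", "of", "in", "on", "to", "at",
    "who", "what", "which", "where", "when", "has", "have", "had", "been",
    "do", "does", "did", "and", "or", "for", "with", "by", "that", "this", "it",
}


def content_words(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens with stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def lexical_overlap(question: str, needle: str) -> float:
    """Share of the question's content words that also appear in the needle."""
    q_words = content_words(question)
    if not q_words:
        return 0.0
    return len(q_words & content_words(needle)) / len(q_words)


needle = "Actually, Yuki lives next to the Semper Opera House."
literal_q = "Which character lives next to the Semper Opera House?"
paraphrased_q = "Which character has been to Dresden?"

literal_overlap = lexical_overlap(literal_q, needle)          # high: classic NIAH
paraphrased_overlap = lexical_overlap(paraphrased_q, needle)  # zero: needs association
```

A pilot could reject any question whose overlap exceeds a small threshold, then run the surviving pairs through the same depth-and-length sweep used for ordinary NIAH.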
Diff a long-context answer against a RAG-augmented answer on ten production queries. Score for span citations and contradiction handling, not impression. Many long-context answers read well and cite nothing the reader can verify. A RAG answer that cites three chunks and a long-context answer that cites prose are different products.
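Scoring span citations can start with something mechanical: does each quoted span actually appear in the document it points to? The sketch below assumes answers have already been parsed into `(doc_id, quote)` pairs. The `Citation` type, the function names, and the normalization rule are all hypothetical choices, and a verbatim match says nothing about whether the span supports the claim.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Citation:
    doc_id: str  # identifier of the source the answer points to
    quote: str   # the span the answer claims to be quoting


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line wraps do not break matching."""
    return " ".join(text.lower().split())


def verified_citation_rate(citations: List[Citation], corpus: Dict[str, str]) -> float:
    """Fraction of citations whose quote appears in the cited document.

    An answer with no citations scores 0.0: nothing in it can be verified.
    """
    if not citations:
        return 0.0
    verified = 0
    for c in citations:
        quote = normalize(c.quote)
        # Empty quotes and unknown documents never count as verified.
        if quote and c.doc_id in corpus and quote in normalize(corpus[c.doc_id]):
            verified += 1
    return verified / len(citations)


corpus = {
    "contract-x": "The Provider guarantees 99.9% monthly uptime.\nCredits apply below that level.",
}
rag_answer = [Citation("contract-x", "guarantees 99.9% monthly uptime")]
prose_answer: List[Citation] = []

rag_rate = verified_citation_rate(rag_answer, corpus)
prose_rate = verified_citation_rate(prose_answer, corpus)
```

Contradiction handling still needs a human or a rubric. The verbatim check only separates answers that can be audited from answers that cannot.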
Measure your p95 input-token bill at the long-context tier on representative traffic. As of May 2026, the current Sonnet 4.6 pricing page keeps the 1M tier at standard rates, whereas the original Sonnet 4 1M launch carried a roughly 2x surcharge above 200K tokens. The trajectory is favourable but not free; re-check the page before re-architecting. Run the numbers at your actual context-length distribution, not the marketing one.
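The arithmetic is simple enough to script against a log of per-request input sizes. In the sketch below every number is a placeholder: the 3.0 USD per million tokens is not a quoted price, the sample token counts are invented, and the surcharge model (the whole request billed at the multiplied rate once it crosses the threshold) is an assumption to check against the vendor's current pricing page.

```python
import math
from typing import List, Optional


def percentile(values: List[int], pct: float) -> int:
    """Nearest-rank percentile of a non-empty list."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def input_cost_usd(
    tokens: int,
    base_price_per_mtok: float,
    surcharge_threshold: Optional[int] = None,
    surcharge_multiplier: float = 1.0,
) -> float:
    """Input cost of one request.

    Assumption: past the threshold, the entire request is billed at the
    multiplied rate. Verify this against the pricing page you actually use.
    """
    rate = base_price_per_mtok
    if surcharge_threshold is not None and tokens > surcharge_threshold:
        rate *= surcharge_multiplier
    return tokens / 1_000_000 * rate


# Invented sample of input-token counts, one per request.
request_tokens = [40_000, 55_000, 60_000, 80_000, 120_000,
                  150_000, 210_000, 300_000, 450_000, 700_000]
PLACEHOLDER_PRICE = 3.0  # USD per million input tokens, illustrative only

p95_tokens = percentile(request_tokens, 95)
p95_flat = input_cost_usd(p95_tokens, PLACEHOLDER_PRICE)
p95_surcharged = input_cost_usd(p95_tokens, PLACEHOLDER_PRICE, 200_000, 2.0)
```

Multiplying the per-request figure by expected request volume gives the number to set against the all-in cost of running a retriever.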
Keep the retriever path warm even when adopting long context for some query classes. The two paths are complementary load lines, not a single cost line. A team that fully deprecates retrieval will pay to rebuild it the first time a customer asks for citations.
What to read past
Binary headlines (“RAG is dead”, “RAG is back”) compress a workload-dependent decision into a slogan. Akita’s own piece concedes both “where the thesis holds” and “where it doesn’t.”
Vendor launches that lead with retrieval numbers and stop. The retrieval number is doing the work the vendor measured. The synthesis number is in the literature.
“Long-context prompt” marketplaces that bundle entire corpora into a single template. The strategy is not always wrong. It is usually unmeasured, and the bundle is rarely sized to the model’s effective working set per BABILong’s 10-20% finding.
The reframe
The deeper shift is not about whether RAG dies. The deeper shift is what retrieval becomes.
In 2024, retrieval was mostly a context-window workaround. The model could not hold the corpus, so you indexed, ranked, and fed it slices. The whole stack existed because the window was small.
In 2026, retrieval is becoming a control plane. Selection, provenance, citations, access control, freshness, cost routing. These are not workarounds for window size. They are first-class production concerns that exist whether the window is 8K or 10M. Long context made one of them cheaper. It did not make the others go away.
A reader who holds the two benchmarks in view (lookup and synthesis) can adopt long context where it pays, keep retrieval where it pays, and run mixed paths where the workload is mixed. Akita’s title is a good question. The honest answer is: not yet, not for everything, and the missing question is no longer “can the model find the needle.” It is “can the system produce a grounded answer when the evidence is distributed, contradictory, changing, and access-controlled.”