Your RAG Pipeline is Probably Overkill

Note: This research was conducted in August 2025. Model capabilities and pricing move fast; some numbers (latencies, throughput, accuracy cliffs) may no longer reflect the current generation of models.

Six months of RAG optimization. Query rewriting got us from 60% to 65%, reranking to 68%, hybrid search to 70% accuracy extracting ESG metrics from annual reports (measured on a manually labeled evaluation set). Each trick bought us between two and five points. Then someone asked: what if we just put the whole document in the context window?

We got 85%.

That question kicked off a research project that became a PyData 2025 talk. This post covers the key findings: when long context windows beat RAG, where they fall apart, and what you should actually do about it.

Tip:

Under 100k tokens? Skip RAG. Context-only is simpler and performs as well or better.

The 100k token quality cliff is real. Performance degrades sharply with distractors and dissimilar phrasing (per Chroma’s research).

Reranking didn’t improve answer quality in our experiments, even though retrieval metrics improved.

Use way more chunks than you think. 50 chunks outperformed 5 or 10 significantly.

The problem that started it all

We needed to extract emissions data from annual reports for ESG analysis. Traditional RAG kept failing:

Chunking destroyed cross-references between sections
There was no standard ESG jargon across companies
No good ground truth dataset existed for evaluation

Meanwhile, context windows were growing rapidly. In just two years, we went from 4k to over 1M tokens. An ABN AMRO annual report is around 500k tokens: it fits.

Bar chart comparing context window sizes across major LLM providers in 2025, showing Claude, Gemini, Llama, and Qwen all reaching 1M tokens.

That growth isn’t monotonic, though. Gemini’s 2M window was quietly cut back to 1M. Llama-4 Scout advertises 10M tokens but Meta’s own API only serves 128k and third parties cap it at 1M. Claude Sonnet 4’s 1M window isn’t yet in general availability. The headline numbers are real, but availability lags the announcement.

The Lord of the Rings trilogy? That fits too. But as we’ll see, fitting in the context window and actually understanding all of it are very different things.

Context window comparison showing that even the Lord of the Rings trilogy fits within modern LLM context windows.

So the natural question became: can we skip RAG entirely and just put the whole document in the context window?

The research questions

This research, building on Chroma’s Context Rot study, set out to answer five questions:

How fast are LLMs at processing large context windows?
When can we skip RAG entirely?
Where’s the performance cliff as context grows?
Does reranking still matter for modern LLMs?
Is long context worth the cost?

Let’s go through each one.

Speed: how fast are LLMs with large context?

The common assumption is that attention scales quadratically with context length. Luckily, modern implementations do much better than that.

Log-log and linear plots comparing actual provider latency scaling against theoretical quadratic references, showing real-world scaling is much better than quadratic.

Speed benchmark setup. Six models tested via the orq.ai proxy: gpt-5-mini, gpt-4.1-nano, gemini-2.5-flash, gemini-2.0-flash-lite-001, claude-sonnet-4-20250514, claude-3-haiku-20240307. Context sizes: exact powers of 10 (10, 100, 1k, 10k, 100k, 1M tokens), 3 iterations per point (108 API calls total). Prompts asked for a 1-2 sentence summary with max_tokens=50, temperature=0. Haystacks were built from Paul Graham essays + ArXiv papers, shuffled at the sentence level, and trimmed character-by-character against the gpt-4o tokenizer to hit the target length exactly. Caveat: the short output means these numbers are dominated by prefill, not generation — streaming UX will look different.
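The benchmark grid described above is small enough to enumerate explicitly. The sketch below is not the original harness: `build_grid` and `trim_to_token_budget` are hypothetical names, and the whitespace tokenizer is a stand-in assumption for the gpt-4o tokenizer used in the actual runs.

```python
from itertools import product

# Model identifiers and context sizes as listed in the benchmark setup.
MODELS = [
    "gpt-5-mini",
    "gpt-4.1-nano",
    "gemini-2.5-flash",
    "gemini-2.0-flash-lite-001",
    "claude-sonnet-4-20250514",
    "claude-3-haiku-20240307",
]
CONTEXT_SIZES = [10**exp for exp in range(1, 7)]  # 10, 100, ..., 1_000_000
ITERATIONS = 3


def build_grid():
    """Enumerate every (model, context size, iteration) combination."""
    return [
        {"model": model, "context_tokens": size, "iteration": iteration}
        for model, size, iteration in product(MODELS, CONTEXT_SIZES, range(ITERATIONS))
    ]


def whitespace_token_count(text):
    """Stand-in tokenizer (assumption): one token per whitespace-separated word."""
    return len(text.split())


def trim_to_token_budget(text, target_tokens, count_tokens=whitespace_token_count):
    """Drop trailing characters until the text fits the token budget."""
    while text and count_tokens(text) > target_tokens:
        text = text[:-1]
    return text


grid = build_grid()
print(len(grid))  # 6 models x 6 sizes x 3 iterations = 108 calls
```

Swapping `count_tokens` for a real tokenizer is the only change needed to trim against actual token counts; the character-by-character loop is slow but simple, which is acceptable for a one-off haystack build.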

Latency starts climbing after 10k tokens

Across providers, there’s a clear inflection point around 10k tokens where latency starts increasing meaningfully. This is the speed threshold, distinct from the quality cliff at 100k tokens we’ll see later.

Line chart showing response duration increasing across OpenAI, Google, and Anthropic APIs as context size grows from 1k to 1M tokens, with a clear inflection after 10k tokens.

From 100k to 1M tokens, latency increases between 4x and 10x. At 100k tokens you’re looking at roughly 5 seconds; at 1M, that’s 20+ seconds.

Time-to-first-token tells a similar story

If you’re streaming responses, TTFT is what your user actually feels. It tracks total duration closely — prefill dominates.

Time-to-first-token per provider across context sizes, showing prefill-dominated latency that grows smoothly with context length.

Fitting exponential curves to the per-provider data makes the scaling behavior easier to compare directly:

Exponential fits per provider showing scaling coefficients; Gemini scales more gracefully than GPT and Claude at large context sizes.

Token throughput flattens out

While latency increases, token throughput (tokens per second) holds relatively steady rather than collapsing. This suggests the latency increase is roughly proportional to context size, not quadratic.

Token throughput (tokens/sec) per provider across context sizes, showing throughput holds steady rather than collapsing at larger context sizes.
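One way to check "proportional, not quadratic" on your own measurements is to estimate the scaling exponent from a log-log fit: a slope near 1 means linear scaling, a slope near 2 means quadratic. The sketch below is a plain power-law fit, which is not necessarily the fit used for the charts in this post, and the latencies in it are synthetic values generated from known formulas, not benchmark results.

```python
import math


def scaling_exponent(context_sizes, latencies):
    """Least-squares slope of log(latency) against log(context size)."""
    xs = [math.log(n) for n in context_sizes]
    ys = [math.log(t) for t in latencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


def throughput(context_tokens, latency_seconds):
    """Prompt tokens processed per second of wall-clock latency."""
    return context_tokens / latency_seconds


# Synthetic latencies (assumption): one perfectly linear series, one quadratic.
sizes = [1_000, 10_000, 100_000, 1_000_000]
linear_latencies = [2e-5 * n for n in sizes]
quadratic_latencies = [2e-11 * n**2 for n in sizes]

print(round(scaling_exponent(sizes, linear_latencies), 3))
print(round(scaling_exponent(sizes, quadratic_latencies), 3))
```

With a linear series the throughput is the same at every context size, which is the flat line the throughput chart is looking for.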

Google’s three-tier speed system

Google deserves special mention here. Their model lineup (Gemini Flash Lite, Flash, and Pro) creates a well-differentiated tiered system where lighter models are genuinely faster and all reliably scale to 1M tokens.

Gemini Flash Lite, Flash, and Pro throughput across context sizes, showing a clean three-tier speed differentiation.

GPT and Claude don’t show this same clean tiering; their models cluster closer together in speed, with less predictable differentiation across context sizes.

Speed takeaways

Scaling beyond 100k tokens is costly: expect a 4-10x latency increase between 100k and 1M tokens
Gemini is often the fastest for large context workloads
Latency scaling is better than quadratic, but still significant

Quality: how well do context windows actually work?

The findings in this section come from Chroma’s Context Rot research, which goes well beyond standard benchmarks. I’ll summarize the key experiments here, but the full report is well worth reading.

Needle-in-a-Haystack (NIAH) benchmarks look great on paper. You insert a fact into a long document, ask about it, and models nail it. But how well do they work for non-trivial tasks?

Experiment 1: What happens when the needle doesn’t look like the question?

Standard NIAH benchmarks typically have high cosine similarity between the question and the inserted answer. Real-world scenarios often don’t. You might ask about “carbon emissions targets” and the answer is buried in a paragraph about “Scope 3 downstream value chain assessments.”

The Chroma team split needles into two groups based on embedding similarity:

Similar to the query (easy mode)
Dissimilar to the query (real-world mode)

Needle-in-a-haystack accuracy for similar vs dissimilar question-answer pairs across context sizes, showing dissimilar pairs degrade sharply after 100k tokens. Source: Chroma Context Rot.

Key finding: Dissimilar question-answer pairs are challenging for all models, especially after 100k tokens. Smaller models degrade faster.
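The similar/dissimilar split can be sketched with cosine similarity over embeddings. This is an illustration only: the threshold of 0.5, the function names, and the two-dimensional toy vectors are assumptions, not details taken from the Chroma report.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def split_needles(query_vector, needles, threshold=0.5):
    """Partition (text, vector) needles by similarity to the query.

    The threshold is a hypothetical cut-off chosen for illustration.
    """
    similar, dissimilar = [], []
    for text, vector in needles:
        bucket = similar if cosine_similarity(query_vector, vector) >= threshold else dissimilar
        bucket.append(text)
    return similar, dissimilar


# Toy vectors (assumption); real embeddings have hundreds of dimensions.
query = [1.0, 0.0]
needles = [
    ("carbon emissions targets for 2030", [0.9, 0.1]),
    ("Scope 3 downstream value chain assessments", [0.1, 0.9]),
]
similar, dissimilar = split_needles(query, needles)
print(similar)
print(dissimilar)
```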

Experiment 2: The distractor problem

In real documents, there’s rarely just one relevant-looking passage. Consider a coding agent with 10 different versions of your updated function in the context window. Or an annual report where multiple sections discuss similar metrics in different contexts.

Context rot in the wild

If you’ve used Claude Code, Cursor, or any agentic coding tool on a long session, you’ve already seen this failure mode. The agent creates a v2 of a function, then a v3, then patches the original, then forgets which version is canonical. It writes a checkpoint file, then starts over from scratch the next turn. It does an incomplete grep, decides nothing exists, and reimplements what it missed. These aren’t bugs in the tool — they’re what happens when eight competing versions of “the right answer” sit in the same context window and the model picks whichever one is most salient right now. Which is exactly the distractor problem, just at production scale.

The Chroma team tested this with explicit distractors: passages similar to the answer but containing different (wrong) information.

Diagram explaining the distractor setup: one correct needle plus semantically similar but factually wrong distractor passages inserted into a long haystack.

Question: What colour was the duck I had as a child?

Needle: The duck I had when I was 10 was orange.

Distractors:

My brother’s duck was blue

The duck I had as an adult was purple

The childhood pig was pink

Chart showing model accuracy declining as the number of distractors increases, with smaller models degrading fastest. Source: Chroma Context Rot.

The results are unambiguous: more distractors mean worse performance across all models. Smaller models degrade fastest, and even a single distractor reduces performance relative to the baseline.
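To reproduce a distractor test on your own documents, the haystack can be assembled by inserting the needle and the distractors at random positions in filler text. The sketch below uses the duck example from above; `build_haystack` is a hypothetical helper and the filler sentences are placeholders, not the material used in the original study.

```python
import random


def build_haystack(filler_sentences, needle, distractors, seed=0):
    """Insert the needle and each distractor at a random sentence boundary."""
    rng = random.Random(seed)  # seeded so the layout is reproducible
    sentences = list(filler_sentences)
    for passage in [needle, *distractors]:
        position = rng.randint(0, len(sentences))  # inclusive on both ends
        sentences.insert(position, passage)
    return sentences


filler = [f"Filler sentence number {i}." for i in range(20)]  # placeholder text
needle = "The duck I had when I was 10 was orange."
distractors = [
    "My brother's duck was blue.",
    "The duck I had as an adult was purple.",
    "The childhood pig was pink.",
]

sentences = build_haystack(filler, needle, distractors)
prompt = " ".join(sentences) + "\n\nQuestion: What colour was the duck I had as a child?"
print(len(sentences))
```

Varying the number of distractors while keeping the seed fixed isolates the effect of the distractors from the effect of where the needle happens to land.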

Failure modes differ by model family

Model families fail in different ways. Claude hallucinates the least, but this comes with a trade-off.

Stacked bar chart comparing failure modes (hallucination, refusal, wrong answer) across Claude, GPT, and Gemini model families. Source: Chroma Context Rot.

Experiment 3: Long conversational QA

For a more realistic test, the Chroma team used the LongMemEval dataset: 306 chat-based questions averaging ~113k tokens of context, compared against focused prompts with only ~300 tokens of relevant context. Questions span several types — single-session preference, temporal reasoning, knowledge update, multi-session, and more — which matters because full-context performance isn’t uniform across them.

Breakdown of LongMemEval question types and counts, showing the mix of single-hop factual, temporal, multi-session, and knowledge-update questions. Source: Chroma Context Rot.

Claude refuses to answer when in doubt. Is this good or bad? It reduces hallucination but also reduces recall.

Claude LongMemEval results showing high refusal rate with full context, trading recall for reduced hallucination. Source: Chroma Context Rot.

GPT sits in the middle — less cautious than Claude, less capable than Gemini with reasoning.

GPT LongMemEval results across question types, showing middle-of-the-pack behavior between Claude's refusal-heavy style and Gemini's reasoning-boosted performance. Source: Chroma Context Rot.

Gemini performed the best overall, especially when using reasoning capabilities.

Gemini LongMemEval results showing strong performance across question types, especially with reasoning enabled. Source: Chroma Context Rot.

Quality takeaways

Long context Q&A is very much unsolved, even at “only” 113k tokens
Reasoning helps a lot (models with chain-of-thought do better)
Hallucination prevention can backfire (Claude’s caution hurts recall)
The 100k token threshold is where things start going wrong

Reranking: does it still matter?

Reranking has been a staple of RAG pipelines: retrieve broadly, then rerank to put the most relevant chunks first. But with modern LLMs handling noisy context better than ever, is it still necessary?

Experiment setup

We ran a full experiment:

RAG types: Basic RAG and Enhanced RAG (LLM query rewriting + dual retrieval + Reciprocal Rank Fusion)
Reranking: With and without, using cross-encoder/ms-marco-MiniLM-L-6-v2
Baseline: Full context window (no retrieval)
Answer model: GPT-4.1-mini (temp=0, max_tokens=500), prompt: “Answer the question based on the retrieved context.”
Embeddings: text-embedding-3-small
Chunking: 500 tokens with 50-token overlap
200 questions grouped into context-size bins: 0–25k, 25–75k, 75–150k tokens
1 to 50 chunks retrieved per query
3 runs each, ~35,000 total datapoints
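For readers unfamiliar with Reciprocal Rank Fusion, here is a minimal sketch of how two ranked lists are merged. Each document scores the sum of 1 / (k + rank) over the lists it appears in. The constant k=60 is the value commonly used in the literature; the post does not state which constant the experiment used, so treat it as an assumption, and the chunk ids are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into one list, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for the original query and the rewritten query.
original_query_hits = ["a", "b", "c"]
rewritten_query_hits = ["b", "c", "d"]
fused = reciprocal_rank_fusion([original_query_hits, rewritten_query_hits])
print(fused)
```

Chunks that appear in both lists rise to the top even when neither retriever ranked them first, which is the point of dual retrieval.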

Ground truth came from the LongMemEval “focused” gold spans: for each question, we scored every chunk in ChromaDB (~50k chunks) against the gold span using Jaccard similarity (≥0.65 threshold), ROUGE-L, and token containment. Triangulating across three metrics is stronger than cosine-only and removes the “how did you label ground truth?” objection.
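Two of the three labeling metrics are easy to sketch with the standard library. The code below assumes whitespace tokenization and lowercase matching, which may differ from the original scoring code; ROUGE-L is omitted for brevity, and how the three scores were combined into a final label is not specified above, so only the Jaccard threshold is applied here.

```python
def tokenize(text):
    """Whitespace tokenizer (assumption), case-insensitive."""
    return text.lower().split()


def jaccard(a, b):
    """Size of the token intersection divided by size of the token union."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def token_containment(gold_span, chunk):
    """Fraction of gold-span tokens that also appear in the chunk."""
    gold_tokens = set(tokenize(gold_span))
    if not gold_tokens:
        return 0.0
    return len(gold_tokens & set(tokenize(chunk))) / len(gold_tokens)


def score_chunk(gold_span, chunk, jaccard_threshold=0.65):
    """Score one chunk against the gold span."""
    jaccard_score = jaccard(gold_span, chunk)
    return {
        "jaccard": jaccard_score,
        "containment": token_containment(gold_span, chunk),
        "jaccard_match": jaccard_score >= jaccard_threshold,
    }


result = score_chunk("the duck was orange", "the duck was orange today")
print(result)
```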

Flow diagram of the RAG evaluation pipeline: document chunking, dual retrieval, optional RRF fusion and reranking, GPT-4.1-mini answer, LLM-as-judge scoring.

How we scored correctness

Every answer went through an LLM-as-judge (GPT-4.1, temp=0, structured Pydantic output) that produced two independent scores:

is_correct — is the answer right compared to the ground truth?
is_correct_given_context — is the answer reasonable given what was retrieved?

The gap between the two is diagnostic. If is_correct_given_context stays high but is_correct drops, retrieval missed the relevant chunk. If both drop together, the model couldn’t use the context it had. This is how we know reranking isn’t helping generation — we can measure retrieval-quality and generation-quality separately.
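The diagnostic reading of the two scores can be written down as a small classifier. This is a sketch: a plain dataclass stands in for the Pydantic model, and the category names, including the fourth "correct but unsupported" case that the text above does not discuss, are labels invented for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeVerdict:
    is_correct: bool                # right compared to the ground truth
    is_correct_given_context: bool  # reasonable given what was retrieved


def diagnose(verdict):
    """Map one judged answer to a failure category (names are hypothetical)."""
    if verdict.is_correct and verdict.is_correct_given_context:
        return "ok"
    if verdict.is_correct_given_context:
        return "retrieval_miss"      # sensible answer, relevant chunk missing
    if verdict.is_correct:
        return "correct_unsupported"  # right answer, not grounded in the context
    return "generation_failure"      # model could not use the context it had


def summarize(verdicts):
    """Count how many judged answers fall into each category."""
    return Counter(diagnose(v) for v in verdicts)


# Made-up verdicts for illustration, not experiment data.
verdicts = [
    JudgeVerdict(True, True),
    JudgeVerdict(False, True),
    JudgeVerdict(False, False),
    JudgeVerdict(True, True),
]
summary = summarize(verdicts)
print(dict(summary))
```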

You need more chunks than you think

The first surprise: hit rate (was the correct chunk even retrieved?) keeps climbing well past k=10 or k=20.

Hit rate (percentage of queries where the correct chunk was retrieved) climbing steadily from k=1 to k=50 for both basic and enhanced RAG.
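Hit rate at k is simple to compute on your own eval set: a query counts as a hit if any relevant chunk appears in the top k results. The sketch below uses made-up chunk ids, and `hit_rate_at_k` is a hypothetical helper rather than the code behind the chart.

```python
def hit_rate_at_k(results, k):
    """Fraction of queries with at least one relevant chunk in the top k.

    `results` is a list of (retrieved_ids, relevant_ids) pairs, where
    retrieved_ids is ordered best first and relevant_ids is a set.
    """
    if not results:
        return 0.0
    hits = sum(1 for retrieved, relevant in results if relevant & set(retrieved[:k]))
    return hits / len(results)


# Toy retrieval results (assumption), one pair per query.
results = [
    (["c1", "c2", "c3"], {"c3"}),  # relevant chunk only shows up at rank 3
    (["c4", "c5", "c6"], {"c4"}),  # relevant chunk at rank 1
    (["c7", "c8", "c9"], {"c0"}),  # relevant chunk never retrieved
]
for k in (1, 2, 3):
    print(k, hit_rate_at_k(results, k))
```

Sweeping k over the same results is cheap because retrieval only has to run once at the largest k.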

Performance saturates around 50 chunks, which is about 27% of the total chunks per document, and answer correctness plateaus near 92%. For reference, these documents averaged ~27k tokens split into ~181 chunks of ~150 tokens each; k=50 corresponds to roughly 20k retrieved tokens, or about 40–50 pages of text. That’s a lot more retrieval than the k=5 or k=10 that many tutorials suggest.

Reranking improves retrieval metrics but not answers

Here’s the surprising finding. Reranking clearly improves information retrieval metrics like MRR and Recall:

Mean Reciprocal Rank (MRR) with and without reranking across chunk counts, showing clear improvement from reranking.

Recall@10 with and without reranking, showing reranking improves retrieval recall.

But when we look at what actually matters (did the model get the right answer?), reranking makes essentially no difference:

Answer correctness with and without reranking across chunk counts, showing virtually no difference despite improved retrieval metrics.

The same null result shows up if you slice correctness across RAG type × reranker × chunk count as a heatmap:

Heatmap of answer correctness across RAG type, reranker on/off, and chunk count. Reranking rows and no-reranking rows are virtually indistinguishable at every column.

At least in our experiments, modern LLMs proved good enough at finding the relevant information in noisy retrieved context without needing it neatly sorted for them.
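For reference, the two retrieval metrics discussed in this section can be computed as follows. The toy example reorders the same three chunks, which is an assumption made to keep it small: MRR changes with the order, while the set of chunks handed to the model does not. That is one possible intuition for the null result, not a claim about what happened inside the experiment.

```python
def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(results):
    """Average reciprocal rank over (retrieved_ids, relevant_ids) pairs."""
    if not results:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in results) / len(results)


def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant chunks that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved[:k])) / len(relevant)


# Hypothetical chunk ids: the gold chunk moves from rank 3 to rank 1.
relevant = {"gold"}
before_rerank = ["x", "y", "gold"]
after_rerank = ["gold", "x", "y"]

print(reciprocal_rank(before_rerank, relevant), reciprocal_rank(after_rerank, relevant))
print(recall_at_k(before_rerank, relevant, 3), recall_at_k(after_rerank, relevant, 3))
```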

An adjacent surprise: shuffling sometimes helps

Since ordering doesn’t seem to matter the way the reranking story assumed, we also looked at what happens when you actively shuffle retrieved chunks. In several conditions it slightly improves correctness — consistent with Chroma’s observation that attention patterns vary by position and that “put the best chunk first” isn’t always the right prior.

Effect of shuffling retrieved context order on answer correctness, showing small but positive improvements in several configurations.

Speed comparison

For documents in this size range (~27k tokens per document), query latency for RAG and full context was surprisingly comparable.

Bar chart comparing query latency between RAG with different chunk counts and full context window, showing comparable speed for documents under 30k tokens.

That said, this comparison is for single documents. RAG’s core advantage is scaling to large corpora, and that advantage grows with corpus size. More complex RAG pipelines will also be slower (query rewriting and reranking steps add latency), but that overhead grows with the number of retrieved chunks rather than with the size of the full corpus.

RAG takeaways

You need much higher K than you think. 50 chunks saturated performance in our tests
In our experiments, reranking did not improve answer quality, even though retrieval metrics improved
Speed is comparable between RAG and full context for small-to-medium documents

Limitations

Before you go ripping out your RAG pipeline, some caveats:

Limited query complexity: mostly single-hop questions in our RAG experiments
No reranking-as-filtering: we didn’t test using reranker confidence scores to filter chunks
Limited scale: max 339 chunks per document
Limited model diversity: a small set of models tested
Single embedding model: only text-embedding-3-small

A practical decision framework

Based on these findings, here’s when to use what:

The 100k cutoff isn’t arbitrary. The Natural Questions dataset, a common realistic-QA benchmark, has documents averaging ~77k tokens with a standard deviation of ~55k. Most real QA documents land below 100k, and the quality cliff sits right where the longer ones begin.

Skip RAG when:

Your domain fits in <100k tokens
You have complex, multi-hop queries that need cross-referencing; chunking destroys these relationships even more than long context degrades them
The simplicity gain of removing retrieval infrastructure matters to your team

Use RAG when:

Your domain exceeds 100k tokens
You’re dealing with simple, factual queries
You need to search across a large corpus (RAG scales, context windows don’t)

And in both cases:

If it fits in the context window, speed is likely comparable
Use more chunks than you think (k=20-50, not k=5)
Question your reranking step. It might not be helping
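The framework above reduces to a couple of checks. The sketch below encodes only the two hard criteria (domain size and corpus search); query complexity and team preferences are judgment calls it deliberately leaves out, and the function name and return values are hypothetical.

```python
QUALITY_CLIFF_TOKENS = 100_000  # threshold used throughout this post


def recommend(domain_tokens, needs_corpus_search=False):
    """Return "context-only" or "rag" for a given domain size."""
    if needs_corpus_search:
        return "rag"  # searching a large corpus is where RAG scales
    if domain_tokens < QUALITY_CLIFF_TOKENS:
        return "context-only"  # fits comfortably below the quality cliff
    return "rag"


print(recommend(27_000))
print(recommend(500_000))
print(recommend(27_000, needs_corpus_search=True))
```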

What to do Monday morning

Audit your RAG pipeline. Is your domain under 100k tokens? You might not need RAG at all.
Try context-only for small domains. The simplicity gain is massive.
Crank up K. Run your existing eval set with k=50 and compare against your current k. The improvement may surprise you.
A/B test removing reranking. Measure answer quality, not just retrieval metrics. If correctness doesn’t change, you can drop the complexity.
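For the reranking A/B test, a paired comparison on the same questions is more informative than two separate accuracy numbers, because it shows how many questions actually flipped. The sketch below assumes you already have one correct/incorrect judgment per question for each arm; `compare_arms` and the sample judgments are made up for illustration.

```python
def compare_arms(baseline_correct, variant_correct):
    """Compare two arms judged on the same questions, in the same order."""
    if len(baseline_correct) != len(variant_correct):
        raise ValueError("both arms must cover the same questions")
    if not baseline_correct:
        raise ValueError("need at least one question")
    total = len(baseline_correct)
    pairs = list(zip(baseline_correct, variant_correct))
    return {
        "baseline_accuracy": sum(baseline_correct) / total,
        "variant_accuracy": sum(variant_correct) / total,
        # Questions where exactly one arm was right (the discordant pairs).
        "only_baseline_correct": sum(1 for b, v in pairs if b and not v),
        "only_variant_correct": sum(1 for b, v in pairs if v and not b),
    }


# Made-up judgments: with reranking (baseline) vs without (variant).
with_reranking = [True, True, False, True]
without_reranking = [True, False, True, True]
report = compare_arms(with_reranking, without_reranking)
print(report)
```

If the two accuracies match and the discordant counts are small and balanced, dropping the reranker is unlikely to change answer quality on that eval set.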

Acknowledgments

The context window quality experiments (the quality section above) come directly from Chroma’s excellent Context Rot article. Their work was a major inspiration for this talk and this post. Thanks to orq.ai for providing unified LLM API access and observability that made running the speed and reranking experiments across multiple providers feasible.

The full code and experiment data are available on GitHub.

This post is based on my PyData Amsterdam 2025 talk “Context is King: Your RAG Pipeline is Probably Overkill.” If you have questions or want to discuss your own RAG challenges, feel free to reach out on LinkedIn.