Your RAG Pipeline is Probably Overkill
Six months of RAG optimization. Query rewriting, reranking, hybrid search — the full playbook. We went from 60% to 70% accuracy extracting ESG metrics from annual reports (measured on a manually labeled evaluation set). Then someone asked: what if we just put the whole document in the context window?
85%.
That question kicked off a research project that became a PyData 2025 talk. This post covers the key findings: when long context windows beat RAG, where they fall apart, and what you should actually do about it.
TL;DR:
- Under 100k tokens? Skip RAG — context-only is simpler and performs as well or better.
- The 100k token quality cliff is real — performance degrades sharply with distractors and dissimilar phrasing (per Chroma’s research).
- Reranking doesn’t improve answer quality in our experiments, even though retrieval metrics improve.
- Use way more chunks than you think — k=50 outperformed k=5 or k=10 significantly.
The problem that started it all
At a bank, we needed to extract emissions data from annual reports for ESG analysis. Traditional RAG kept failing:
- Chunking destroyed cross-references between sections
- There was no standard ESG jargon across companies
- No good ground truth dataset existed for evaluation
Meanwhile, context windows were growing rapidly. In just two years, we went from 4k to over 1M tokens. An ABN AMRO annual report is around 500k tokens — it fits.

The Lord of the Rings trilogy? That fits too. But as we’ll see, fitting in the context window and actually understanding all of it are very different things.
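Whether a document fits is the natural first check. Here is a minimal sketch of that pre-flight test, using the rough heuristic of ~4 characters per token (a real tokenizer such as tiktoken gives exact counts); the function names and the hard-coded 100k threshold are illustrative, not a library API:

```python
# Rough pre-flight check: does this document fit under the quality cliff?
# Uses the common ~4 characters-per-token heuristic for English prose;
# use a real tokenizer (e.g. tiktoken) for exact counts.

def estimate_tokens(text: str) -> int:
    """Crude token estimate: ~4 characters per token."""
    return len(text) // 4

def fits_without_rag(text: str, quality_cliff: int = 100_000) -> bool:
    """True if the document sits comfortably below the 100k-token cliff."""
    return estimate_tokens(text) < quality_cliff

small_doc = "ESG report. " * 1_000     # ~12k characters -> ~3k tokens
print(fits_without_rag(small_doc))     # -> True
```

Even a crude estimate like this is enough to route small documents straight to the context window and reserve retrieval for the rest.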

So the natural question became: can we skip RAG entirely and just put the whole document in the context window?
The research questions
This research, building on Chroma’s Context Rot study, set out to answer five questions:
- How fast are LLMs at processing large context windows?
- When can we skip RAG entirely?
- Where’s the performance cliff as context grows?
- Does reranking still matter for modern LLMs?
- Is long context worth the cost?
Let’s go through each one.
Speed: how fast are LLMs with large context?
The common assumption is that attention scales quadratically with context length. Luckily, modern implementations do much better than that.

Latency starts climbing after 10k tokens
Across providers, there’s a clear inflection point around 10k tokens where latency starts increasing meaningfully. This is the speed threshold — distinct from the quality cliff at 100k tokens we’ll see later.

From 100k to 1M tokens, latency increases between 4x and 10x. At 100k tokens you’re looking at roughly 5 seconds; at 1M, that’s 20+ seconds.
Token throughput flattens out
While latency increases, token throughput (tokens per second) holds relatively steady rather than collapsing. This suggests the latency increase is roughly proportional to context size, not quadratic.
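To reproduce these numbers on your own stack, a small timing harness is all it takes. This sketch accepts any prompt-to-completion callable (plug in your provider's client; the fake model below only keeps the example runnable without API keys) and derives rough tokens per second from output length:

```python
import time

# Minimal latency/throughput harness (a sketch, not the talk's exact code).
# `call_model` is any function taking a prompt and returning completion text.

def benchmark(call_model, prompt: str) -> dict:
    """Time one call and derive rough tokens/sec from the output length."""
    start = time.perf_counter()
    output = call_model(prompt)
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against zero division
    out_tokens = max(len(output) // 4, 1)             # ~4 chars/token heuristic
    return {"latency_s": elapsed, "tokens_per_s": out_tokens / elapsed}

# Fake model so the harness runs anywhere; swap in a real API call to measure.
result = benchmark(lambda p: "answer " * 50, prompt="summarize this report")
print(result)
```

Run it across context sizes (1k, 10k, 100k, 1M tokens of input) and the inflection points above should show up in your own latency curves.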

Google’s three-tier speed system
Google deserves special mention here. Their model lineup — Gemini Flash Lite, Flash, and Pro — creates a well-differentiated tiered system where lighter models are genuinely faster and all reliably scale to 1M tokens.

GPT and Claude don’t show this same clean tiering — their models cluster closer together in speed, with less predictable differentiation across context sizes.
Speed takeaways
- Scaling beyond 100k tokens is costly — expect 4-10x latency increase
- Gemini is often the fastest for large context workloads
- It’s better than quadratic, but still significant
Quality: how well do context windows actually work?
The findings in this section come from Chroma’s Context Rot research, which goes well beyond standard benchmarks. I’ll summarize the key experiments here, but the full report is well worth reading.
Needle-in-a-Haystack (NIAH) benchmarks look great on paper. You insert a fact into a long document, ask about it, and models nail it. But how well do models hold up on non-trivial tasks?
Experiment 1: What happens when the needle doesn’t look like the question?
Standard NIAH benchmarks typically have high cosine similarity between the question and the inserted answer. Real-world scenarios often don’t. You might ask about “carbon emissions targets” and the answer is buried in a paragraph about “Scope 3 downstream value chain assessments.”
The Chroma team split needles into two groups based on embedding similarity:
- Similar to the query (easy mode)
- Dissimilar to the query (real-world mode)

Key finding: Dissimilar question-answer pairs are challenging for all models, especially after 100k tokens. Smaller models degrade faster.
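The similarity split itself is simple to sketch: embed the question and each needle, then bucket by cosine similarity. The toy vectors and the 0.5 threshold below are illustrative; in practice the embeddings come from a model such as text-embedding-3-small:

```python
import math

# Bucket needles into "similar to the query" vs "dissimilar" by cosine
# similarity of their embeddings. Vectors and threshold are toy values.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def split_needles(query_emb, needles, threshold=0.5):
    easy, hard = [], []
    for text, emb in needles:
        (easy if cosine(query_emb, emb) >= threshold else hard).append(text)
    return easy, hard

query = [1.0, 0.0]
needles = [("similar needle", [0.9, 0.1]), ("dissimilar needle", [0.1, 0.9])]
easy, hard = split_needles(query, needles)
print(easy, hard)   # -> ['similar needle'] ['dissimilar needle']
```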
Experiment 2: The distractor problem
In real documents, there’s rarely just one relevant-looking passage. Consider a coding agent with 10 different versions of your updated function in the context window. Or an annual report where multiple sections discuss similar metrics in different contexts.
The Chroma team tested this with explicit distractors — passages similar to the answer but containing different (wrong) information:
Question: What colour was the duck I had as a child?
Needle: The duck I had when I was 10 was orange.
Distractors:
- My brother’s duck was blue
- The duck I had as an adult was purple
- The childhood pig was pink

The results are unambiguous: more distractors mean worse performance across all models. Smaller models degrade fastest, and even a single distractor reduces performance relative to the baseline.
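Building such a test case is straightforward, which makes this an easy experiment to run against your own models. A sketch (the filler text and insertion strategy are illustrative, not Chroma's actual harness):

```python
import random

# Distractor-style NIAH test case: one correct needle plus conflicting
# lookalikes, inserted at random positions into filler text.

def build_haystack(filler_paragraphs, needle, distractors, seed=0):
    rng = random.Random(seed)  # fixed seed keeps test cases reproducible
    docs = list(filler_paragraphs)
    for passage in [needle, *distractors]:
        docs.insert(rng.randrange(len(docs) + 1), passage)
    return "\n\n".join(docs)

haystack = build_haystack(
    filler_paragraphs=[f"Filler paragraph {i}." for i in range(5)],
    needle="The duck I had when I was 10 was orange.",
    distractors=["My brother's duck was blue.",
                 "The duck I had as an adult was purple."],
)
print("orange" in haystack)   # -> True
```

Vary the number of distractors and the total haystack length, and you can trace the same degradation curves the Chroma team reports.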
Failure modes differ by model family
Interestingly, model families fail in different ways. Claude hallucinates the least — but this comes with a trade-off.

Experiment 3: Long conversational QA
For a more realistic test, the Chroma team used the LongMemEval dataset — 306 chat-based questions averaging ~113k tokens of context, compared against focused prompts with only ~300 tokens of relevant context.
Claude refuses to answer when in doubt. Is this good or bad? It reduces hallucination but also reduces recall.

Gemini performed the best overall, especially when using reasoning capabilities.

Quality takeaways
- Long context Q&A is very much unsolved — even at “only” 113k tokens
- Reasoning helps a lot (models with chain-of-thought do better)
- Hallucination prevention can backfire (Claude’s caution hurts recall)
- The 100k token threshold is where things start going wrong
Reranking: does it still matter?
Reranking has been a staple of RAG pipelines — retrieve broadly, then rerank to put the most relevant chunks first. But with modern LLMs handling noisy context better than ever, is it still necessary?
Experiment setup
We ran a comprehensive experiment:
- RAG types: Basic RAG and Enhanced RAG (query rewriting + expansion)
- Reranking: With and without
- Baseline: Full context window (no retrieval)
- 200 questions from grouped document chunks
- 1 to 50 chunks retrieved per query
- 3 runs each with GPT-4.1-mini and text-embedding-3-small
- ~35,000 total datapoints
You need more chunks than you think
The first surprise: hit rate (was the correct chunk even retrieved?) keeps climbing well past k=10 or k=20.
Performance saturates around 50 chunks — which is about 27% of the total chunks per document. For reference, these documents averaged ~27k tokens split into ~181 chunks of ~150 tokens each. That’s a lot more retrieval than the k=5 or k=10 that many tutorials suggest.
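Hit rate at k is the metric behind this finding, and it is worth computing on your own eval set before trusting any default k. A minimal sketch with toy ranked lists of chunk ids:

```python
# Hit rate at k: was the gold chunk among the top-k retrieved results?

def hit_rate_at_k(ranked_results, gold_ids, k):
    """Fraction of queries whose gold chunk appears in the top-k results."""
    hits = sum(gold in ranked[:k]
               for ranked, gold in zip(ranked_results, gold_ids))
    return hits / len(gold_ids)

ranked = [[3, 7, 1, 9], [2, 5, 8, 4]]    # retrieved chunk ids per query
gold = [9, 5]                            # correct chunk id per query
print(hit_rate_at_k(ranked, gold, k=2))  # -> 0.5 (only query 2 hits in top-2)
print(hit_rate_at_k(ranked, gold, k=4))  # -> 1.0
```

Sweep k from 1 to 50 with this and plot the curve; if it is still climbing at your current k, you are leaving answers on the table.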
Reranking improves retrieval metrics but not answers
Here’s the provocative finding. Reranking clearly improves information retrieval metrics like MRR and Recall.
But when we look at what actually matters — did the model get the right answer? — reranking makes essentially no difference.
At least in our experiments, modern LLMs proved robust enough to find the relevant information in noisy retrieved context without needing it neatly sorted for them.
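MRR (Mean Reciprocal Rank) is one of those retrieval metrics, and seeing how it is computed makes the disconnect concrete: a better MRR just reorders chunks the LLM was going to read anyway. A sketch with toy ranked lists:

```python
# Mean Reciprocal Rank: average of 1/rank of the gold chunk per query
# (0 contribution if the gold chunk was not retrieved at all).

def mrr(ranked_results, gold_ids):
    total = 0.0
    for ranked, gold in zip(ranked_results, gold_ids):
        if gold in ranked:
            total += 1.0 / (ranked.index(gold) + 1)
    return total / len(gold_ids)

before = [[3, 9, 1], [5, 2, 8]]   # gold chunks at ranks 2 and 1
after  = [[9, 3, 1], [5, 2, 8]]   # reranker moves the first gold to rank 1
print(mrr(before, [9, 5]))   # -> 0.75
print(mrr(after,  [9, 5]))   # -> 1.0
```

The reranker lifts MRR from 0.75 to 1.0 here, yet both contexts contain the gold chunk, which is why answer quality can stay flat while the metric improves.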
Speed comparison
For documents in this size range (~27k tokens per document), the speed between RAG and full context was surprisingly comparable.

That said, this comparison is for single documents. RAG’s core advantage is scaling to large corpora — and that advantage grows with corpus size. More complex RAG pipelines will also be slower (query rewriting, reranking steps add latency), but their cost scales linearly rather than with the full corpus size.
RAG takeaways
- You need much higher K than you think — 50 chunks saturated performance in our tests
- In our experiments, reranking did not improve answer quality — even though retrieval metrics improved
- Speed is comparable between RAG and full context for small-to-medium documents
Limitations
Before you go ripping out your RAG pipeline, some caveats:
- Limited query complexity: mostly single-hop questions in our RAG experiments
- No reranking-as-filtering: we didn’t test using reranker confidence scores to filter chunks
- Limited scale: max 339 chunks per document
- Limited model diversity: a small set of models tested
- Single embedding model: only text-embedding-3-small
A practical decision framework
Based on these findings, here’s when to use what:
Skip RAG when:
- Your domain fits in <100k tokens
- You have complex, multi-hop queries that need cross-referencing — chunking destroys these relationships even more than long context degrades them
- The simplicity gain of removing retrieval infrastructure matters to your team
Use RAG when:
- Your domain exceeds 100k tokens
- You’re dealing with simple, factual queries
- You need to search across a large corpus (RAG scales, context windows don’t)
And in both cases:
- If it fits in the context window, speed is likely comparable
- Use more chunks than you think (k=20-50, not k=5)
- Question your reranking step — it might not be helping
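The framework above collapses into one heuristic function. The thresholds mirror this post's numbers (the 100k-token quality cliff); treat them as starting points to validate on your own evals, not as law:

```python
# Decision heuristic from the framework above. Thresholds are this post's
# findings, not universal constants: re-validate them on your own data.

def recommend_approach(domain_tokens: int, large_corpus: bool = False) -> str:
    """'rag' for big corpora or 100k+ token domains, else 'full-context'."""
    if large_corpus or domain_tokens >= 100_000:
        return "rag"           # retrieve with generous k (20-50), question reranking
    return "full-context"      # simpler pipeline, comparable speed and quality

print(recommend_approach(27_000))                     # -> full-context
print(recommend_approach(500_000))                    # -> rag
print(recommend_approach(50_000, large_corpus=True))  # -> rag
```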
What to do Monday morning
- Audit your RAG pipeline — is your domain under 100k tokens? You might not need RAG at all.
- Try context-only for small domains — the simplicity gain is massive.
- Crank up K — run your existing eval set with k=50 and compare against your current k. The improvement may surprise you.
- A/B test removing reranking — measure answer quality, not just retrieval metrics. If correctness doesn’t change, you can drop the complexity.
Acknowledgments
The context window quality experiments in the quality section above come directly from Chroma’s excellent Context Rot article. Their work was a major inspiration for this talk and this post. Thanks to orq.ai for providing unified LLM API access and observability that made running the speed and reranking experiments across multiple providers feasible.
The full code and experiment data are available on GitHub.
This post is based on my PyData Amsterdam 2025 talk “Context is King: Your RAG Pipeline is Probably Overkill.” If you have questions or want to discuss your own RAG challenges, feel free to reach out on LinkedIn.