AI // Retrieval · 2026-05-29 · 6 min read

RAG Is Not a Silver Bullet — It's a Retrieval Problem

Why most retrieval-augmented generation systems underperform, and the unglamorous fixes that actually move the needle.

Varun Raj Manoharan, Founder & Principal Engineer

RAG, Retrieval, Architecture, Evals

Key takeaways

Most RAG systems underperform because of retrieval, not the language model, so fix retrieval before rewriting the system prompt.
Measure recall directly by checking whether the correct chunk appears in the top-k for questions you already know the answers to.
Chunk on structure like headings and paragraphs, run keyword search alongside vector search, and rerank candidates with a cross-encoder.
Skip RAG for questions that span an entire corpus, such as finding themes across 10,000 tickets, and use a real pipeline or fine-tune instead.

Overview

RAG gets pitched as the antidote to hallucination. Wire a vector store up to your model, point it at your own corpus, and watch it answer questions grounded in your data. Clean story. Then you ship the thing, a user asks something whose answer is sitting in paragraph two of a doc you indexed last week, and the model invents a confident, plausible, completely wrong answer instead.

Here's the part nobody wants to hear: the model is rarely the bottleneck. Retrieval is. RAG is a search engine with an LLM stapled to the output, and most teams I've watched build these systems sink ninety percent of their effort into the LLM half — the prompt, the temperature, the model swap from one frontier vendor to the next — and almost none into the search. That's backwards.

Bad retrieval, bad answers

The generator works with the context you hand it. Full stop. If the right chunk never lands in the top-k, prompt-tuning is rearranging deck chairs — you can't summarize text the model never saw. So where does retrieval actually break? A few repeat offenders I keep running into.

Fixed-size character chunking. You split on 1,000 characters with a 200-character overlap because that's what the LangChain tutorials use, and it cheerfully guillotines a sentence mid-clause, orphaning the half that carried the meaning.

Off-the-shelf embeddings that have never seen your domain. A generic model like text-embedding-3-small can map "claim" in an insurance policy and "claim" in a git commit log into roughly the same neighborhood of its 1,536-dimensional space, because it was never trained to tell your domain's sense of the word apart. They are not the same thing.

Single-shot retrieval. One query, one pass, done. The first time I really got bitten by this was a support bot where a user typed "it won't connect": the embedding pulled in five chunks about database connection pooling, the actual answer was a Wi-Fi pairing step three docs over, and the model dutifully hallucinated a fix from the wrong context. Ambiguous query in, garbage neighborhood out. Single-shot retrieval falls over the moment the question isn't phrased the way your docs are.
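To make the chunking failure concrete, here is a minimal sketch that contrasts a blind character window with a splitter that packs whole paragraphs. The function names, the sample document, and the tiny sizes are all made up for illustration; this is not how any particular library implements its splitter.

```python
import re


def fixed_size_chunks(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Blind character window: cuts wherever the count runs out."""
    step = size - overlap  # assumes size > overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


def structural_chunks(text: str, max_chars: int = 1000) -> list[str]:
    """Split on blank lines, then pack whole paragraphs up to max_chars.

    A single paragraph longer than max_chars is kept whole rather than cut.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            chunks.append(current)  # close the chunk at a paragraph boundary
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Hypothetical two-section help doc, small enough to inspect by eye.
doc = (
    "Pairing\n\nHold the button for five seconds until the light blinks.\n\n"
    "Troubleshooting\n\nIf it won't connect, forget the network and pair again."
)

for chunk in fixed_size_chunks(doc, size=40, overlap=10):
    print(repr(chunk))  # windows start and stop mid-sentence
for chunk in structural_chunks(doc, max_chars=80):
    print(repr(chunk))  # each heading stays with its own paragraph
```

Real documents usually need a heading-aware pass as well (Markdown headers, HTML sections), but the boundary rule is the same: close a chunk where an idea ends, not where a counter does.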

Fix retrieval before you touch the prompt

When RAG underperforms, the reflex is to open the prompt file and start tinkering. Resist it. Not yet, anyway. Instrument retrieval in isolation first — treat it as its own subsystem with its own metrics, divorced from the generator entirely.

Measure recall directly. Build a little eval set — fifty, a hundred questions where you already know which chunk holds the answer — and ask the only question that matters: does the gold chunk land in the top-k? If recall@10 is sitting at sixty percent, the generator was doomed forty percent of the time before it read a single token, and no system prompt on earth fixes that. Fix the retrieval, then measure again. Most teams get this wrong; they tune the part they can see instead of the part that's failing.
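A recall check like the one described above can be a few lines. This is a sketch under assumptions: the `retrieve(question, k)` callable, the chunk ids, and the canned results are hypothetical stand-ins for whatever your search stack exposes.

```python
from typing import Callable, Sequence

# Hypothetical interface: takes (question, k), returns chunk ids, best match first.
Retriever = Callable[[str, int], Sequence[str]]


def recall_at_k(
    eval_set: list[tuple[str, str]], retrieve: Retriever, k: int = 10
) -> tuple[float, list[tuple[str, str, list[str]]]]:
    """Return (recall, misses); a miss is a question whose gold chunk is not in the top-k."""
    if not eval_set:
        raise ValueError("eval_set is empty")
    misses = []
    for question, gold_id in eval_set:
        top_ids = list(retrieve(question, k))[:k]
        if gold_id not in top_ids:
            misses.append((question, gold_id, top_ids))
    recall = 1 - len(misses) / len(eval_set)
    return recall, misses


# Toy retriever with canned results, standing in for a real index.
canned = {
    "it won't connect": ["db-pooling-1", "db-pooling-2"],
    "how do I reset my password": ["auth-reset-1", "auth-sso-4"],
}


def toy_retrieve(question: str, k: int) -> list[str]:
    return canned[question][:k]


eval_set = [
    ("it won't connect", "wifi-pairing-3"),
    ("how do I reset my password", "auth-reset-1"),
]
recall, misses = recall_at_k(eval_set, toy_retrieve, k=2)
print(f"recall@2 = {recall:.2f}")
for question, gold_id, got in misses:
    print(f"MISS {question!r}: wanted {gold_id}, got {got}")
```

Returning the misses alongside the number matters: the score tells you how bad retrieval is, and the miss list tells you which queries to go read.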

So what actually moves the number? Chunk on structure: split on headings, paragraphs, semantic boundaries, whole ideas, not a blind character window.

Run lexical search alongside vector search. BM25 catches the exact strings (SKUs, error codes, ERR_CONNECTION_REFUSED, a config flag name) that dense embeddings smear into mush, and a hybrid of the two beats either alone in basically every pipeline I've shipped.

Then rerank. Pull a wide net, k of 50, even 100, and let a cross-encoder reorder the candidates down to the top 5 the model actually sees. A bi-encoder embeds query and document separately and prays they line up; a cross-encoder reads both together and scores the pair, so it catches relevance the vector distance missed. It costs you a few hundred milliseconds. I'd skip reranking only if latency is genuinely sacred and recall is already strong. Otherwise it's the highest-leverage thing on this list.
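The hybrid-then-rerank flow can be sketched as below. Two assumptions to flag loudly: the article does not say how to merge the lexical and vector result lists, so this sketch uses reciprocal rank fusion (one common choice, with the conventional constant of 60), and `lexical_search`, `vector_search`, and `score_pair` are hypothetical callables standing in for your BM25 index, your vector store, and a cross-encoder.

```python
from collections import defaultdict
from typing import Callable, Mapping, Sequence

Search = Callable[[str, int], Sequence[str]]  # (query, k) -> chunk ids, best first
PairScorer = Callable[[str, str], float]      # (query, chunk text) -> relevance


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], c: int = 60) -> list[str]:
    """Merge ranked id lists; each list contributes 1 / (c + rank) per id."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (c + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def retrieve_then_rerank(
    query: str,
    lexical_search: Search,
    vector_search: Search,
    score_pair: PairScorer,
    texts: Mapping[str, str],
    wide_k: int = 50,
    final_k: int = 5,
) -> list[str]:
    """Cast a wide net with both retrievers, then let the pair scorer pick the top few."""
    fused = reciprocal_rank_fusion(
        [lexical_search(query, wide_k), vector_search(query, wide_k)]
    )[:wide_k]
    reranked = sorted(fused, key=lambda d: score_pair(query, texts[d]), reverse=True)
    return reranked[:final_k]
```

The pair scorer is called once per candidate, which is where the extra latency comes from: `wide_k` bounds that cost, and `final_k` bounds what the generator has to read.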

When an answer comes back wrong, don't theorize. Log the retrieved chunks and go read them. Nine times in ten the verdict is obvious on sight — the answer's sitting right there in the context and the model fumbled it, or it's plainly absent and retrieval is your culprit. Either way you've localized the bug in thirty seconds instead of an afternoon.
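For questions in your eval set, where the gold chunk is known, the same triage can be automated. A minimal sketch, assuming a hypothetical logger name and a `retrieved` list of `(chunk_id, text)` pairs captured exactly as they were sent to the model:

```python
import json
import logging

logger = logging.getLogger("rag.trace")  # hypothetical logger name


def triage(
    question: str,
    gold_id: str,
    retrieved: list[tuple[str, str]],
    answer_correct: bool,
) -> str:
    """Log what the model saw and say which half of the system failed."""
    retrieved_ids = [chunk_id for chunk_id, _ in retrieved]
    if answer_correct:
        verdict = "ok"
    elif gold_id in retrieved_ids:
        verdict = "generation miss"  # the answer was in context and the model fumbled it
    else:
        verdict = "retrieval miss"   # the answer never reached the model
    logger.info(json.dumps({
        "question": question,
        "gold_id": gold_id,
        "retrieved": [{"id": cid, "text": text} for cid, text in retrieved],
        "verdict": verdict,
    }))
    return verdict
```

For production failures with no gold label, the logged chunks are still the thing to read; the verdict just becomes a human call instead of a membership check.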

When not to reach for RAG

RAG shines when the answer lives in one place and your job is to find it. Needle in a haystack. It's terrible at the opposite shape — questions that demand you reason over the whole haystack at once. "What are the top three recurring themes across these 10,000 support tickets?" There's no top-k for that. The answer isn't in any single chunk; it's an aggregate, an emergent property of the full set, and stuffing twenty random fragments into a context window won't reconstruct it. That's a map-reduce summarization pipeline, or a fine-tune, or honestly just a SQL query and some clustering. Wrong tool, wrong shape.
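For contrast, the corpus-wide shape looks like this: every ticket gets touched once, and the answer comes from counting, not from a top-k lookup. This is a sketch of the map-reduce idea only; `label_batch` is a hypothetical callable (an LLM call or a clustering step in practice) that returns one theme label per ticket, and the keyword labeler below is a toy.

```python
from collections import Counter
from typing import Callable, Sequence


def top_themes(
    tickets: Sequence[str],
    label_batch: Callable[[Sequence[str]], Sequence[str]],
    batch_size: int = 100,
    n: int = 3,
) -> list[tuple[str, int]]:
    """Map: label every ticket in batches. Reduce: count labels over the whole set."""
    counts: Counter[str] = Counter()
    for start in range(0, len(tickets), batch_size):
        batch = tickets[start:start + batch_size]
        counts.update(label_batch(batch))  # one label per ticket
    return counts.most_common(n)


# Toy labeler: the first word stands in for a real theme classifier.
def first_word_labeler(batch: Sequence[str]) -> list[str]:
    return [ticket.split()[0] for ticket in batch]


tickets = [
    "login fails", "login loop", "refund please",
    "login error", "refund late", "crash on start",
]
print(top_themes(tickets, first_word_labeler, batch_size=2, n=2))
```

Nothing here is retrieved or ranked. The whole corpus flows through the map step, which is exactly what a top-k pipeline is designed to avoid.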

The short version

RAG isn't a feature you toggle on. It's a search system wearing an LLM as a hat, and you have to build it, instrument it, and tune it like a search system — eval set, recall numbers, hybrid retrieval, a reranker earning its keep. Do the unglamorous work upstream, hold retrieval to its own metrics, and only reach for the pattern when the problem genuinely fits the shape. Generation was never the hard part. It just gets all the attention.