2025 RAG Review and Outlook
Recently I was selecting a RAG approach for a Q&A agent in one of our projects. I investigated a bunch of options, and this post is a review and a summary. (Damn, the summary I tried to co-write with ChatGPT didn't turn out well, so I still have to write it myself.)
This review starts with “what RAG is”: what classic RAG means, what makes it hard, and what challenges it faces. Then I’ll talk about what Agentic RAG is, and share one Agentic RAG design that I currently find reasonably solid.
What Is RAG
RAG (Retrieval-Augmented Generation) aims to provide LLMs with private/proprietary content to reduce hallucinations. A typical workflow looks like this:
Offline:
- Compute embeddings for documents (or document chunks)
- Store embeddings in a vector database / vector index
Online:
- Convert the user question into an embedding
- Retrieve relevant documents/chunks from the vector store using vector similarity search
- Feed the retrieved content as context into the LLM so it can generate an answer
Intuitively, an embedding turns a piece of text into a list of numbers (a vector); two texts are considered similar when their vectors are close under some distance or similarity measure, such as cosine similarity.
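For concreteness, here is a minimal sketch of that offline/online flow in Python, assuming the sentence-transformers library and using a plain NumPy matrix as a stand-in for a real vector database (the model name is just an example):

```python
# A minimal sketch of the classic RAG retrieval flow (assumes sentence-transformers).
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

# Offline: embed the chunks and keep them as a matrix (stand-in for a vector DB).
chunks = ["RAG combines retrieval with generation.",
          "BM25 is a lexical ranking function."]
chunk_vecs = model.encode(chunks, normalize_embeddings=True)

# Online: embed the question and rank chunks by cosine similarity
# (dot product of normalized vectors).
question_vec = model.encode(["What is RAG?"], normalize_embeddings=True)[0]
scores = chunk_vecs @ question_vec
top_k = np.argsort(-scores)[:1]
context = "\n".join(chunks[i] for i in top_k)  # fed to the LLM as context
```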
Classic RAG vs. Traditional Search
Similarities
Whether it is traditional search or RAG, the overall pipeline (offline and online) usually includes:
- Data ingestion → text processing → indexing
- Query processing (tokenization / rewriting / expansion, etc.) → retrieval (topN) → ranking / re-ranking (topK)
- Output (for humans to read or for models to consume)
“Retrieval + re-ranking” is a general paradigm: first use a high-throughput model to get candidates, then use a more expensive model to refine the ranking. 1 2 3 4
Key Differences
Index structure and representation focus differ
Traditional search (lexical-first)
- The canonical primary index is the inverted index: term → postings list (docid, term frequency, positions, etc.), enabling efficient boolean retrieval, phrase/proximity queries, BM25 scoring, and more. 5
- Ranking is often based on BM25; a typical implementation is Lucene/Elasticsearch. 6
One-sentence explanations:
Inverted index: a mapping from words to documents, for quickly finding documents that contain a given word.
BM25 (Best Matching 25): a keyword-based retrieval algorithm that computes relevance using signals such as term frequency, inverse document frequency (rare terms weigh more), and document length normalization (long documents tend to match more terms, so penalize overly long documents). It is strong for exact/lexical matching and weaker for semantic similarity. Input is query + document statistics; output is a relevance score.
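To make those ingredients concrete, here is a toy BM25 scorer; the k1/b values are the common defaults, and this is an illustration rather than a production implementation:

```python
# Toy BM25: term frequency, IDF (rare terms weigh more), length normalization.
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    doc_terms = [doc.lower().split() for doc in docs]
    avgdl = sum(len(d) for d in doc_terms) / len(doc_terms)
    n = len(docs)
    scores = []
    for terms in doc_terms:
        tf = Counter(terms)
        score = 0.0
        for t in query_terms:
            df = sum(1 for d in doc_terms if t in d)         # document frequency
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)  # rare terms weigh more
            denom = tf[t] + k1 * (1 - b + b * len(terms) / avgdl)  # length normalization
            score += idf * tf[t] * (k1 + 1) / denom
        scores.append(score)
    return scores

print(bm25_scores(["vector", "index"],
                  ["an inverted index maps terms to docs",
                   "vectors capture semantic similarity"]))
```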
RAG (vector / passage-first)
- Knowledge bases are often split into chunks/passages. Each chunk gets an embedding, and an approximate nearest neighbor (ANN) vector index is built for semantic retrieval. BM25 lexical indices are also commonly kept for hybrid retrieval. 7
Intuition: traditional search is more like “keyword matching + relevance”, while RAG is more like “semantic similarity + evidence passages”.
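As a concrete sketch, building such an ANN index might look like the following, assuming FAISS (hnswlib, ScaNN, or a hosted vector database would play the same role); the embeddings here are random placeholders standing in for real chunk embeddings:

```python
# Minimal ANN index sketch with FAISS (assumption; other ANN libraries work too).
import numpy as np
import faiss

dim = 384
chunk_vecs = np.random.rand(10_000, dim).astype("float32")
faiss.normalize_L2(chunk_vecs)   # normalize so inner product == cosine similarity

index = faiss.IndexHNSWFlat(dim, 32, faiss.METRIC_INNER_PRODUCT)  # HNSW graph index
index.add(chunk_vecs)

query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)   # approximate top-5 nearest chunks
```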
Different output goals
Traditional search outputs
- The output is a list of documents/web pages, possibly with snippets; the user judges relevance by reading.
RAG outputs
- The output is answer text. Retrieved passages are not necessarily shown directly (or only partially); instead they serve as external memory for the LLM during generation. That’s essentially what “retrieval-augmented generation” means. 7
This implies two must-have pieces in implementation:
- Context construction (prompt assembly): which chunks to pick, how to truncate, and what order to place them in the context window.
- Grounding / citation constraints: to reduce “sounds-plausible” hallucinations, RAG often requires answers to be traceable to retrieved evidence (e.g., forced citations, sentence-level evidence).
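A minimal sketch of context construction with a citation constraint could look like this; the numbering convention, the character budget, and the instruction wording are just one possible choice:

```python
# Assemble a prompt from ranked chunks and force the answer to cite evidence.
def build_prompt(question: str, chunks: list[str], max_chars: int = 6000) -> str:
    numbered, used = [], 0
    for i, chunk in enumerate(chunks, start=1):   # chunks assumed already ranked
        piece = f"[{i}] {chunk}\n"
        if used + len(piece) > max_chars:         # crude truncation by length
            break
        numbered.append(piece)
        used += len(piece)
    context = "".join(numbered)
    return (
        "Answer the question using ONLY the numbered evidence below. "
        "Cite the evidence you used as [n] after each claim; "
        "if the evidence is insufficient, say so.\n\n"
        f"Evidence:\n{context}\nQuestion: {question}\nAnswer:"
    )
```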
Re-ranking is more critical in RAG
Traditional search also uses LTR/neural re-ranking, but RAG correctness depends much more on the quality of topK evidence: if evidence is wrong, the LLM will confidently produce a wrong answer. Common patterns in RAG include:
- bi-encoder retrieval (high throughput) → cross-encoder re-ranking (high precision) 8 1
- or hybrid retrieval followed by re-ranking (Elastic also describes this as a typical advanced retrieval/re-ranking pipeline). 2 4
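For example, a cross-encoder re-ranking step on top of first-stage retrieval might look like this, assuming the sentence-transformers CrossEncoder class (the model name is a placeholder):

```python
# Re-rank first-stage candidates with a cross-encoder (slower, more precise).
from sentence_transformers import CrossEncoder

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    # The cross-encoder scores each (query, passage) pair jointly, unlike a
    # bi-encoder, which embeds query and passage independently.
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:top_k]]
```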
Cost, latency, and data security
- Cost/latency: search costs mostly come from retrieval and ranking; RAG adds LLM inference (token cost, latency, concurrency pressure).
- Data security: RAG sends retrieved text to a model (possibly an external service), so permission filtering, redaction, and auditing are usually more critical. Traditional search can simply return results to the user within the system boundary.
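As a sketch of the permission-filtering point above, one option is to attach ACL metadata to each chunk at ingestion time and drop anything the user may not see before the text leaves the system boundary. The field names here are hypothetical:

```python
# Hypothetical ACL filtering before retrieved chunks are sent to the LLM.
def filter_by_acl(chunks: list[dict], user_groups: set[str]) -> list[dict]:
    # Each chunk is assumed to carry an "allowed_groups" field copied from its
    # source system; keep only chunks the current user is entitled to see.
    return [c for c in chunks if set(c.get("allowed_groups", [])) & user_groups]
```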
Challenges in RAG
Document processing
There are many document formats: txt, pdf, docx, html, markdown, etc. Each format requires different handling. PDFs are often non-standard and therefore tricky to process. Tables and images in documents are also hard to handle well.
A common approach is to use specialized document processing models for targeted extraction/normalization. This step is similar to data cleaning.
Chunking
Long documents are hard to work with, so we split them into chunks. The challenges are:
- After splitting, a chunk’s context may be incomplete, reducing quality
- Splitting can break a sentence or a logical unit
There is no perfect solution for chunking; in practice, the main knob is the chunk size. 9 10
Anyscale’s evaluation 11 found that chunk size can have a large impact on RAG performance, but increasing it indefinitely eventually degrades results.
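For reference, a toy fixed-size chunker with overlap looks like this; real pipelines usually count model tokens rather than words and try to respect sentence or section boundaries:

```python
# Toy fixed-size chunker with overlap, splitting on whitespace "tokens".
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        end = start + chunk_size
        chunks.append(" ".join(words[start:end]))
        if end >= len(words):
            break
        start = end - overlap   # overlap keeps some context across the cut
    return chunks
```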
Retrieval quality
Retrieval quality is arguably the most important part of RAG. It directly determines answer quality, and improving retrieval can significantly improve RAG outputs.
Some approaches that attempt to improve retrieval quality:
- Hybrid retrieval that combines BM25 lexical matching with vector search
- Cross-encoder re-ranking of the first-stage candidates
- Query rewriting/expansion before retrieval
- Tuning how documents are chunked (see the chunking discussion above)
Data security
Many RAG systems centralize data for unified processing. Moving scattered data into a vector store can bypass the original system’s access controls, introducing security and governance burdens. 14 15
Some solutions try to handle this via Agentic RAG, i.e., dynamically fetching data from different systems at runtime.
Intent understanding
Intent understanding is a key part of RAG: it determines how to map a user question to the retrieval query. But user intent can be complex or subtle, making intent understanding hard.
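One common tactic (not a complete solution) is to let an LLM rewrite the user question into a standalone retrieval query before searching. Here is a sketch assuming the OpenAI Python client; the model name and prompt wording are placeholders:

```python
# Rewrite a (possibly context-dependent) question into a standalone search query.
from openai import OpenAI

client = OpenAI()

def rewrite_query(question: str, history: list[str]) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[
            {"role": "system",
             "content": "Rewrite the user's question as a standalone search query. "
                        "Resolve pronouns using the conversation history."},
            {"role": "user",
             "content": f"History: {history}\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content.strip()
```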
Data refresh
Vectorized data needs periodic updates from source systems. This is usually done offline.
Overlong context
Even after retrieval, the context assembled for the final answer can still be very long, and LLMs may suffer from “lost in the middle”: evidence placed in the middle of a long context tends to get less attention. 16
Evaluation
How do we measure RAG quality? It’s very hard, and there is no perfect solution yet. The same problem also exists for agents. 17 18
Multimodal RAG
Today, RAG mostly focuses on text, but many real-world cases require multimodal data such as images, video, and audio. 19 20
RAG Terminology
Offline (inputs):
- Corpus: the collection (a bunch of documents, PDFs, web pages, etc.)
- Document: a single file in the corpus
- Chunk: a unit after splitting. It’s the basic unit for processing and retrieval; we embed and retrieve chunks.
- Text splitter: a component that splits documents into chunks
Offline/online (processing):
- Token: the smallest unit in LLM tokenization. Chunk size is often measured in tokens (a token-counting sketch follows this terminology list). Tokens do not map 1:1 to words; in ChatGPT, 1,000 tokens is roughly 750 English words or 400–500 Chinese characters.
- Embedding: a numeric vector produced by sending a chunk into a bi-encoder; one chunk corresponds to one embedding.
Online (retrieval):
- Context window: the maximum number of tokens an LLM can handle in one request. The retrieved Top-K chunks, together with the rest of the prompt (and the generated answer), must fit within this limit.
- Hit / Node: in some frameworks (e.g., LlamaIndex), a retrieved chunk may be called a node or a hit.
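Since chunk sizes and context budgets are measured in tokens, it helps to count them explicitly rather than estimate from character counts. A small sketch assuming the tiktoken library; the encoding name is just one common choice:

```python
# Count tokens to size chunks and budget the context window (assumes tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def token_len(text: str) -> int:
    return len(enc.encode(text))

print(token_len("Retrieval-augmented generation in one sentence."))
```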
Agentic RAG
We chose an Agentic RAG approach mainly for a few reasons:
- We already have an internal RAG platform
- Our data is “hot” (changes frequently), so a static RAG platform is not a great fit
- We want finer control over what data a user is allowed to access
In essence, Agentic RAG is not fundamentally different from “agents”: both solve problems via an agent that can use tools. A rough online flow is: 21 22 23
- Intent understanding: understand the user question
- The agent retrieves data online via tool use
- The agent evaluates the data; if evaluation fails, retrieve again and re-evaluate until it passes 24 25
- The agent generates the final answer
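Put together, the loop above can be sketched as follows; the retrieve, evaluate, and answer callables are hypothetical stand-ins for real tool calls, an evaluator (e.g., an LLM-as-judge), and the answering model:

```python
# Skeleton of an agentic retrieval loop; inject real implementations as callables.
def agentic_rag(question, retrieve, evaluate, answer, max_rounds=3):
    evidence = []
    for _ in range(max_rounds):
        evidence += retrieve(question, evidence)   # tool use: fetch more data online
        if evaluate(question, evidence):           # judge: is the evidence sufficient?
            break
    return answer(question, evidence)              # generate the grounded final answer
```

The max_rounds cap matters in practice: without it, a strict evaluator can keep the loop retrieving forever and burn tokens.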
Each step has plenty of pitfalls. Briefly, many challenges that exist in classic RAG also show up in Agentic RAG:
- Retrieval: choosing a good tool to fetch data is crucial, but not easy. Some approaches build multi-agent systems 26, which introduces many details: how to coordinate multiple agents, how they share context, whether each agent needs to re-interpret intent, etc. Once you connect tools, you also face tool governance: how to select from many tools efficiently, how to manage them, how to integrate external systems, and so on.
- Intent understanding: the perennially hard problem.
- Context management
- Result evaluation
- Performance: token cost, ROI, latency budgets, etc.
References
- Using Cross-Encoders as reranker in multistage vector search.
- Advanced RAG Techniques: Hybrid Search and Re-ranking (dasroot).
- arXiv: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.
- arXiv: Rethinking Chunk Size for Long-Document Retrieval.
- Anyscale: A Comprehensive Guide for Building RAG-based LLM Applications Part 1.
- TechRadar: RAG is dead? Enterprises shifting to agent-based architectures.
- ACL Anthology: Multimodal Retrieval-Augmented Generation (MAGMaR 2025).
- mRAG: Elucidating the Design Space of Multi-modal Retrieval-Augmented Generation.
- Azure SQL: Improve the “R” in RAG and embrace Agentic RAG.
- Microsoft: Bonus Journey — Agentic RAG (Azure AI Foundry Blog).
- arXiv: A Collaborative Multi-Agent Approach to RAG Across Diverse Data.