2025 RAG Review and Outlook
Recently I was selecting a RAG approach for a Q&A agent in one of our projects. I investigated a bunch of options, and this post is a review and a summary. (Damn, the summary I tried to co-write with ChatGPT didn't turn out well, so I still have to write it myself.)
This review starts with “what RAG is”: what classic RAG means, what makes it hard, and what challenges it faces. Then I’ll talk about what Agentic RAG is, and share one Agentic RAG design that I currently find reasonably solid.
What Is RAG
RAG (Retrieval-Augmented Generation) aims to provide LLMs with private/proprietary content to reduce hallucinations. A typical workflow looks like this:
Offline:
- Compute embeddings for documents (or document chunks)
- Store embeddings in a vector database / vector index
Online:
- Convert the user question into an embedding
- Retrieve relevant documents/chunks from the vector store using vector similarity search
- Feed the retrieved content as context into the LLM so it can generate an answer
Intuitively, an embedding turns a piece of text into a list of numbers (a vector); two texts are considered similar when their vectors are close under some distance or similarity measure, such as cosine similarity.
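For concreteness, here is a minimal sketch of that offline/online flow in Python, assuming the sentence-transformers library and using a plain NumPy matrix as a stand-in for a real vector database (the model name is just an example):

```python
# A minimal sketch of the classic RAG retrieval flow (assumes sentence-transformers).
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

# Offline: embed the chunks and keep them as a matrix (stand-in for a vector DB).
chunks = ["RAG combines retrieval with generation.",
          "BM25 is a lexical ranking function."]
chunk_vecs = model.encode(chunks, normalize_embeddings=True)

# Online: embed the question and rank chunks by cosine similarity
# (dot product of normalized vectors).
question_vec = model.encode(["What is RAG?"], normalize_embeddings=True)[0]
scores = chunk_vecs @ question_vec
top_k = np.argsort(-scores)[:1]
context = "\n".join(chunks[i] for i in top_k)  # fed to the LLM as context
```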
Classic RAG vs. Traditional Search
Similarities
Whether it is traditional search or RAG, the overall pipeline (offline and online) usually includes:
- Data ingestion → text processing → indexing
- Query processing (tokenization / rewriting / expansion, etc.) → retrieval (topN) → ranking / re-ranking (topK)
- Output (for humans to read or for models to consume)
“Retrieval + re-ranking” is a general paradigm: first use a high-throughput model to get candidates, then use a more expensive model to refine the ranking. 1 2 3 4
Key Differences
Index structure and representation focus differ
Traditional search (lexical-first)
- The canonical primary index is the inverted index: term → postings list (docid, term frequency, positions, etc.), enabling efficient boolean retrieval, phrase/proximity queries, BM25 scoring, and more. 5
- Ranking is often based on BM25; a typical implementation is Lucene/Elasticsearch. 6
One-sentence explanations:
Inverted index: a mapping from words to documents, for quickly finding documents that contain a given word.
BM25 (Best Matching 25): a keyword-based retrieval algorithm that computes relevance using signals such as term frequency, inverse document frequency (rare terms weigh more), and document length normalization (long documents tend to match more terms, so penalize overly long documents). It is strong for exact/lexical matching and weaker for semantic similarity. Input is query + document statistics; output is a relevance score.
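To make those ingredients concrete, here is a toy BM25 scorer; the k1/b values are the common defaults, and this is an illustration rather than a production implementation:

```python
# Toy BM25: term frequency, IDF (rare terms weigh more), length normalization.
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    doc_terms = [doc.lower().split() for doc in docs]
    avgdl = sum(len(d) for d in doc_terms) / len(doc_terms)
    n = len(docs)
    scores = []
    for terms in doc_terms:
        tf = Counter(terms)
        score = 0.0
        for t in query_terms:
            df = sum(1 for d in doc_terms if t in d)         # document frequency
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)  # rare terms weigh more
            denom = tf[t] + k1 * (1 - b + b * len(terms) / avgdl)  # length normalization
            score += idf * tf[t] * (k1 + 1) / denom
        scores.append(score)
    return scores

print(bm25_scores(["vector", "index"],
                  ["an inverted index maps terms to docs",
                   "vectors capture semantic similarity"]))
```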
RAG (vector / passage-first)
- Knowledge bases are often split into chunks/passages. Each chunk gets an embedding, and an approximate nearest neighbor (ANN) vector index is built for semantic retrieval. BM25 lexical indices are also commonly kept for hybrid retrieval. 7
Intuition: traditional search is more like “keyword matching + relevance”, while RAG is more like “semantic similarity + evidence passages”.
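As a concrete sketch, building such an ANN index might look like the following, assuming FAISS (hnswlib, ScaNN, or a hosted vector database would play the same role); the embeddings here are random placeholders standing in for real chunk embeddings:

```python
# Minimal ANN index sketch with FAISS (assumption; other ANN libraries work too).
import numpy as np
import faiss

dim = 384
chunk_vecs = np.random.rand(10_000, dim).astype("float32")
faiss.normalize_L2(chunk_vecs)   # normalize so inner product == cosine similarity

index = faiss.IndexHNSWFlat(dim, 32, faiss.METRIC_INNER_PRODUCT)  # HNSW graph index
index.add(chunk_vecs)

query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)   # approximate top-5 nearest chunks
```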
Different output goals
Traditional search outputs
- The output is a list of documents/web pages, possibly with snippets; the user judges relevance by reading.
RAG outputs
- The output is answer text. Retrieved passages are not necessarily shown directly (or only partially); instead they serve as external memory for the LLM during generation. That’s essentially what “retrieval-augmented generation” means. 7
This implies two must-have pieces in implementation:
- Context construction (prompt assembly): which chunks to pick, how to truncate, and what order to place them in the context window.
- Grounding / citation constraints: to reduce “sounds-plausible” hallucinations, RAG often requires answers to be traceable to retrieved evidence (e.g., forced citations, sentence-level evidence).
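A minimal sketch of context construction with a citation constraint could look like this; the numbering convention, the character budget, and the instruction wording are just one possible choice:

```python
# Assemble a prompt from ranked chunks and force the answer to cite evidence.
def build_prompt(question: str, chunks: list[str], max_chars: int = 6000) -> str:
    numbered, used = [], 0
    for i, chunk in enumerate(chunks, start=1):   # chunks assumed already ranked
        piece = f"[{i}] {chunk}\n"
        if used + len(piece) > max_chars:         # crude truncation by length
            break
        numbered.append(piece)
        used += len(piece)
    context = "".join(numbered)
    return (
        "Answer the question using ONLY the numbered evidence below. "
        "Cite the evidence you used as [n] after each claim; "
        "if the evidence is insufficient, say so.\n\n"
        f"Evidence:\n{context}\nQuestion: {question}\nAnswer:"
    )
```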
Re-ranking is more critical in RAG
Traditional search also uses LTR/neural re-ranking, but RAG correctness depends much more on the quality of topK evidence: if evidence is wrong, the LLM will confidently produce a wrong answer. Common patterns in RAG include:
- bi-encoder retrieval (high throughput) → cross-encoder re-ranking (high precision) 8 1
- or hybrid retrieval followed by re-ranking (Elastic also describes this as a typical advanced retrieval/re-ranking pipeline). 2 4
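For example, a cross-encoder re-ranking step on top of first-stage retrieval might look like this, assuming the sentence-transformers CrossEncoder class (the model name is a placeholder):

```python
# Re-rank first-stage candidates with a cross-encoder (slower, more precise).
from sentence_transformers import CrossEncoder

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    # The cross-encoder scores each (query, passage) pair jointly, unlike a
    # bi-encoder, which embeds query and passage independently.
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:top_k]]
```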
Cost, latency, and data security
- Cost/latency: search costs mostly come from retrieval and ranking; RAG adds LLM inference (token cost, latency, concurrency pressure).
- Data security: RAG sends retrieved text to a model (possibly an external service), so permission filtering, redaction, and auditing are usually more critical. Traditional search can simply return results to the user within the system boundary.
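As a sketch of the permission-filtering point above, one option is to attach ACL metadata to each chunk at ingestion time and drop anything the user may not see before the text leaves the system boundary. The field names here are hypothetical:

```python
# Hypothetical ACL filtering before retrieved chunks are sent to the LLM.
def filter_by_acl(chunks: list[dict], user_groups: set[str]) -> list[dict]:
    # Each chunk is assumed to carry an "allowed_groups" field copied from its
    # source system; keep only chunks the current user is entitled to see.
    return [c for c in chunks if set(c.get("allowed_groups", [])) & user_groups]
```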
Challenges in RAG
Document processing
There are many document formats: txt, pdf, docx, html, markdown, etc. Each format requires different handling. PDFs are often non-standard and therefore tricky to process. Tables and images in documents are also hard to handle well.
A common approach is to use specialized document processing models for targeted extraction/normalization. This step is similar to data cleaning.
Chunking
Long documents are hard to work with, so we split them into chunks. The challenges are:
- After splitting, a chunk’s context may be incomplete, reducing quality
- Splitting can break a sentence or a logical unit
There is no perfect solution for chunking; in practice, the main knob is the chunk size. 9 10
Anyscale’s evaluation 11 found that chunk size can have a large impact on RAG performance, but increasing it indefinitely eventually degrades results.
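For reference, a toy fixed-size chunker with overlap looks like this; real pipelines usually count model tokens rather than words and try to respect sentence or section boundaries:

```python
# Toy fixed-size chunker with overlap, splitting on whitespace "tokens".
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        end = start + chunk_size
        chunks.append(" ".join(words[start:end]))
        if end >= len(words):
            break
        start = end - overlap   # overlap keeps some context across the cut
    return chunks
```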
Retrieval quality
Retrieval quality is arguably the most important part of RAG. It directly determines answer quality, and improving retrieval can significantly improve RAG outputs.
Some approaches that attempt to improve retrieval quality:
- Hybrid retrieval that combines BM25 lexical matching with vector search
- Cross-encoder re-ranking of the first-stage candidates
- Query rewriting/expansion before retrieval
- Tuning how documents are chunked (see the chunking discussion above)
Data security
Many RAG systems centralize data for unified processing. Moving scattered data into a vector store can bypass the original system’s access controls, introducing security and governance burdens. 14 15
Some solutions try to handle this via Agentic RAG, i.e., dynamically fetching data from different systems at runtime.
Intent understanding
Intent understanding is a key part of RAG: it determines how to map a user question to the retrieval query. But user intent can be complex or subtle, making intent understanding hard.
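One common tactic (not a complete solution) is to let an LLM rewrite the user question into a standalone retrieval query before searching. Here is a sketch assuming the OpenAI Python client; the model name and prompt wording are placeholders:

```python
# Rewrite a (possibly context-dependent) question into a standalone search query.
from openai import OpenAI

client = OpenAI()

def rewrite_query(question: str, history: list[str]) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[
            {"role": "system",
             "content": "Rewrite the user's question as a standalone search query. "
                        "Resolve pronouns using the conversation history."},
            {"role": "user",
             "content": f"History: {history}\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content.strip()
```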
Data refresh
Vectorized data needs periodic updates from source systems. This is usually done offline.
Overlong context
Even after retrieval, the context assembled for the final answer can still be very long, and LLMs may suffer from “lost in the middle”: evidence placed in the middle of a long context tends to get less attention. 16
Evaluation
How do we measure RAG quality? It’s very hard, and there is no perfect solution yet. The same problem also exists for agents. 17 18
Multimodal RAG
Today, RAG mostly focuses on text, but many real-world cases require multimodal data such as images, video, and audio. 19 20
RAG Terminology
Offline (inputs):
- Corpus: the collection (a bunch of documents, PDFs, web pages, etc.)
- Document: a single file in the corpus
- Chunk: a unit after splitting. It’s the basic unit for processing and retrieval; we embed and retrieve chunks.
- Text splitter: a component that splits documents into chunks
Offline/online (processing):
- Token: the smallest unit in LLM tokenization. Chunk size is often measured in tokens (a token-counting sketch follows this terminology list). Tokens do not map 1:1 to words; in ChatGPT, 1,000 tokens is roughly 750 English words or 400–500 Chinese characters.
- Embedding: a numeric vector produced by sending a chunk into a bi-encoder; one chunk corresponds to one embedding.
Online (retrieval):
- Context window: the maximum number of tokens an LLM can handle in one request. The retrieved Top-K chunks, together with the rest of the prompt (and the generated answer), must fit within this limit.
- Hit / Node: in some frameworks (e.g., LlamaIndex), a retrieved chunk may be called a node or a hit.
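Since chunk sizes and context budgets are measured in tokens, it helps to count them explicitly rather than estimate from character counts. A small sketch assuming the tiktoken library; the encoding name is just one common choice:

```python
# Count tokens to size chunks and budget the context window (assumes tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def token_len(text: str) -> int:
    return len(enc.encode(text))

print(token_len("Retrieval-augmented generation in one sentence."))
```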
Agentic RAG
We chose an Agentic RAG approach mainly for a few reasons:
- We already have an internal RAG platform
- Our data is “hot” (changes frequently), so a static RAG platform is not a great fit
- We want finer control over what data a user is allowed to access
In essence, Agentic RAG is not fundamentally different from “agents”: both solve problems via an agent that can use tools. A rough online flow is: 21 22 23
- Intent understanding: understand the user question
- The agent retrieves data online via tool use
- The agent evaluates the data; if evaluation fails, retrieve again and re-evaluate until it passes 24 25
- The agent generates the final answer
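Put together, the loop above can be sketched as follows; the retrieve, evaluate, and answer callables are hypothetical stand-ins for real tool calls, an evaluator (e.g., an LLM-as-judge), and the answering model:

```python
# Skeleton of an agentic retrieval loop; inject real implementations as callables.
def agentic_rag(question, retrieve, evaluate, answer, max_rounds=3):
    evidence = []
    for _ in range(max_rounds):
        evidence += retrieve(question, evidence)   # tool use: fetch more data online
        if evaluate(question, evidence):           # judge: is the evidence sufficient?
            break
    return answer(question, evidence)              # generate the grounded final answer
```

The max_rounds cap matters in practice: without it, a strict evaluator can keep the loop retrieving forever and burn tokens.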
Each step has plenty of pitfalls. Briefly, many challenges that exist in classic RAG also show up in Agentic RAG:
- Retrieval: choosing a good tool to fetch data is crucial, but not easy. Some approaches build multi-agent systems 26, which introduces many details: how to coordinate multiple agents, how they share context, whether each agent needs to re-interpret intent, etc. Once you connect tools, you also face tool governance: how to select from many tools efficiently, how to manage them, how to integrate external systems, and so on.
- Intent understanding: the perennially hard problem.
- Context management
- Result evaluation
- Performance: token cost, ROI, latency budgets, etc.
References
- Using Cross-Encoders as reranker in multistage vector search.
- Advanced RAG Techniques: Hybrid Search and Re-ranking (dasroot).
- arXiv: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.
- arXiv: Rethinking Chunk Size for Long-Document Retrieval.
- Anyscale: A Comprehensive Guide for Building RAG-based LLM Applications Part 1.
- TechRadar: RAG is dead? Enterprises shifting to agent-based architectures.
- ACL Anthology: Multimodal Retrieval-Augmented Generation (MAGMaR 2025).
- mRAG: Elucidating the Design Space of Multi-modal Retrieval-Augmented Generation.
- Azure SQL: Improve the “R” in RAG and embrace Agentic RAG.
- Microsoft: Bonus Journey — Agentic RAG (Azure AI Foundry Blog).
- arXiv: A Collaborative Multi-Agent Approach to RAG Across Diverse Data.