What is Retrieval-Augmented Generation (RAG)?

Also called: RAG, agentic RAG

Updated June 24, 2026

Quick definition

Retrieval-Augmented Generation (RAG) is a technique that grounds a model’s answer in external information fetched at query time. Rather than relying only on what the model learned during training, a retrieval step finds relevant documents and places them in the prompt, so the response is conditioned on current, specific sources.

Why RAG matters

A language model’s knowledge is fixed at training time and limited to what it absorbed. It does not know your internal documents, it cannot see anything published after its cutoff, and when it lacks a fact it may produce a fluent but wrong answer. For many applications, this gap between what the model knows and what the task requires is the central obstacle.

RAG narrows that gap without retraining. Because relevant text is retrieved and supplied as context, the model can answer from authoritative, up-to-date material and can point to the sources it used. This makes responses easier to keep current and to verify, and updating an index is often cheaper and faster than fine-tuning a model whenever information changes.

How it works

A basic RAG pipeline runs in two phases. Step 1 prepares the data ahead of time, and steps 2 through 4 answer a query:

  1. Indexing splits source documents into chunks, converts each chunk into a vector embedding, and stores them in a vector index.
  2. Retrieval embeds the user’s query and finds the chunks whose embeddings are most similar, returning the closest matches.
  3. Augmentation inserts those retrieved chunks into the prompt alongside the question.
  4. Generation has the model produce an answer grounded in the supplied passages, ideally citing them.
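
The four steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a production design. It assumes a toy word-count "embedding" in place of a real embedding model, a plain list in place of a vector index, and a placeholder `generate` function in place of a model call. All function names and the sample documents are hypothetical.

```python
import math
import re
from collections import Counter

def chunk(text: str, max_words: int = 12) -> list[str]:
    """Naive fixed-size chunking: at most max_words words per chunk."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def embed(text: str) -> Counter:
    """Toy embedding (assumption): lowercase word counts, not a learned vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(documents: list[str]) -> list[tuple[str, Counter]]:
    """Step 1, indexing: chunk every document and store (chunk, embedding) pairs."""
    return [(c, embed(c)) for doc in documents for c in chunk(doc)]

def retrieve(index: list[tuple[str, Counter]], query: str, k: int = 2) -> list[str]:
    """Step 2, retrieval: embed the query and return the k most similar chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def augment(question: str, passages: list[str]) -> str:
    """Step 3, augmentation: place numbered passages in the prompt with the question."""
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer using only the passages below and cite them by number.\n\n"
        f"{numbered}\n\nQuestion: {question}"
    )

def generate(prompt: str) -> str:
    """Step 4, generation: placeholder standing in for a real model call."""
    return f"(model output for a prompt of {len(prompt)} characters)"

# Illustrative sample data, not taken from any real system.
DOCS = [
    "The refund window is 30 days from the delivery date.",
    "Support is available on weekdays from 9am to 5pm.",
]
QUESTION = "How long is the refund window?"

index = build_index(DOCS)
passages = retrieve(index, QUESTION, k=1)
prompt = augment(QUESTION, passages)
answer = generate(prompt)
```

With the sample data, the refund passage shares the most words with the question, so it is the chunk that ends up in the prompt. Swapping `embed` for a real embedding model changes the ranking quality but not the shape of the pipeline.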

The quality of the output depends heavily on retrieval. If the relevant passage is not found, the model cannot use it, so chunking, embedding choice, and ranking can matter as much as the model itself.

RAG vs. agentic RAG

Classic RAG is a fixed pipeline (retrieve once, then generate), and it works well when a single lookup answers the question. Agentic RAG puts an agent in control of retrieval instead. The agent decides whether retrieval is needed at all, reformulates the query, may search several times or across different sources, and judges whether the results are sufficient before answering. Here retrieval becomes a tool the agent calls within its reasoning loop, which suits complex questions that no single query can resolve.
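
The control flow of the agentic form can be sketched as a loop around a search tool. In a real system the model itself would decide whether to retrieve, how to rephrase, and when to stop. In this sketch those decisions are hand-written stand-in rules, and every name here (`agentic_retrieve`, `needs_retrieval`, `is_sufficient`, `reformulate`) and the two-sentence corpus are hypothetical.

```python
import re

# Illustrative corpus, made up for this example.
CORPUS = [
    "WidgetX is made by Acme Corp.",
    "Acme Corp is headquartered in Lyon.",
]

def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def search(query: str, k: int = 1) -> list[str]:
    """Toy search tool: rank passages by word overlap with the query."""
    q = tokens(query)
    scored = [(len(q & tokens(p)), p) for p in CORPUS]
    scored = [s for s in scored if s[0] > 0]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [p for _, p in scored[:k]]

# Stand-in policies (assumptions): a real agent would ask the model instead.
def needs_retrieval(question: str) -> bool:
    return True

def is_sufficient(question: str, evidence: list[str]) -> bool:
    return any("headquartered" in p for p in evidence)

def reformulate(question: str, evidence: list[str]) -> str:
    if any("Acme Corp" in p for p in evidence):
        return "Acme Corp headquartered"
    return question

def agentic_retrieve(question: str, max_steps: int = 3) -> dict:
    """Search repeatedly until the evidence looks sufficient or the budget runs out."""
    evidence: list[str] = []
    queries: list[str] = []
    if not needs_retrieval(question):
        return {"queries": queries, "evidence": evidence}
    query = question
    for _ in range(max_steps):
        queries.append(query)
        for passage in search(query):
            if passage not in evidence:
                evidence.append(passage)
        if is_sufficient(question, evidence):
            break
        query = reformulate(question, evidence)
    return {"queries": queries, "evidence": evidence}

QUESTION = "Who makes WidgetX and where is that company based?"
result = agentic_retrieve(QUESTION)
```

Here the first search finds only the passage about who makes WidgetX, so the loop issues a second, reformulated query to find the headquarters. A single fixed retrieval with `k=1` would have stopped after the first passage.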

In practice

In a durable, observable runtime, retrieval is exposed as a tool the model can call, and long-term knowledge lives in semantic memory that the agent queries when it becomes relevant. Each retrieval is recorded as a step, so what was fetched and what was used remain visible after the fact. RAG is one of the mechanisms behind an agent’s memory, and choosing which retrieved context to include is a context engineering decision. In the agentic form, the model invokes retrieval through ordinary tool use. See semantic memory and retrieval for the building blocks.
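
One way to picture "each retrieval is recorded as a step" is a thin wrapper around the search function that logs every call. This sketch does not reflect any particular runtime's API. `Step`, `RecordedRetriever`, and `fake_search` are hypothetical names, and the stored passage is illustrative data.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Step:
    """One recorded tool call: which tool ran, with what query, and what came back."""
    tool: str
    query: str
    results: list[str]

@dataclass
class RecordedRetriever:
    """Wraps a search function so every retrieval is appended to a step log."""
    search: Callable[[str], list[str]]
    steps: list[Step] = field(default_factory=list)

    def __call__(self, query: str) -> list[str]:
        results = self.search(query)
        # Copy the results so later mutation by the caller cannot alter the log.
        self.steps.append(Step(tool="retrieve", query=query, results=list(results)))
        return results

def fake_search(query: str) -> list[str]:
    """Stand-in for a semantic memory lookup (assumption): simple keyword match."""
    memory = {"refund": ["Refunds are accepted within 30 days."]}
    return [p for key, passages in memory.items() if key in query.lower() for p in passages]

retrieve = RecordedRetriever(fake_search)
retrieve("What is the refund policy?")
retrieve("What is the shipping cost?")

# After the run, the log shows what each call fetched, including empty results.
for step in retrieve.steps:
    print(step.tool, repr(step.query), step.results)
```

Recording empty results matters as much as recording hits, because an answer produced after a retrieval that returned nothing deserves closer scrutiny.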

Frequently asked questions

What is agentic RAG?

Agentic RAG is retrieval driven by an agent rather than a fixed pipeline. The agent decides whether to retrieve, how to phrase the query, whether to search again, and which sources to trust, treating retrieval as a tool it can use repeatedly instead of a single step.

What is the difference between RAG and fine-tuning?

Fine-tuning changes a model’s weights so knowledge is baked in at training time. RAG leaves the model unchanged and supplies knowledge at query time through the prompt, which makes it easier to keep information current and to cite sources.

Does RAG eliminate hallucinations?

No. RAG reduces them by grounding answers in retrieved evidence, but the model can still misread a passage, combine sources incorrectly, or answer from training data when retrieval returns nothing useful. It lowers the risk rather than removing it.
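
One partial mitigation for the "retrieval returns nothing useful" case is to check retrieval scores before building the prompt. The sketch below is one possible guard, not a standard recipe. The function name `build_grounded_prompt` is hypothetical, the `0.3` threshold is an arbitrary assumption that would need tuning, and the scored passages are illustrative.

```python
from typing import Optional

def build_grounded_prompt(
    question: str,
    scored_passages: list[tuple[float, str]],
    min_score: float = 0.3,  # arbitrary cutoff (assumption); tune per retriever
) -> Optional[str]:
    """Return a prompt built from passages scoring at least min_score, or None if none qualify."""
    useful = [text for score, text in scored_passages if score >= min_score]
    if not useful:
        # The caller can report that no supporting source was found
        # instead of letting the model answer from training data alone.
        return None
    context = "\n".join(f"[{i}] {text}" for i, text in enumerate(useful, start=1))
    return (
        "Answer only from the passages below. "
        "If they do not contain the answer, say so.\n\n"
        f"{context}\n\nQuestion: {question}"
    )

weak = build_grounded_prompt(
    "What is the refund window?", [(0.12, "Support hours are 9am to 5pm.")]
)
strong = build_grounded_prompt(
    "What is the refund window?", [(0.81, "The refund window is 30 days.")]
)
```

A guard like this addresses only one failure mode. The model can still misread or wrongly combine passages that do clear the threshold.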
