You have shipped a RAG system. It retrieves relevant chunks, injects them into a prompt, and the LLM generates a response. In the demo, it looks sharp. In production, it starts failing the moment a user asks anything that requires more than a single lookup: multi-part questions, comparative analysis across sources, follow-up queries that depend on what was just retrieved.
Agentic RAG is the architectural response to that failure mode. Instead of a fixed retrieve-then-generate pipeline, an autonomous agent sits at the center of the system — planning a retrieval strategy, routing queries to the right tools, evaluating its own outputs, and iterating until it has a grounded, coherent answer. The Anthropic research team's work on building effective agents and OpenAI's findings on agentic tool use both converge on the same conclusion: for complex knowledge tasks, agentic orchestration outperforms linear pipelines by a significant margin.
This guide is for ML engineers, AI architects, and technical decision-makers evaluating whether to build or buy an agentic RAG system in 2026. If you are new to RAG fundamentals, start with our RAG business guide first — this article assumes you already understand embeddings, vector retrieval, and the standard query pipeline.
TL;DR
- Traditional RAG fails on multi-step, ambiguous, or cross-source queries — the retrieval is static, the pipeline is one-shot
- Agentic RAG adds a reasoning loop: query decomposition, tool routing, self-evaluation, and iterative re-retrieval
- Five patterns cover 95% of enterprise use cases: router, ReAct, plan-and-execute, multi-agent retrieval, and self-RAG
- Evaluation is non-trivial — you need faithfulness, context precision, and answer relevance metrics, not just BLEU
- Cost and latency increase with agentic depth — most production systems use a hybrid: agentic orchestration for complex queries, standard RAG for simple ones
- LangGraph and LlamaIndex Workflows are the leading open-source orchestration layers; managed platforms like Heeya remove the orchestration burden entirely
Table of Contents
- Traditional RAG Limits in 2026
- What "Agentic" Adds: Planning, Multi-Step Retrieval, Self-Correction
- Architecture Deep-Dive: Query Decomposition, Tool Router, Reflection Loop
- Five Agentic RAG Patterns for Enterprise
- Traditional RAG vs Agentic RAG: Comparison Table
- Evaluation: How to Know Your Agentic RAG Actually Works
- Cost and Latency Tradeoffs
- Build with LangGraph/LlamaIndex vs Buy
- Heeya's Agentic RAG Approach
- Further Reading
- FAQ
Traditional RAG Limits in 2026
Standard RAG has a well-understood pipeline: embed the query, retrieve top-k chunks by cosine similarity (optionally fused with BM25), rerank, inject into a prompt, generate. For factoid questions with a clear answer in the corpus, it works well. For anything more complex, the architecture has four structural failure modes.
Failure mode 1: Single-pass retrieval on multi-part questions
When a user asks "What changed in our refund policy since the Q3 2025 update, and how does it compare to what our competitor announced last month?", a single retrieval pass cannot adequately cover both sub-questions. The retrieval vector is a blend of all concepts in the query, which means it is not optimally aligned with any of them. The LLM receives a context window that partially addresses each part and generates a response that is incomplete at best, confabulated at worst.
Failure mode 2: No mechanism to detect retrieval failure
Standard RAG has no feedback loop. If the retrieved chunks are irrelevant — because the query was ambiguous, the corpus doesn't cover the topic, or the embedding model underperformed — the LLM still generates a response. It just does so with bad context. There is no step that asks: "did I actually find what I needed?"
Failure mode 3: Inability to route across heterogeneous sources
Enterprise knowledge is rarely in one collection. An internal assistant might need to consult an HR policy database, a product specification index, a CRM, and a SQL database — all for a single query. Standard RAG retrieves from one vector store. Choosing which source to query requires a layer of reasoning that the pipeline does not have.
Failure mode 4: Stateless retrieval in multi-turn conversations
When a user asks a follow-up like "What about for enterprise customers?" after a complex exchange, standard RAG must either embed the full conversation history (which degrades precision) or lose context. Neither approach handles multi-turn retrieval elegantly. Benchmarks from the MTEB leaderboard (2025 Q4 edition) show that retrieval recall drops by 18–34% on multi-turn queries compared to single-turn queries in standard RAG pipelines.
What "Agentic" Adds: Planning, Multi-Step Retrieval, Self-Correction
Agentic RAG replaces the static pipeline with a reasoning loop. The LLM is no longer just a generator — it becomes an orchestrator that decides what to retrieve, evaluates what it got, and determines whether to generate a response or try again.
Anthropic's engineering blog describes this as the difference between a "tool user" and a "planner" — a system that can decompose a goal into subtasks, execute them in sequence or parallel, and adapt based on intermediate results. OpenAI's Assistants API formalizes the same pattern through the run/step model, where each step can invoke a retrieval tool, a code interpreter, or a function call, with the model deciding the sequence.
The four agentic capabilities
1. Query decomposition and planning
Before touching the vector store, the agent analyzes the incoming query and decomposes it into atomic sub-questions. "How has our enterprise pricing changed since 2024, and what is the ROI calculation for a 500-seat deployment?" becomes two independent retrieval tasks with a synthesis step. This is the foundational capability — without it, the other three have nothing to orchestrate.
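As a concrete illustration, here is a minimal decomposition step, assuming an OpenAI-compatible chat client and JSON-mode output; the prompt wording and schema are illustrative, not a canonical implementation:

```python
import json
from openai import OpenAI  # any chat-completions-compatible client works

client = OpenAI()

DECOMPOSE_PROMPT = (
    "Break the user's question into the smallest set of independent "
    'sub-questions. Return JSON: {"sub_questions": ["..."]}'
)

def decompose(query: str) -> list[str]:
    """One LLM call that turns a compound query into atomic retrieval tasks."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any model with JSON mode works here
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": DECOMPOSE_PROMPT},
            {"role": "user", "content": query},
        ],
    )
    return json.loads(resp.choices[0].message.content)["sub_questions"]
```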
2. Dynamic tool routing
The agent selects the right retrieval tool for each sub-question. Semantic vector search for open-ended conceptual questions. BM25 keyword search for product codes, reference numbers, or named entities. SQL queries for structured data. Web search for real-time information outside the corpus. Claude's tool use API and OpenAI's function calling both support this pattern natively — the agent declares the available tools and the model decides which to invoke.
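In practice, routing is often delegated to the model itself via tool declarations. A sketch using OpenAI-style function calling, reusing the client from the sketch above; the tool names and schemas are assumptions you would adapt to your own backends:

```python
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "vector_search",
            "description": "Semantic search for open-ended conceptual questions.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "bm25_search",
            "description": "Keyword search for product codes, IDs, named entities.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    },
]

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Find the spec sheet for SKU-4471"}],
    tools=TOOLS,
)
# The model either answers directly or requests one or more tool calls.
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```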
3. Self-evaluation and reflection
After retrieval, the agent scores the retrieved context for relevance before passing it to the generation step. If the relevance score is below threshold, it reformulates the query and tries again. This is the mechanism described in the Self-RAG paper (Asai et al., 2023) — a "critique" token that evaluates retrieval quality inline. In LangGraph implementations, this step is typically a separate node in the graph with its own prompt.
4. Multi-agent collaboration
For the highest-complexity queries, specialized sub-agents handle different retrieval domains or reasoning tasks. A coordinator agent decomposes the query, dispatches to domain experts (a "legal documents" agent, a "financial data" agent, a "product specs" agent), and synthesizes the results. The LangGraph multi-agent pattern and LlamaIndex's agent workflows both implement this as a graph of agents with explicit message-passing interfaces.
Architecture Deep-Dive: Query Decomposition, Tool Router, Reflection Loop
Here is a concrete implementation sketch of a production agentic RAG pipeline using LangGraph-style graph notation. This is the pattern used by teams building on LangChain/LangGraph — the same concepts apply in LlamaIndex Workflows and in frameworks that use Claude or GPT-4 as the orchestrating model.
```python
# Agentic RAG: core graph structure (LangGraph-style pseudocode).
# Helpers (llm, tool_router, relevance_scorer, build_context, the prompts,
# THRESHOLD, MAX_ITER) are stand-ins for your own components.
from langgraph.graph import StateGraph, START, END

# Nodes
def decompose_query(state):
    """LLM call: break complex query into sub-questions."""
    sub_questions = llm.invoke(decompose_prompt + state["query"])
    return {"sub_questions": sub_questions, "iteration": 0}

def route_and_retrieve(state):
    """For each sub-question, select tool and retrieve."""
    results = []
    for q in state["sub_questions"]:
        tool = tool_router.select(q)  # "vector_search" | "bm25" | "sql" | "web"
        chunks = tool.retrieve(q, top_k=5)
        results.append({"question": q, "chunks": chunks})
    return {"retrieval_results": results}

def evaluate_retrieval(state):
    """Score each result set. Flag low-confidence retrievals."""
    evaluations = []
    for r in state["retrieval_results"]:
        score = relevance_scorer.score(r["question"], r["chunks"])
        evaluations.append({**r, "score": score, "retry": score < THRESHOLD})
    return {"evaluations": evaluations, "iteration": state["iteration"] + 1}

def rewrite_and_retrieve(state):
    """Reformulate the flagged sub-questions, then retrieve again."""
    failed = [e["question"] for e in state["evaluations"] if e["retry"]]
    rewritten = [llm.invoke(rewrite_prompt + q) for q in failed]
    return route_and_retrieve({**state, "sub_questions": rewritten})

def conditional_retry(state):
    """Route: retry failed retrievals (up to MAX_ITER), else synthesize."""
    needs_retry = any(e["retry"] for e in state["evaluations"])
    if needs_retry and state["iteration"] < MAX_ITER:
        return "rewrite_and_retrieve"  # reformulate and loop back
    return "synthesize"

def synthesize(state):
    """Assemble context from all evaluations and generate final response."""
    context = build_context(state["evaluations"])
    response = llm.invoke(rag_prompt + context + state["query"])
    return {"response": response}

# Graph wiring
graph = StateGraph(dict)
for name, fn in [("decompose_query", decompose_query),
                 ("route_and_retrieve", route_and_retrieve),
                 ("evaluate_retrieval", evaluate_retrieval),
                 ("rewrite_and_retrieve", rewrite_and_retrieve),
                 ("synthesize", synthesize)]:
    graph.add_node(name, fn)

graph.add_edge(START, "decompose_query")
graph.add_edge("decompose_query", "route_and_retrieve")
graph.add_edge("route_and_retrieve", "evaluate_retrieval")
graph.add_conditional_edges("evaluate_retrieval", conditional_retry)
graph.add_edge("rewrite_and_retrieve", "evaluate_retrieval")  # the retry loop
graph.add_edge("synthesize", END)
app = graph.compile()
```
The key difference from standard RAG is the conditional edge after evaluate_retrieval. This is what makes the system agentic: it can loop. The loop is bounded by MAX_ITER (typically 2–3 in production) to prevent runaway latency. Anthropic's documentation on tool use with Claude covers the mechanics of how the model orchestrates these tool calls within a single API interaction.
Five Agentic RAG Patterns for Enterprise
Not every use case needs a full multi-agent reflection loop. Here are the five patterns in order of complexity, with guidance on when each is appropriate.
Pattern 1: Router Agent
A single agent analyzes the incoming query and routes it to the appropriate data source or retrieval tool — without decomposition or iteration. The simplest agentic upgrade over standard RAG. Appropriate when you have multiple well-defined knowledge domains (e.g., "HR policy," "product specs," "legal") and the query can be classified into one of them.
Implementation: a classification call (LLM or fine-tuned classifier) + source dispatch. LlamaIndex's RouterQueryEngine implements this pattern directly. Response latency overhead vs. standard RAG: ~200–400ms for the classification step.
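A minimal router in LlamaIndex might look like the following; the data directories and tool descriptions are illustrative:

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector
from llama_index.core.tools import QueryEngineTool

# One index per knowledge silo (paths are placeholders).
hr_index = VectorStoreIndex.from_documents(
    SimpleDirectoryReader("data/hr").load_data())
specs_index = VectorStoreIndex.from_documents(
    SimpleDirectoryReader("data/specs").load_data())

tools = [
    QueryEngineTool.from_defaults(
        query_engine=hr_index.as_query_engine(),
        description="HR policy: leave, benefits, conduct, onboarding.",
    ),
    QueryEngineTool.from_defaults(
        query_engine=specs_index.as_query_engine(),
        description="Product specifications and technical documentation.",
    ),
]

router = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),  # LLM classifies the query
    query_engine_tools=tools,
)
print(router.query("How many days of parental leave do we offer?"))
```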
Pattern 2: ReAct (Reason + Act)
The agent interleaves reasoning traces ("I need to find the 2024 pricing table, then check whether any exceptions were documented") with tool invocations in a single generation loop. Popularized by Yao et al. (2022) and now the default agent pattern in LangChain's AgentExecutor and Claude's extended thinking mode. Good for open-ended queries where the retrieval strategy cannot be determined upfront.
When to use: research-style queries, troubleshooting workflows, competitive analysis. Higher latency than the router pattern (multiple LLM calls per response) but significantly better on ambiguous queries.
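For intuition, here is a framework-free sketch of the ReAct loop, using the same OpenAI-style tool calling as the earlier sketches; execute_tool is a hypothetical dispatcher over your retrieval backends:

```python
def react_answer(question: str, max_steps: int = 5) -> str:
    """Interleave model reasoning with tool execution until it can answer."""
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        resp = client.chat.completions.create(
            model="gpt-4o", messages=messages, tools=TOOLS
        )
        msg = resp.choices[0].message
        if not msg.tool_calls:       # model decided it can answer directly
            return msg.content
        messages.append(msg)         # keep the reasoning / tool-call turn
        for call in msg.tool_calls:  # act: run each requested tool
            result = execute_tool(call.function.name, call.function.arguments)
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": result,
            })
    return "Could not converge within the step budget."
```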
Pattern 3: Plan-and-Execute
A planner LLM generates a full retrieval plan before any tool is invoked. An executor then carries out each step independently, potentially in parallel. The planner and executor can be different models (a larger model for planning, a smaller one for execution) to manage cost. This is the pattern used in OpenAI's Assistants API with parallel tool calls enabled.
When to use: structured analytical workflows with a predictable number of steps (e.g., "collect data from sources A, B, C, then summarize"). Lower variance in latency than ReAct because the plan is fixed upfront.
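A compressed sketch of the planner/executor split, reusing the client and execute_tool names from the sketches above; the plan schema is an assumption:

```python
import json
from concurrent.futures import ThreadPoolExecutor

PLANNER_PROMPT = (
    'Produce a retrieval plan as JSON: '
    '{"steps": [{"id": 1, "tool": "vector_search", "query": "..."}]}'
)

def plan(query: str) -> list[dict]:
    """Larger model emits the full plan before any tool is invoked."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[{"role": "system", "content": PLANNER_PROMPT},
                  {"role": "user", "content": query}],
    )
    return json.loads(resp.choices[0].message.content)["steps"]

def execute(steps: list[dict]) -> list[str]:
    """Independent steps run in parallel; a cheaper model/tool does the work."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(
            lambda s: execute_tool(s["tool"], s["query"]), steps))
```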
Pattern 4: Multi-Agent Retrieval
A coordinator agent dispatches sub-queries to specialized retrieval agents, each maintaining its own vector collection and tool set. Results are returned to the coordinator for synthesis. This is what LangGraph's multi-agent swarm pattern implements, and what Anthropic describes as the "subagent" architecture in their agents research.
When to use: enterprise deployments with heterogeneous knowledge silos — separate vector stores for different business units, languages, or document types. The overhead of inter-agent communication is justified when no single agent can competently handle all retrieval domains. HR deployments, for example, often require separate agents for policy Q&A, leave management, and recruitment workflows — see how this applies in practice in our guide on AI chatbot for recruitment and CV screening.
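Stripped to its essentials, the coordinator is a dispatch-and-synthesize loop. In this sketch, legal_agent, finance_agent, product_agent, classify_domain, and synthesize_answer are all hypothetical components standing in for your domain agents and synthesis call; decompose is the earlier sketch:

```python
DOMAIN_AGENTS = {
    "legal": legal_agent,        # e.g., ReAct agent over the contracts store
    "financial": finance_agent,  # e.g., SQL + vector hybrid agent
    "product": product_agent,
}

def coordinate(query: str) -> str:
    """Decompose, dispatch each sub-question to a domain agent, synthesize."""
    partials = []
    for q in decompose(query):
        domain = classify_domain(q, DOMAIN_AGENTS)  # LLM or classifier dispatch
        partials.append(DOMAIN_AGENTS[domain](q))
    return synthesize_answer(query, partials)       # final LLM synthesis call
```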
Pattern 5: Self-RAG
The model generates retrieval decisions inline using special control tokens or structured outputs: whether to retrieve at all, whether the retrieved passages are relevant, whether the final response is supported by the evidence. Based on the Self-RAG paper (Asai et al., 2023), this pattern reduces hallucinations significantly on knowledge-intensive tasks — the KILT benchmark showed a 13-point improvement in faithfulness over standard RAG.
When to use: high-stakes response contexts where faithfulness is critical (legal, medical, financial). Requires either a fine-tuned model with self-RAG training or careful prompt engineering to elicit self-critique behavior from a general model.
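Without a fine-tuned model, the self-critique can be approximated with a structured prompt. A sketch, reusing the client and json imports from earlier; the schema and prompt are assumptions, not the paper's actual critique tokens:

```python
CRITIQUE_PROMPT = (
    "Given a question, retrieved passages, and a draft answer, return JSON: "
    '{"passages_relevant": true, "answer_supported": true, '
    '"unsupported_claims": ["..."]}'
)

def self_check(question: str, passages: str, draft: str) -> dict:
    """Inline critique: is the retrieval relevant, is the draft grounded?"""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": CRITIQUE_PROMPT},
            {"role": "user",
             "content": f"Q: {question}\nPassages: {passages}\nDraft: {draft}"},
        ],
    )
    verdict = json.loads(resp.choices[0].message.content)
    return verdict  # retry retrieval or regenerate if either flag is false
```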
Pattern selection guide
| Pattern | Query complexity | Sources | Latency overhead | Best frameworks |
|---|---|---|---|---|
| Router | Low — classifiable | 2–5 distinct silos | +200–400ms | LlamaIndex RouterQueryEngine |
| ReAct | Medium — open-ended | Any | +1–3s per loop | LangChain AgentExecutor, Claude tool use |
| Plan-and-Execute | Medium — structured | 3–10, parallelizable | +2–5s total | LangGraph, OpenAI Assistants API |
| Multi-Agent | High — cross-domain | Many heterogeneous | +5–15s | LangGraph swarm, LlamaIndex Workflows |
| Self-RAG | Any — high-stakes | Single or multiple | +30–60% per call | Custom / fine-tuned models |
Traditional RAG vs Agentic RAG: Comparison Table
| Dimension | Traditional RAG | Agentic RAG |
|---|---|---|
| Pipeline structure | Fixed: retrieve → rerank → generate | Dynamic graph with conditional loops |
| Multi-part queries | Single retrieval pass, incomplete coverage | Decomposed into sub-queries, each retrieved independently |
| Retrieval failure handling | None — generates from bad context | Self-evaluation triggers reformulation and retry |
| Source routing | Single vector store | Dynamic routing across vector, BM25, SQL, web |
| Multi-turn accuracy | Degrades with context length | Stateful — maintains retrieval context across turns |
| Latency | Low — single LLM call | Higher — multiple calls, bounded by MAX_ITER |
| Cost per query | Predictable and low | Higher, varies with query complexity |
| Implementation complexity | Low — standard library support | High — requires orchestration framework |
| Best for | Factoid Q&A, simple support queries | Research, analysis, multi-domain enterprise queries |
Evaluation: How to Know Your Agentic RAG Actually Works
Standard RAG evaluation is already non-trivial. Agentic RAG evaluation is harder because you now have multiple intermediate steps that can fail independently, and the overall answer quality is the product of every step's performance.
The RAGAS framework
RAGAS (Retrieval-Augmented Generation Assessment) is the most widely adopted evaluation framework for RAG systems in 2026. Its four core metrics map directly onto agentic RAG failure modes; a minimal scoring sketch follows the list:
- Faithfulness — Is every factual claim in the generated response supported by the retrieved context? Catches the case where the LLM adds information not present in the chunks.
- Answer relevance — Does the response actually answer the original question? High faithfulness with low relevance means the agent retrieved accurate but off-topic content.
- Context precision — What fraction of the retrieved chunks were actually relevant to the query? Low precision means your router or retrieval step is returning noise.
- Context recall — Was all the information needed to answer the question present in the retrieved context? Low recall means the agent missed relevant documents.
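A minimal RAGAS run over a single example row. Field names follow the classic RAGAS dataset schema; exact APIs vary across RAGAS versions, and the example data is invented for illustration:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (answer_relevancy, context_precision,
                           context_recall, faithfulness)

# One evaluation row per query; "contexts" is the list of retrieved chunks.
rows = {
    "question": ["What changed in the Q3 2025 refund policy?"],
    "answer": ["The refund window was extended from 14 to 30 days."],
    "contexts": [["Effective Q3 2025, the refund window is 30 days..."]],
    "ground_truth": ["The refund window was extended to 30 days."],
}
result = evaluate(
    Dataset.from_dict(rows),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # per-metric scores to track over time
```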
Agentic-specific metrics
Beyond RAGAS, agentic systems need additional metrics at the orchestration layer:
- Tool selection accuracy — Did the agent invoke the correct tool(s) for each sub-question? Measure with a gold-standard test set of query-tool pairs.
- Decomposition quality — Did the planner correctly identify all sub-questions needed to answer the original query? Evaluate by manually labeling whether each decomposition covers the full answer scope.
- Loop termination rate — How often does the reflection loop reach MAX_ITER without converging? A high rate signals that either your retrieval coverage is insufficient or your relevance threshold is miscalibrated.
- End-to-end latency distribution — Track P50, P90, and P99 latency, not just the mean. Agentic systems have heavy tails — the P99 can be 5–10x the P50 for complex queries.
Testing strategy
Build a golden dataset of 100–200 queries covering your query complexity distribution: simple factoid, multi-part, cross-source, and adversarial (queries designed to trigger retrieval failure). Run your agentic pipeline against this dataset weekly and track metric trends over time — not just absolute values. A drop in context recall after a knowledge base update, for example, immediately surfaces a chunking or embedding regression.
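A sketch of that weekly harness, building on the RAGAS call above; pipeline, the JSONL field names, and the file paths are all assumptions:

```python
import datetime
import json

def weekly_eval(pipeline, golden_path: str = "golden_queries.jsonl") -> None:
    """Run the agentic pipeline over the golden set and log metric trends."""
    rows = {"question": [], "answer": [], "contexts": [], "ground_truth": []}
    with open(golden_path) as f:
        for line in f:
            case = json.loads(line)        # {"query": ..., "expected": ...}
            out = pipeline(case["query"])  # your agentic RAG entrypoint
            rows["question"].append(case["query"])
            rows["answer"].append(out["response"])
            rows["contexts"].append(out["chunks"])
            rows["ground_truth"].append(case["expected"])
    scores = evaluate(Dataset.from_dict(rows), metrics=[
        faithfulness, answer_relevancy, context_precision, context_recall])
    entry = {"date": str(datetime.date.today()), "scores": repr(scores)}
    with open("eval_trend.jsonl", "a") as f:  # track the trend, not a snapshot
        f.write(json.dumps(entry) + "\n")
```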
Cost and Latency Tradeoffs
Agentic RAG's power comes with a real cost. Understanding the tradeoff surface helps you make intelligent architectural decisions rather than over-engineering or under-engineering.
Where the cost comes from
Each agentic step that involves an LLM call adds input and output token cost. In a ReAct loop with three iterations, you might use 4–6 LLM calls for a single user query — versus one call in standard RAG. At GPT-4o pricing ($2.50/1M input tokens) or Claude Sonnet pricing ($3/1M input), this adds up quickly at scale. For a system handling 100,000 queries per month, moving from standard RAG to a three-step ReAct agent can increase LLM cost by 3–5x.
Practical mitigation strategies
- Complexity routing: run a fast classifier (fine-tuned small model or embedding similarity against query complexity labels) to route simple queries to standard RAG and only invoke the agentic pipeline for queries above a complexity threshold; a minimal sketch follows this list. This alone typically reduces agentic path usage to 20–30% of queries.
- Smaller models for orchestration: use a lightweight model (GPT-4o mini, Claude Haiku) for the planning and evaluation steps, and reserve the larger model only for the final synthesis step. Anthropic's benchmarks show Haiku performs within 5% of Sonnet on structured reasoning tasks like query decomposition when given a well-designed prompt.
- Caching decomposition results: many enterprise queries are structurally similar. Cache the decomposed sub-question structure for query patterns, using semantic similarity to match incoming queries to cached plans. LangChain's semantic cache and LlamaIndex's cache layer support this.
- Bounded iteration: set MAX_ITER to 2–3, not open-ended. Most of the quality gain from reflection comes in the first retry — subsequent iterations have diminishing returns and linear cost increases.
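A minimal version of the first strategy, complexity routing, using embedding similarity against two labeled centroids; the centroids, threshold, and embed function are assumptions you would calibrate on your own traffic:

```python
import numpy as np

def route_query(query: str, embed, centroids: dict,
                threshold: float = 0.55) -> str:
    """Send the query to the agentic path only when it looks complex."""
    v = embed(query)  # any embedding function returning a 1-D vector

    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    complex_sim = cos(v, centroids["complex"])
    if complex_sim > max(cos(v, centroids["simple"]), threshold):
        return "agentic_path"
    return "standard_rag_path"
```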
Latency expectations
In production systems with complexity routing, you should target:
- Simple queries (standard RAG path): P50 < 800ms, P99 < 2s
- Complex queries (agentic path, 2 iterations): P50 < 4s, P99 < 12s
- Multi-agent queries: P50 < 8s — communicate wait state to the user via streaming
Streaming partial responses (where available in your LLM provider's API) dramatically improves perceived latency even when wall-clock time is high.
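A minimal streaming sketch with the Anthropic Python SDK (the OpenAI SDK has an equivalent stream=True mode); the model id is a placeholder, and final_rag_prompt stands for the assembled synthesis prompt:

```python
from anthropic import Anthropic

client = Anthropic()
# Stream tokens to the user as they are generated; perceived latency drops
# even though total wall-clock time is unchanged.
with client.messages.stream(
    model="claude-sonnet-4-5",  # placeholder: use your provider's current id
    max_tokens=1024,
    messages=[{"role": "user", "content": final_rag_prompt}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
```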
Build with LangGraph/LlamaIndex vs Buy
This is the question that drives the most architecture reviews. There is no universally correct answer — but the decision criteria are clear.
Build with LangGraph or LlamaIndex when:
- You have specific retrieval logic that no managed platform supports (custom embedding models, proprietary data formats, on-premise vector stores)
- Your query patterns are highly specialized and benefit from custom orchestration logic that cannot be expressed in a general framework
- You have ML engineers with LangChain/LlamaIndex experience and the bandwidth to maintain the orchestration layer
- You need full control over data handling for regulatory reasons (SOC 2, FedRAMP, healthcare data that cannot leave your VPC)
LangGraph is the more mature choice for stateful multi-agent workflows — its graph abstraction maps cleanly onto the conditional-edge patterns described in this article. LlamaIndex Workflows are better suited for document-centric pipelines where the ingestion and query layers are tightly coupled. Both have active development communities and extensive documentation as of 2026.
The operational cost is real: a production agentic RAG system built in-house requires maintenance of the orchestration logic, the embedding pipeline, the vector store, the evaluation framework, and the monitoring stack. Engineering teams consistently underestimate this by a factor of 2–3 in initial estimates.
Buy (managed platform) when:
- Your use case fits the platform's target model (customer support, internal knowledge base, lead qualification)
- Your team does not have ML engineers and needs a system operational in days, not months
- GDPR compliance and EU data residency are requirements that you want handled at the infrastructure level, not added on top — see our EU AI Act chatbot compliance guide for the specific obligations this creates for customer-facing AI deployments
- You want to iterate on knowledge base content — adding, removing, and updating documents — without triggering re-indexing pipeline changes
For a detailed analysis of when to build vs buy for the broader agentic AI category, see our guide on agentic AI in enterprise.
Heeya's Agentic RAG Approach
Heeya implements the router agent pattern by default — the simplest agentic upgrade over standard RAG, and the one that covers the vast majority of enterprise use cases without introducing unnecessary latency or complexity.
When you create an agent on Heeya and upload your knowledge base (PDFs, DOCX files, website URLs, or scraped content), the platform handles the full indexing pipeline: document parsing, semantic chunking, embedding with production-grade models, and storage in an isolated Qdrant collection. Each agent has its own collection — a hard isolation boundary for multi-tenant deployments.
At query time, the system runs hybrid retrieval (dense semantic + BM25 sparse), reranks with a cross-encoder, and injects the top-k passages into the generation prompt. For multi-turn conversations, a query rewriting step reformulates the user's message into a standalone question before retrieval — solving the multi-turn degradation problem described in the traditional RAG failure modes above.
The agentic layer adds tool routing: the agent can invoke a vector search tool for open-ended questions, a form-capture tool when a user signals interest in being contacted, and (on enterprise plans) custom function calls for CRM or calendar integrations. The orchestration is managed — you configure the agent's behavior through system guidance, not by writing graph code. For guidance on writing system prompts that reliably shape agentic behavior, see our system prompt engineering guide for AI agents.
For teams that need the full custom orchestration layer, Heeya is not the right fit — you should build with LangGraph or LlamaIndex. For teams that need a production-ready agentic RAG system with EU data residency, GDPR-compliant infrastructure, and no ML team required, see the Heeya agent platform or review pricing.
More on how Heeya handles the full RAG stack — from document ingestion to conversational retrieval — in our guide on RAG for customer service.
Further Reading
- What Is RAG? A Business Guide — foundational explainer on Retrieval-Augmented Generation before going agentic
- RAG for Customer Service in 2026 — practical applications of RAG and agentic patterns in support contexts
- Agentic AI and Autonomous Agents in Enterprise 2026 — broader landscape of agentic systems beyond RAG
- AI Agent vs Chatbot: Key Differences in 2026 — when an agent is meaningfully different from a rule-based chatbot
- ChatGPT vs Custom RAG Chatbot — why generic LLMs fail where grounded RAG systems succeed
- System Prompt Engineering for AI Agents 2026 — how to write system guidance that shapes agentic behavior reliably
- Best AI Chatbot Platforms in 2026 — comparison of managed platforms with agentic RAG capabilities
FAQ
What is agentic RAG?
Agentic RAG is an evolution of standard Retrieval-Augmented Generation where an autonomous LLM agent orchestrates the retrieval process rather than following a fixed pipeline. The agent decomposes complex queries into sub-questions, routes each to the appropriate retrieval tool, evaluates the quality of what it retrieved, and iterates with reformulated queries if the initial retrieval was insufficient. This makes it significantly more capable than standard RAG on multi-part, ambiguous, or cross-source queries. For a foundational overview of RAG before the agentic layer, see our RAG business guide.
What is the difference between traditional RAG and agentic RAG?
Traditional RAG follows a linear pipeline: embed query, retrieve top-k chunks, rerank, generate — one shot, no recovery from poor retrieval. Agentic RAG replaces this with a dynamic graph: plan, route, retrieve, evaluate, and iterate. The agent loops back to reformulate if retrieval quality is low. The result is higher accuracy on complex queries at the cost of increased latency and token cost. The comparison table above covers all eight dimensions in detail.
How do you implement agentic RAG with LangGraph?
In LangGraph, agentic RAG is implemented as a stateful graph where each node handles one pipeline step: query decomposition, tool routing and retrieval, relevance evaluation, and synthesis. Conditional edges connect the evaluation node back to retrieval (the retry loop) or forward to synthesis depending on the relevance score. The pseudocode in the architecture section of this article illustrates the pattern. See the LangGraph documentation and the LangChain cookbook for runnable examples.
What are the main agentic RAG patterns?
Five patterns cover 95% of enterprise use cases: Router (classify and route to the right source), ReAct (interleave reasoning and tool calls in a loop), Plan-and-Execute (plan the full retrieval strategy before executing), Multi-Agent Retrieval (specialized agents per domain, coordinated by an orchestrator), and Self-RAG (inline self-evaluation of retrieval and generation quality). Most production systems start with the router pattern and add complexity only where simpler patterns demonstrably fail. The pattern selection table in this article maps each pattern to query complexity, source count, and recommended frameworks.
How do you evaluate an agentic RAG system?
Use the RAGAS framework for core metrics: faithfulness, answer relevance, context precision, and context recall. Add agentic-specific metrics: tool selection accuracy, decomposition quality, loop termination rate, and P50/P90/P99 latency. Build a golden dataset of 100–200 queries across your complexity distribution and run evaluation weekly. A drop in context recall after a knowledge base update immediately surfaces a chunking or embedding regression — that is the operational signal you want to catch early.
Is agentic RAG suitable for small teams without ML engineers?
Building agentic RAG from scratch with LangGraph or LlamaIndex requires ML engineering experience and ongoing maintenance. Managed platforms like Heeya implement agentic patterns — query rewriting, hybrid retrieval, tool routing — under the hood, exposed through a no-code interface. For most SMB and mid-market use cases, a managed platform delivers the practical benefits without the infrastructure overhead. Custom builds are justified when your use case requires proprietary retrieval logic, on-premise deployment, or orchestration complexity beyond what any managed platform supports.
Written by Anas Rabhi.
Deploy agentic RAG without building the infrastructure
Heeya gives your team a production-ready agentic RAG agent — hybrid retrieval, query rewriting, tool routing, EU data residency — without writing orchestration code. Flat monthly pricing, no ML team required.