Building Production-Ready Agentic RAG: From Vector Search to Graph-Aware Retrieval
Retrieval-Augmented Generation (RAG) has evolved from a 2020 research paper into the backbone of enterprise AI systems. But naive "embed + retrieve + generate" pipelines fail in production. This tutorial walks you through building a production-grade Agentic RAG system that combines hybrid search, graph-based retrieval, and autonomous agent reasoning.
Why Naive RAG Fails in Production
The basic RAG pipeline — chunk documents, embed them, run cosine similarity, feed top-k to an LLM — works for demos. In production, you hit these walls:
- Irrelevant retrieval: The top-3 most similar chunks might all be wrong for the actual question, because embedding similarity is not the same as relevance.
- Lost context: Related information spans multiple chunks that never surface together.
- No reasoning over structure: Vector search doesn't understand relationships like "this function calls that function."
- Stale knowledge: Your vector index was built last month; the docs changed yesterday.
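Let's fix these. The steps below focus on the first three; stale knowledge is an indexing problem, handled by re-embedding documents whenever they change.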
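For the stale-knowledge item, here is a minimal sketch of change detection, assuming you keep a record of the content hash of every document at indexing time. The names `content_hash` and `plan_reindex` are hypothetical helpers, not part of any library; the actual re-embedding and upsert calls depend on your vector store.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(current_docs: dict[str, str], indexed_hashes: dict[str, str]):
    """Compare current documents against the hashes stored at indexing time.

    current_docs:   path -> current text
    indexed_hashes: path -> hash recorded when the path was last indexed
    Returns (paths to re-embed and upsert, paths to delete from the index).
    """
    to_upsert = sorted(
        path for path, text in current_docs.items()
        if indexed_hashes.get(path) != content_hash(text)
    )
    to_delete = sorted(path for path in indexed_hashes if path not in current_docs)
    return to_upsert, to_delete


# Hypothetical state: a.md changed, b.md is unchanged, c.md is new, gone.md was removed
indexed = {
    "a.md": content_hash("old text"),
    "b.md": content_hash("same text"),
    "gone.md": content_hash("removed"),
}
current = {"a.md": "new text", "b.md": "same text", "c.md": "fresh text"}
to_upsert, to_delete = plan_reindex(current, indexed)
```

Only the changed and new paths need re-embedding, which keeps refreshes cheap enough to run on every docs commit instead of monthly.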
Step 1: Hybrid Search — BM25 + Dense Vectors
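Dense-only retrieval often misses keyword-heavy queries such as exact identifiers or error codes. Hybrid search runs lexical (BM25-style) and semantic (dense vector) retrieval side by side and fuses the two rankings, here with a weighted Reciprocal Rank Fusion (RRF).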
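from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, TextIndexParams,
    Filter, FieldCondition, MatchText,
)
import openai

client = QdrantClient(host="localhost", port=6333)

# Create a collection for dense vectors (1536 dimensions matches text-embedding-3-small)
client.create_collection(
    collection_name="code_docs",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

# Full-text index on the payload. This enables keyword matching in filters;
# it does not compute BM25 scores.
client.create_payload_index(
    collection_name="code_docs",
    field_name="content",
    field_schema=TextIndexParams(type="text", tokenizer="word", min_token_len=2, max_token_len=15),
)

def rerank_rrf(dense, sparse, alpha: float, top_k: int, k: int = 60) -> list[dict]:
    """Weighted Reciprocal Rank Fusion over two ranked lists of points."""
    scores, payloads = {}, {}
    for weight, points in ((alpha, dense), (1 - alpha, sparse)):
        for rank, point in enumerate(points, start=1):
            scores[point.id] = scores.get(point.id, 0.0) + weight / (k + rank)
            payloads[point.id] = point.payload
    best = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return [payloads[point_id] for point_id in best]

def hybrid_search(query: str, top_k: int = 5) -> list[dict]:
    # Dense vector search
    query_embedding = openai.embeddings.create(
        model="text-embedding-3-small", input=query
    ).data[0].embedding
    dense_results = client.query_points(
        collection_name="code_docs",
        query=query_embedding,
        limit=top_k * 2,  # Over-fetch so fusion has candidates to choose from
    ).points
    # Lexical candidates: the same query restricted to chunks whose text matches the query.
    # This approximates the BM25 side; true BM25 ranking needs a sparse-vector field or an external index.
    sparse_results = client.query_points(
        collection_name="code_docs",
        query=query_embedding,
        query_filter=Filter(must=[FieldCondition(key="content", match=MatchText(text=query))]),
        limit=top_k * 2,
    ).points
    # Weighted Reciprocal Rank Fusion (RRF)
    alpha = 0.7  # Dense vs. lexical weight; hand-set here, tune it on a validation set
    return rerank_rrf(dense_results, sparse_results, alpha, top_k)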
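To make the fusion step concrete, here is a self-contained sketch of weighted RRF on plain document IDs. The function name `weighted_rrf`, the constant `k=60`, and the toy rankings are assumptions for illustration; this is one common way to weight RRF, not necessarily the scoring your vector store applies.

```python
from collections import defaultdict


def weighted_rrf(dense_ids, sparse_ids, alpha=0.7, k=60):
    """Fuse two ranked ID lists.

    Each list contributes weight / (k + rank) per document, where the dense
    list gets weight alpha and the lexical list gets 1 - alpha.
    """
    scores = defaultdict(float)
    for weight, ranking in ((alpha, dense_ids), (1.0 - alpha, sparse_ids)):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)


dense = ["a", "b", "c"]   # ranking from vector search
sparse = ["c", "a", "d"]  # ranking from keyword search
fused = weighted_rrf(dense, sparse)
# fused == ["a", "c", "b", "d"]: documents found by both lists rise to the top
```

Because RRF only uses ranks, it sidesteps the problem that cosine similarities and lexical scores live on different scales.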
Step 2: Semantic Chunking — Beyond Fixed-Size Splits
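Fixed-size chunking (e.g., 500 tokens with 50 overlap) cuts through functions and classes and destroys code structure. Semantic chunking splits on logical boundaries such as class and function definitions, falling back to smaller units only when a block exceeds the size limit.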
from langchain_text_splitters import (
    RecursiveCharacterTextSplitter,
    Language,
)

# Language-aware splitting for Python code.
# chunk_size and chunk_overlap are measured in characters by default, not tokens.
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=800,
    chunk_overlap=100,
)

# For markdown/docs: prefer headers as split points
md_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.MARKDOWN,
    chunk_size=1000,
    chunk_overlap=150,
)

def infer_chunk_type(chunk: str) -> str:
    """Classify a chunk by its first statement."""
    head = chunk.lstrip()
    if head.startswith("class "):
        return "class"
    if head.startswith(("def ", "async def ")):
        return "function"
    if head.startswith(("import ", "from ")):
        return "import"
    return "other"

def estimate_complexity(chunk: str) -> int:
    """Rough proxy (an assumption, not a standard metric): count branching keywords."""
    keywords = ("if ", "for ", "while ", "except", "with ")
    return sum(chunk.count(keyword) for keyword in keywords)

def semantic_chunk_python(source_code: str, file_path: str = "") -> list[dict]:
    """Chunk Python code, preferring class/function boundaries as split points."""
    chunks = python_splitter.split_text(source_code)
    return [
        {
            "content": chunk,
            "chunk_type": infer_chunk_type(chunk),  # class, function, import, or other
            "complexity": estimate_complexity(chunk),
            "file_path": file_path,
        }
        for chunk in chunks
    ]
Key insight: store chunk metadata (type, complexity, file path) as payload. This enables metadata filtering during retrieval — e.g., "only search class definitions."
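If you want chunks that never split a definition, one alternative is to chunk on the syntax tree instead of on characters. The sketch below uses only the standard library `ast` module; `chunk_by_definition` and `filter_chunks` are hypothetical names, and this swaps in a different technique from the character-based splitter above.

```python
import ast


def chunk_by_definition(source_code: str, file_path: str = "<unknown>") -> list[dict]:
    """Emit one chunk per top-level statement, with metadata for filtering."""
    tree = ast.parse(source_code)
    chunks = []
    for node in tree.body:
        if isinstance(node, ast.ClassDef):
            chunk_type = "class"
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            chunk_type = "function"
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            chunk_type = "import"
        else:
            chunk_type = "other"
        chunks.append({
            # Exact source text of this statement (decorators above it are not included)
            "content": ast.get_source_segment(source_code, node),
            "chunk_type": chunk_type,
            "file_path": file_path,
            "start_line": node.lineno,
        })
    return chunks


def filter_chunks(chunks: list[dict], chunk_type: str) -> list[dict]:
    """Metadata filter: keep only chunks of one type, e.g. 'class'."""
    return [chunk for chunk in chunks if chunk["chunk_type"] == chunk_type]


sample = '''import os

class Greeter:
    def hello(self):
        return "hi"

def main():
    return Greeter().hello()
'''
chunks = chunk_by_definition(sample, file_path="example.py")
functions_only = filter_chunks(chunks, "function")
```

The trade-off is that a very large class becomes one very large chunk, so in practice you may still want a size-based fallback for oversized definitions.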
Step 3: Cross-Encoder Reranking
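Bi-encoders (like text-embedding-3-small) are fast but approximate, because the query and each document are embedded independently. Cross-encoders score each query-document pair jointly, which is slower but usually more accurate for reranking a short list of candidates.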
from sentence_transformers import CrossEncoder

# Load a cross-encoder reranker
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank_with_cross_encoder(query: str, candidates: list[dict], top_k: int = 5) -> list[dict]:
    """Rerank retrieved documents using a cross-encoder."""
    if not candidates:
        return []
    pairs = [(query, doc["content"]) for doc in candidates]
    scores = reranker.predict(pairs)
    # Sort by cross-encoder score, highest first
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]

# Usage in your RAG pipeline
def retrieve_rerank_generate(query: str):
    # Step 1: Hybrid retrieval (fast); assumes hybrid_search returns dicts with a "content" key
    candidates = hybrid_search(query, top_k=20)
    # Step 2: Cross-encoder reranking (accurate)
    refined = rerank_with_cross_encoder(query, candidates, top_k=5)
    # Step 3: Generate answer; generate_with_context is your own LLM call, not shown here
    return generate_with_context(query, refined)
Production tip: cache cross-encoder scores for repeated queries. Reranking can be around 10x slower than dense retrieval, but it only runs on the 20 candidates, not your entire corpus.
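The caching tip can be sketched as a thin wrapper around any pair-scoring function. `CachedReranker` and `overlap_score` are hypothetical names; the word-overlap scorer is only a stand-in so the example runs without downloading a model, and the in-memory dict is unbounded, which you would want to cap in production.

```python
import hashlib
from typing import Callable, Sequence


class CachedReranker:
    """Caches one score per (query, content) pair across calls."""

    def __init__(self, score_fn: Callable[[list], Sequence[float]]):
        self.score_fn = score_fn  # takes a list of (query, content) pairs
        self._cache: dict[str, float] = {}
        self.misses = 0  # number of pairs actually sent to score_fn

    @staticmethod
    def _key(query: str, content: str) -> str:
        return hashlib.sha256(f"{query}\x00{content}".encode("utf-8")).hexdigest()

    def rerank(self, query: str, candidates: list[dict], top_k: int = 5) -> list[dict]:
        keys = [self._key(query, doc["content"]) for doc in candidates]
        missing = [(key, doc) for key, doc in zip(keys, candidates) if key not in self._cache]
        if missing:
            # Score only the pairs we have not seen before, in one batch
            pairs = [(query, doc["content"]) for _, doc in missing]
            for (key, _), score in zip(missing, self.score_fn(pairs)):
                self._cache[key] = float(score)
            self.misses += len(missing)
        ranked = sorted(zip(candidates, keys), key=lambda item: self._cache[item[1]], reverse=True)
        return [doc for doc, _ in ranked[:top_k]]


def overlap_score(pairs):
    """Stand-in scorer: number of words shared by query and document."""
    return [len(set(q.lower().split()) & set(d.lower().split())) for q, d in pairs]


cached = CachedReranker(overlap_score)
docs = [
    {"content": "token refresh logic"},
    {"content": "database pool"},
    {"content": "refresh on expiry"},
]
first = cached.rerank("token refresh", docs, top_k=2)
second = cached.rerank("token refresh", docs, top_k=2)  # served entirely from cache
```

With the model from this step, `CachedReranker(reranker.predict)` would plug in the real cross-encoder, since `predict` also takes a list of pairs.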
Step 4: Self-Query Retrieval — Let the LLM Write Its Own Filters
Instead of blindly retrieving, let an LLM analyze the query and construct a structured retrieval plan with metadata filters.
from pydantic import BaseModel, Field
from openai import OpenAI

# Named llm_client so it does not shadow the Qdrant `client` from Step 1
llm_client = OpenAI()

class MetadataFilter(BaseModel):
    """One metadata constraint. Structured outputs need a fixed schema, so a free-form dict is avoided."""
    key: str = Field(description="Payload field to filter on, e.g. file_path")
    pattern: str = Field(description="Glob pattern the field must match")

class RetrievalPlan(BaseModel):
    """Structured plan for document retrieval."""
    query_rewrite: str = Field(description="Optimized search query")
    filters: list[MetadataFilter] = Field(description="Metadata filters to apply")
    chunk_types: list[str] = Field(description="Preferred chunk types")
    needs_code: bool = Field(description="Whether code examples are needed")
    max_results: int = Field(description="Number of results to retrieve")

def build_retrieval_plan(user_query: str) -> RetrievalPlan:
    response = llm_client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Analyze the query and create a retrieval plan."},
            {"role": "user", "content": f"Query: {user_query}"},
        ],
        response_format=RetrievalPlan,
    )
    return response.choices[0].message.parsed

# Example: "How does the AuthMiddleware class handle token refresh?"
# May produce something like:
# RetrievalPlan(
#     query_rewrite="AuthMiddleware token refresh implementation",
#     filters=[MetadataFilter(key="file_path", pattern="*middleware*")],
#     chunk_types=["class", "function"],
#     needs_code=True,
#     max_results=8
# )
This transforms naive similarity search into intelligent document navigation — the LLM decides what to look for and where to look before any retrieval happens.
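A plan is only useful once its filters are applied to chunk metadata. Here is a minimal sketch of that step, assuming the filters have been reduced to a plain mapping of metadata field to glob pattern; `select_chunks` is a hypothetical helper and the file paths are invented. In a real deployment you would usually translate the plan into your vector store's native filter syntax instead of filtering in Python.

```python
from fnmatch import fnmatchcase


def select_chunks(chunks, field_globs, chunk_types, max_results):
    """Keep chunks whose metadata satisfies every glob and whose type is allowed.

    field_globs: mapping of metadata field -> glob pattern, e.g. {"file_path": "*middleware*"}
    chunk_types: allowed chunk types; an empty list means "any type"
    """
    selected = []
    for chunk in chunks:
        if chunk_types and chunk.get("chunk_type") not in chunk_types:
            continue
        if all(
            fnmatchcase(str(chunk.get(field, "")), pattern)
            for field, pattern in field_globs.items()
        ):
            selected.append(chunk)
    return selected[:max_results]


# Hypothetical chunk metadata
corpus = [
    {"file_path": "src/middleware/auth.py", "chunk_type": "class"},
    {"file_path": "src/middleware/auth.py", "chunk_type": "import"},
    {"file_path": "src/db/pool.py", "chunk_type": "function"},
]
matches = select_chunks(
    corpus,
    field_globs={"file_path": "*middleware*"},
    chunk_types=["class", "function"],
    max_results=8,
)
```

Filtering before similarity search shrinks the candidate set, so the vector search only has to rank chunks that could plausibly answer the question.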
Step 5: Graph-RAG — Knowledge Graphs Meet Vector Search
Vector search finds similar text. Graph retrieval finds related concepts. Graph-RAG combines both by building a knowledge graph from your documents and traversing it during retrieval.
import networkx as nx
from collections import defaultdict

class DocumentGraph:
    """Knowledge graph built from code documentation."""

    def __init__(self):
        self.graph = nx.DiGraph()
        self.entities = defaultdict(list)  # entity_name -> node_ids

    def add_function(self, func_name: str, file_path: str,
                     calls: list[str], imports: list[str]):
        node_id = f"func:{func_name}"
        self.graph.add_node(node_id, type="function", file=file_path)
        for called in calls:
            self.graph.add_edge(node_id, f"func:{called}", relation="calls")
        for imp in imports:
            self.graph.add_edge(node_id, f"module:{imp}", relation="imports")
        self.entities[func_name].append(node_id)

    def graph_augmented_retrieve(self, query: str, top_k: int = 5):
        """Retrieve using vector search + graph traversal."""
        # Step 1: Standard vector retrieval.
        # vector_search is assumed to return graph node IDs such as "func:process_payment".
        seed_nodes = [n for n in vector_search(query, top_k=3) if n in self.graph]
        # Step 2: Expand via graph traversal.
        # nx.descendants and nx.ancestors take no cutoff argument, so bound the hops explicitly.
        expanded = set()
        for node in seed_nodes:
            # Everything reachable within 2 outgoing hops (includes the seed itself)
            expanded.update(nx.single_source_shortest_path_length(self.graph, node, cutoff=2))
            # Direct callers and importers (1 hop upstream)
            expanded.update(self.graph.predecessors(node))
        # Step 3: Re-rank expanded set by graph centrality + vector similarity
        # (_rank_by_centrality_and_similarity is left for you to implement for your corpus)
        return self._rank_by_centrality_and_similarity(expanded, query, top_k)
Graph-RAG excels when answers require multi-hop reasoning: "What happens when process_payment() fails?" requires traversing the call graph to find error handlers, retry logic, and notification callbacks — something vector search alone cannot discover.
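To see what the bounded expansion buys you, here is a dependency-free sketch of the hop-limited traversal on a toy call graph. Everything except `process_payment` is an invented function name, and `within_hops` is a plain breadth-first search written for illustration, not a library call.

```python
from collections import deque


def within_hops(adjacency, start, max_hops):
    """Nodes reachable from start in at most max_hops edges (start included)."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depth[node] == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in depth:
                depth[neighbor] = depth[node] + 1
                queue.append(neighbor)
    return set(depth)


def reverse_edges(adjacency):
    """Flip caller -> callee edges into callee -> caller edges."""
    reversed_adjacency = {}
    for source, targets in adjacency.items():
        for target in targets:
            reversed_adjacency.setdefault(target, []).append(source)
    return reversed_adjacency


# Toy call graph: caller -> callees (hypothetical names)
calls = {
    "checkout": ["process_payment"],
    "process_payment": ["charge_card", "handle_payment_error"],
    "handle_payment_error": ["retry_payment", "notify_user"],
    "retry_payment": ["charge_card"],
    "notify_user": ["send_email"],
}

downstream = within_hops(calls, "process_payment", max_hops=2)
upstream = within_hops(reverse_edges(calls), "process_payment", max_hops=1)
```

Two hops downstream reaches the error handler, the retry logic, and the notification callback, while `send_email` at three hops stays out. That hop limit is what keeps the expanded set from swallowing the whole codebase.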
Step 6: Agentic RAG — Autonomous Retrieval Loops
The final evolution: instead of a single retrieval step, an agent loops through retrieve → evaluate → decide → retrieve-again until it has enough context.
import json
from dataclasses import dataclass
from typing import Literal

@dataclass
class Decision:
    """Parsed form of the evaluator's JSON reply."""
    action: Literal["answer", "retrieve", "reflect"]
    search_query: str = ""
    confidence: float = 0.0
    reasoning: str = ""

class AgenticRAG:
    # _generate_answer and _restructure_context are left for you to implement
    def __init__(self, retriever, llm, max_iterations: int = 5):
        self.retriever = retriever
        self.llm = llm
        self.max_iterations = max_iterations

    def query(self, question: str) -> str:
        context = []
        thoughts = []
        for iteration in range(self.max_iterations):
            # Agent decides: do I have enough context?
            decision = self._evaluate_context(question, context)
            if decision.action == "answer":
                return self._generate_answer(question, context, thoughts)
            elif decision.action == "retrieve":
                # Agent formulates a NEW search query
                results = self.retriever.retrieve(decision.search_query)
                context.extend(results)
                thoughts.append(f"Iteration {iteration}: searched for '{decision.search_query}'")
            elif decision.action == "reflect":
                # Agent realizes it needs to rethink the approach
                thoughts.append(f"Iteration {iteration}: reflecting on approach")
                context = self._restructure_context(context)
        # Ran out of iterations: answer with what we have
        return self._generate_answer(question, context, thoughts, partial=True)

    def _evaluate_context(self, question: str, context: list) -> Decision:
        """LLM evaluates whether current context is sufficient."""
        # Assumes self.llm.chat(messages) returns the model's reply as a JSON string
        response = self.llm.chat([
            {"role": "system", "content": """
                Evaluate if the gathered context answers the question.
                Respond with JSON: {"action": "answer|retrieve|reflect",
                "search_query": "...", "confidence": 0.0-1.0,
                "reasoning": "..."}
            """},
            {"role": "user", "content": f"Question: {question}\nContext: {context}"}
        ])
        # Parse into a Decision so that decision.action works in query()
        return Decision(**json.loads(response))
The agent loop is the difference between "I found 3 chunks" and "I searched, realized I needed error handling code, searched again for the exception handler, found it, and now I can answer."
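The control flow is easier to test once the LLM and the retriever are swapped for stubs. The sketch below strips the loop down to the retrieve and answer actions; `run_agent`, `stub_decide`, `stub_retrieve`, and the two-entry corpus are all hypothetical stand-ins, so it shows the loop mechanics only, not real model behavior.

```python
def run_agent(question, decide, retrieve, max_iterations=5):
    """Loop: ask the policy what to do, retrieve if asked, stop when it says answer."""
    context, trace = [], []
    for _ in range(max_iterations):
        decision = decide(question, context)
        if decision["action"] == "answer":
            return {"context": context, "trace": trace, "partial": False}
        if decision["action"] == "retrieve":
            results = retrieve(decision["search_query"])
            # Skip chunks already gathered so repeated searches do not bloat the prompt
            new_chunks = [chunk for chunk in results if chunk not in context]
            context.extend(new_chunks)
            trace.append(decision["search_query"])
    # Iteration budget exhausted: the caller should treat the context as incomplete
    return {"context": context, "trace": trace, "partial": True}


# Stub retriever: a lookup table instead of a vector store
CORPUS = {
    "process_payment failure": ["process_payment raises PaymentError on decline"],
    "PaymentError handler": ["handle_payment_error retries twice, then notifies the user"],
}


def stub_retrieve(search_query):
    return CORPUS.get(search_query, [])


# Stub policy: scripted decisions instead of an LLM call
def stub_decide(question, context):
    if not context:
        return {"action": "retrieve", "search_query": "process_payment failure"}
    if "handle_payment_error" not in " ".join(context):
        return {"action": "retrieve", "search_query": "PaymentError handler"}
    return {"action": "answer"}


result = run_agent("What happens when process_payment() fails?", stub_decide, stub_retrieve)
```

Here `result["trace"]` records two searches before the policy decides it can answer. Keeping the decision policy injectable like this also lets you unit-test the iteration cap without paying for model calls.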
Step 7: Evaluation with RAGAS
You can't improve what you can't measure. RAGAS provides metrics for RAG system quality.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,       # Does the answer stick to retrieved context?
    answer_relevancy,   # Does the answer actually address the question?
    context_precision,  # Are the retrieved chunks relevant?
    context_recall,     # Did we retrieve all the necessary context?
)

# Test dataset. Column names follow the ragas 0.1.x schema; check the docs for your installed version.
test_data = [
    {
        "question": "How does the middleware handle expired tokens?",
        "answer": "The middleware checks token expiry...",
        "contexts": ["The AuthMiddleware class...", "Token refresh logic..."],
        "ground_truth": "It calls refresh_token() and retries..."
    },
    # ... more test cases
]

# evaluate() expects a Dataset object rather than a plain list of dicts
results = evaluate(
    dataset=Dataset.from_list(test_data),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(results)
# Example output (your numbers will differ):
# faithfulness: 0.89 answer_relevancy: 0.92
# context_precision: 0.85 context_recall: 0.78
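RAGAS metrics are judged by an LLM, which costs time and tokens. If you have hand-labeled relevant chunk IDs, a cheap label-based check can catch retrieval regressions between full evaluation runs. The sketch below is plain precision and recall at k, not the RAGAS computation; the function name and chunk IDs are made up.

```python
def precision_recall_at_k(retrieved_ids, relevant_ids, k):
    """Label-based retrieval quality for one question.

    precision: share of the top-k retrieved chunks that are labeled relevant
    recall:    share of the labeled relevant chunks that appear in the top k
    """
    top = retrieved_ids[:k]
    hits = sum(1 for chunk_id in top if chunk_id in relevant_ids)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall


retrieved = ["c1", "c7", "c3", "c9"]  # retriever output, best first
relevant = {"c1", "c3", "c4"}         # hand-labeled ground truth
precision, recall = precision_recall_at_k(retrieved, relevant, k=4)
# precision is 0.5 (2 of 4 retrieved are relevant); recall is 2/3 (c4 was never retrieved)
```

Low recall points at retrieval (chunking, hybrid weights, graph expansion), while low precision points at ranking, which tells you which of the earlier steps to revisit first.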
Putting It All Together
A production Agentic RAG pipeline, built on a semantically chunked index (Step 2), chains the following stages:
- Self-query planning — LLM analyzes the question
- Hybrid retrieval — BM25 + dense vectors
- Graph expansion — Traverse knowledge graph for related nodes
- Cross-encoder reranking — Precise relevance scoring
- Agentic evaluation loop — Retrieve more if needed
- Generation — Answer with grounded context
- RAGAS evaluation — Measure and improve
The jump from naive RAG to agentic Graph-RAG is more than incremental: it is the difference between a system that finds similar text and one that can follow the structure of your codebase. Start with hybrid search, semantic chunking, and reranking (Steps 1-3), add self-query retrieval (Step 4), then graduate to graph and agentic patterns as your use case demands.
Happy building. The future of code assistants isn't just smarter models — it's smarter retrieval.