Building Production-Ready Agentic RAG: From Vector Search to Graph-Aware Retrieval

Retrieval-Augmented Generation (RAG) has evolved from a 2020 research paper into the backbone of enterprise AI systems. But naive "embed + retrieve + generate" pipelines fail in production. This tutorial walks you through building a production-grade Agentic RAG system that combines hybrid search, graph-based retrieval, and autonomous agent reasoning.

Why Naive RAG Fails in Production

The basic RAG pipeline — chunk documents, embed them, run cosine similarity, feed top-k to an LLM — works for demos. In production, you hit these walls:

  • Irrelevant retrieval: The top-3 most similar chunks can all be topically close yet wrong for the actual question.
  • Lost context: Related information spans multiple chunks that never surface together.
  • No reasoning over structure: Vector search doesn't understand relationships like "this function calls that function."
  • Stale knowledge: Your vector index was built last month; the docs changed yesterday.
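
For reference, the baseline described above reduces to a few lines. This is a minimal sketch in which hand-written toy vectors stand in for real embeddings; the names cosine and naive_top_k, the chunk IDs, and the vector values are all illustrative assumptions.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def naive_top_k(query_vec: list[float], index: dict[str, list[float]], k: int = 3) -> list[str]:
    """Return the IDs of the k chunks most similar to the query vector."""
    ranked = sorted(index, key=lambda cid: cosine(query_vec, index[cid]), reverse=True)
    return ranked[:k]

# Toy 3-dimensional "embeddings" (assumed values, for illustration only)
index = {
    "auth.py#login": [0.9, 0.1, 0.0],
    "auth.py#refresh": [0.7, 0.6, 0.1],
    "billing.py#charge": [0.0, 0.2, 0.9],
}
top = naive_top_k([1.0, 0.2, 0.0], index, k=2)
```

Nothing in this loop looks at keywords, code structure, or index freshness, which is where the problems listed above come from.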

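Let's work through the retrieval-quality problems one by one. Keeping the index fresh is a re-indexing concern and is not covered by the steps below.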

Step 1: Hybrid Search — BM25 + Dense Vectors

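Single-embedding retrieval misses keyword-heavy queries such as exact identifiers or error codes. Hybrid search runs lexical (BM25-style) and semantic (dense vector) search side by side and merges the two ranked lists, here with a weighted Reciprocal Rank Fusion (RRF).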

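from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, FieldCondition, Filter, MatchText, TextIndexParams, VectorParams,
)
import openai

client = QdrantClient(host="localhost", port=6333)

# Create collection with both vector and payload indexing
client.create_collection(
    collection_name="code_docs",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

# Full-text index on the payload. Note: this enables keyword matching and
# filtering, not BM25 scoring. For ranked BM25, use Qdrant sparse vectors.
client.create_payload_index(
    collection_name="code_docs",
    field_name="content",
    field_schema=TextIndexParams(type="text", tokenizer="word", min_token_len=2, max_token_len=15)
)

def rerank_rrf(dense, sparse, alpha: float, k: int = 60) -> list[dict]:
    """Weighted Reciprocal Rank Fusion. Returns payload dicts, best first."""
    scores, payloads = {}, {}
    for weight, results in ((alpha, dense), (1 - alpha, sparse)):
        for rank, point in enumerate(results, start=1):
            scores[point.id] = scores.get(point.id, 0.0) + weight / (k + rank)
            payloads[point.id] = point.payload
    ranked_ids = sorted(scores, key=scores.get, reverse=True)
    return [payloads[pid] for pid in ranked_ids]

def hybrid_search(query: str, top_k: int = 5) -> list[dict]:
    # Dense vector search
    query_embedding = openai.embeddings.create(
        model="text-embedding-3-small", input=query
    ).data[0].embedding

    dense_results = client.query_points(
        collection_name="code_docs",
        query=query_embedding,
        limit=top_k * 2  # Over-fetch for fusion
    ).points

    # Lexical candidates via the full-text index. The match requires the
    # query tokens to appear in the text, and scroll returns hits in ID
    # order, so the lexical rank is only a coarse signal.
    sparse_results, _ = client.scroll(
        collection_name="code_docs",
        scroll_filter=Filter(
            must=[FieldCondition(key="content", match=MatchText(text=query))]
        ),
        limit=top_k * 2,
    )

    # Weighted Reciprocal Rank Fusion (RRF) of the two candidate lists
    alpha = 0.7  # Dense-vs-lexical weight: a hand-set starting point, tune on your data
    return rerank_rrf(dense_results, sparse_results, alpha)[:top_k]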
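
To see how the fusion weight shifts the final ordering, here is a toy run of weighted RRF on hand-written rankings. It is a sketch under assumptions: weighted_rrf is an illustrative name, the document IDs are made up, and k=60 is simply a commonly used default for the RRF constant.

```python
def weighted_rrf(dense_ids: list[str], sparse_ids: list[str],
                 alpha: float = 0.7, k: int = 60) -> list[str]:
    """Fuse two ranked ID lists; alpha weights the dense list, 1 - alpha the sparse one."""
    scores: dict[str, float] = {}
    for weight, ids in ((alpha, dense_ids), (1 - alpha, sparse_ids)):
        for rank, doc_id in enumerate(ids, start=1):
            # Each list contributes weight / (k + rank) for every document it contains
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["a", "b", "c"]   # ranking from vector search (hypothetical)
sparse = ["c", "d", "a"]  # ranking from keyword search (hypothetical)

dense_heavy = weighted_rrf(dense, sparse, alpha=0.7)   # ['a', 'c', 'b', 'd']
sparse_heavy = weighted_rrf(dense, sparse, alpha=0.3)  # ['c', 'a', 'd', 'b']
```

Documents found by both retrievers ("a" and "c") rise to the top in either setting; alpha only decides which of them wins and how the single-source hits are ordered.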

Step 2: Semantic Chunking — Beyond Fixed-Size Splits

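Fixed-size chunking (e.g., 500 tokens with 50 overlap) cuts through functions and classes. Semantic (structure-aware) chunking tries logical boundaries first and only falls back to finer separators when a unit exceeds the size limit. Note that chunk_size below counts characters by default, not tokens.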

from langchain_text_splitters import (
    RecursiveCharacterTextSplitter,
    Language
)

# Language-aware splitting for Python code
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=800,
    chunk_overlap=100
)

# For markdown/docs: respect headers as boundaries
md_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.MARKDOWN,
    chunk_size=1000,
    chunk_overlap=150
)

import re

def estimate_complexity(chunk: str) -> int:
    """Rough heuristic (an assumption, not a standard metric): count branching keywords."""
    return len(re.findall(r"\b(?:if|elif|for|while|except|and|or)\b", chunk))

def infer_chunk_type(chunk: str) -> str:
    stripped = chunk.strip()
    if stripped.startswith("class "):
        return "class"
    if stripped.startswith(("def ", "async def ")):
        return "function"
    if stripped.startswith(("import ", "from ")):
        return "import"
    return "other"

def semantic_chunk_python(source_code: str) -> list[dict]:
    """Chunk Python code, preferring class/function boundaries as split points."""
    chunks = python_splitter.split_text(source_code)
    return [
        {
            "content": chunk,
            "chunk_type": infer_chunk_type(chunk),  # class, function, import, or other
            "complexity": estimate_complexity(chunk),
        }
        for chunk in chunks
    ]

Key insight: store chunk metadata (type, complexity, file path) as payload. This enables metadata filtering during retrieval — e.g., "only search class definitions."
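
When exact boundaries matter more than a size budget, Python's standard ast module can cut a file at its top-level definitions and produce the metadata at the same time. The following is a sketch of that alternative, not a replacement for the splitter above: chunk_by_ast and its dict keys are hypothetical names, and oversized classes would still need a second splitting pass.

```python
import ast

def chunk_by_ast(source_code: str, file_path: str = "<unknown>") -> list[dict]:
    """Return one chunk per top-level statement, tagged with type and location."""
    chunks = []
    for node in ast.parse(source_code).body:
        if isinstance(node, ast.ClassDef):
            chunk_type = "class"
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            chunk_type = "function"
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            chunk_type = "import"
        else:
            chunk_type = "other"
        chunks.append({
            "content": ast.get_source_segment(source_code, node),
            "chunk_type": chunk_type,
            "file_path": file_path,
            "start_line": node.lineno,
        })
    return chunks

sample = '''import os

def helper(x):
    return x + 1

class Service:
    def run(self):
        return helper(1)
'''

chunks = chunk_by_ast(sample, file_path="service.py")
# Metadata filtering: "only search class definitions"
classes = [c for c in chunks if c["chunk_type"] == "class"]
```

Comments that sit between top-level definitions are not part of any node, so this approach drops them; keep that in mind if your comments carry important context.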

Step 3: Cross-Encoder Reranking

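Bi-encoders (like text-embedding-3-small) are fast but approximate, because query and document are embedded independently. Cross-encoders read the query and a candidate together, which is slower but usually more accurate for reranking the top candidates.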

from sentence_transformers import CrossEncoder

# Load a cross-encoder reranker
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank_with_cross_encoder(query: str, candidates: list[dict], top_k: int = 5):
    """Rerank retrieved documents using cross-encoder."""
    pairs = [(query, doc["content"]) for doc in candidates]
    scores = reranker.predict(pairs)

    # Sort by cross-encoder score
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, score in ranked[:top_k]]

# Usage in your RAG pipeline
def retrieve_rerank_generate(query: str):
    # Step 1: Hybrid retrieval (fast)
    candidates = hybrid_search(query, top_k=20)

    # Step 2: Cross-encoder reranking (accurate)
    refined = rerank_with_cross_encoder(query, candidates, top_k=5)

    # Step 3: Generate answer
    return generate_with_context(query, refined)

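Production tip: cache cross-encoder scores for repeated queries. Reranking can be an order of magnitude slower per document than dense retrieval, since every query-document pair needs its own forward pass, but it only runs on the 20 shortlisted candidates, not your entire corpus.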
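
One way to implement that cache is a thin wrapper around the scoring function. This is a sketch under assumptions: CachedScorer is a hypothetical helper, the cache is an unbounded in-process dict (a real deployment would want eviction or an external store), and fake_scorer only stands in for a real model.

```python
import hashlib
from typing import Callable, Sequence

class CachedScorer:
    """Memoize (query, document) scores so repeated pairs skip the model."""

    def __init__(self, score_pairs: Callable[[list[tuple[str, str]]], Sequence[float]]):
        self._score_pairs = score_pairs
        self._cache: dict[tuple[str, str], float] = {}

    def score(self, query: str, docs: list[str]) -> list[float]:
        # Hash document text so cache keys stay small for long chunks
        keys = [(query, hashlib.sha256(d.encode("utf-8")).hexdigest()) for d in docs]
        missing = [(key, doc) for key, doc in zip(keys, docs) if key not in self._cache]
        if missing:
            # Score only the uncached pairs, in a single batch
            new_scores = self._score_pairs([(query, doc) for _, doc in missing])
            for (key, _), value in zip(missing, new_scores):
                self._cache[key] = float(value)
        return [self._cache[key] for key in keys]

# Stand-in scorer that records batch sizes (a real one would call the model)
batch_sizes: list[int] = []

def fake_scorer(pairs: list[tuple[str, str]]) -> list[float]:
    batch_sizes.append(len(pairs))
    return [float(len(doc)) for _, doc in pairs]

scorer = CachedScorer(fake_scorer)
first = scorer.score("q", ["aa", "bbbb"])
second = scorer.score("q", ["aa", "cccccc"])  # only "cccccc" is scored again
```

With the reranker from this step, the wrapper would be constructed as CachedScorer(reranker.predict).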

Step 4: Self-Query Retrieval — Let the LLM Write Its Own Filters

Instead of blindly retrieving, let an LLM analyze the query and construct a structured retrieval plan with metadata filters.

from pydantic import BaseModel, Field
from openai import OpenAI

llm_client = OpenAI()  # Named to avoid shadowing the Qdrant `client` from Step 1

class MetadataFilter(BaseModel):
    """One metadata condition, e.g. key="file_path", pattern="*middleware*"."""
    key: str = Field(description="Payload field to filter on")
    pattern: str = Field(description="Glob pattern the field value must match")

class RetrievalPlan(BaseModel):
    """Structured plan for document retrieval."""
    query_rewrite: str = Field(description="Optimized search query")
    # Structured outputs need explicitly typed fields, so filters are a list
    # of typed conditions rather than a free-form dict.
    filters: list[MetadataFilter] = Field(description="Metadata filters to apply")
    chunk_types: list[str] = Field(description="Preferred chunk types")
    needs_code: bool = Field(description="Whether code examples are needed")
    max_results: int = Field(description="Number of results to retrieve")

def build_retrieval_plan(user_query: str) -> RetrievalPlan:
    response = llm_client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Analyze the query and create a retrieval plan."},
            {"role": "user", "content": f"Query: {user_query}"}
        ],
        response_format=RetrievalPlan
    )
    return response.choices[0].message.parsed

# Example: "How does the AuthMiddleware class handle token refresh?"
# Might produce (illustrative):
# RetrievalPlan(
#     query_rewrite="AuthMiddleware token refresh implementation",
#     filters=[MetadataFilter(key="file_path", pattern="*middleware*")],
#     chunk_types=["class", "function"],
#     needs_code=True,
#     max_results=8
# )

This turns plain similarity search into planned retrieval: the LLM decides what to look for and where to look before any retrieval happens.
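
A plan only helps if something enforces it. The sketch below applies the chunk-type and file-path constraints from the example plan to a list of candidate chunks. It assumes chunks are dicts with chunk_type and file_path keys; apply_plan_filters is a hypothetical helper, and the paths are invented for illustration.

```python
from fnmatch import fnmatch
from typing import Optional

def apply_plan_filters(chunks: list[dict], chunk_types: list[str],
                       file_path_glob: Optional[str] = None) -> list[dict]:
    """Keep chunks whose type is allowed and whose path matches the glob."""
    selected = []
    for chunk in chunks:
        if chunk_types and chunk.get("chunk_type") not in chunk_types:
            continue
        if file_path_glob and not fnmatch(chunk.get("file_path", ""), file_path_glob):
            continue
        selected.append(chunk)
    return selected

candidates = [
    {"content": "class AuthMiddleware: ...", "chunk_type": "class",
     "file_path": "app/middleware/auth.py"},
    {"content": "import jwt", "chunk_type": "import",
     "file_path": "app/middleware/auth.py"},
    {"content": "def charge(): ...", "chunk_type": "function",
     "file_path": "app/billing.py"},
]

matches = apply_plan_filters(candidates, ["class", "function"], "*middleware*")
```

Post-filtering like this is the simplest option. On a large corpus it is usually better to translate the same conditions into the vector store's native filter so that excluded chunks never enter the candidate set.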

Step 5: Graph-RAG — Knowledge Graphs Meet Vector Search

Vector search finds similar text. Graph retrieval finds related concepts. Graph-RAG combines both by building a knowledge graph from your documents and traversing it during retrieval.

import networkx as nx
from collections import defaultdict

class DocumentGraph:
    """Knowledge graph built from code documentation."""

    def __init__(self):
        self.graph = nx.DiGraph()
        self.entities = defaultdict(list)  # entity_name -> node_ids

    def add_function(self, func_name: str, file_path: str,
                     calls: list[str], imports: list[str]):
        node_id = f"func:{func_name}"
        self.graph.add_node(node_id, type="function", file=file_path)

        for called in calls:
            self.graph.add_edge(node_id, f"func:{called}", relation="calls")

        for imp in imports:
            self.graph.add_edge(node_id, f"module:{imp}", relation="imports")

        self.entities[func_name].append(node_id)

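    def graph_augmented_retrieve(self, query: str, top_k: int = 5):
        """Retrieve using vector search + graph traversal."""
        # Step 1: Standard vector retrieval.
        # vector_search is assumed to return graph node IDs such as "func:foo".
        seed_nodes = vector_search(query, top_k=3)

        # Step 2: Expand via bounded traversal: callees within 2 hops,
        # callers within 1 hop. nx.descendants and nx.ancestors take no hop
        # limit, so use a shortest-path search with a cutoff instead.
        reversed_graph = self.graph.reverse(copy=False)
        expanded = set()
        for node in seed_nodes:
            if node not in self.graph:
                continue
            expanded.update(nx.single_source_shortest_path_length(self.graph, node, cutoff=2))
            expanded.update(nx.single_source_shortest_path_length(reversed_graph, node, cutoff=1))

        # Step 3: Re-rank expanded set by graph centrality + vector similarity
        # (ranking helper not shown here)
        return self._rank_by_centrality_and_similarity(expanded, query, top_k)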

Graph-RAG excels when answers require multi-hop reasoning: "What happens when process_payment() fails?" requires traversing the call graph to find error handlers, retry logic, and notification callbacks — something vector search alone cannot discover.
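
The expansion step can be illustrated on a toy call graph. This sketch uses plain dicts and a breadth-first search so it runs without networkx; the function names in the graph are hypothetical, chosen to mirror the process_payment() example.

```python
from collections import deque

def neighbors_within(adjacency: dict[str, list[str]], start: str, max_hops: int) -> set[str]:
    """Return start plus every node reachable from it in at most max_hops edges."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # Do not expand beyond the hop limit
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return seen

# Hypothetical call graph: caller -> callees
calls = {
    "process_payment": ["charge_card", "handle_payment_error"],
    "handle_payment_error": ["retry_payment", "notify_user"],
    "retry_payment": ["charge_card"],
    "notify_user": ["send_email"],
}

reachable = neighbors_within(calls, "process_payment", max_hops=2)
```

With a 2-hop limit the error handler, retry logic, and notification callback are all pulled in, while send_email (3 hops away) is left out. The hop limit is the main knob for trading recall against context size.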

Step 6: Agentic RAG — Autonomous Retrieval Loops

The final evolution: instead of a single retrieval step, an agent loops through retrieve → evaluate → decide → retrieve-again until it has enough context.

import json
from dataclasses import dataclass
from typing import Literal

@dataclass
class Decision:
    """Parsed output of the context evaluator."""
    action: Literal["answer", "retrieve", "reflect"]
    search_query: str = ""
    confidence: float = 0.0
    reasoning: str = ""

class AgenticRAG:
    # Assumes llm.chat(messages) returns the model's reply as a JSON string and
    # retriever.retrieve(query) returns a list of documents. The helpers
    # _generate_answer and _restructure_context are not shown.
    def __init__(self, retriever, llm, max_iterations: int = 5):
        self.retriever = retriever
        self.llm = llm
        self.max_iterations = max_iterations

    def query(self, question: str) -> str:
        context = []
        thoughts = []

        for iteration in range(self.max_iterations):
            # Agent decides: do I have enough context?
            decision = self._evaluate_context(question, context)

            if decision.action == "answer":
                return self._generate_answer(question, context, thoughts)

            elif decision.action == "retrieve":
                # Agent formulates a NEW search query
                results = self.retriever.retrieve(decision.search_query)
                context.extend(results)
                thoughts.append(f"Iteration {iteration}: searched for '{decision.search_query}'")

            elif decision.action == "reflect":
                # Agent realizes it needs to rethink the approach
                thoughts.append(f"Iteration {iteration}: reflecting on approach")
                context = self._restructure_context(context)

        # Ran out of iterations: answer with what we have
        return self._generate_answer(question, context, thoughts, partial=True)

    def _evaluate_context(self, question: str, context: list) -> Decision:
        """LLM evaluates whether current context is sufficient."""
        response = self.llm.chat([
            {"role": "system", "content": """
                Evaluate if the gathered context answers the question.
                Respond with JSON: {"action": "answer|retrieve|reflect",
                "search_query": "...", "confidence": 0.0-1.0,
                "reasoning": "..."}
            """},
            {"role": "user", "content": f"Question: {question}\nContext: {context}"}
        ])
        # Convert the JSON reply into a Decision so query() can use attribute access
        return Decision(**json.loads(response))

The agent loop is the difference between "I found 3 chunks" and "I searched, realized I needed error handling code, searched again for the exception handler, found it, and now I can answer."
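
The loop depends on the evaluator returning well-formed JSON, which an LLM will not always do. A defensive parser keeps one malformed reply from crashing the agent. This is a sketch of one possible policy, not a prescribed one: parse_decision and AgentDecision are hypothetical names, and falling back to another retrieval with the original question is an assumption that relies on the loop's iteration cap to terminate.

```python
import json
from dataclasses import dataclass

VALID_ACTIONS = {"answer", "retrieve", "reflect"}

@dataclass
class AgentDecision:
    action: str
    search_query: str = ""
    confidence: float = 0.0
    reasoning: str = ""

def parse_decision(raw: str, fallback_query: str) -> AgentDecision:
    """Parse evaluator output; fall back to a retrieval step if it is unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}

    action = data.get("action")
    if not isinstance(action, str) or action not in VALID_ACTIONS:
        return AgentDecision("retrieve", fallback_query, 0.0, "unparseable evaluator output")

    query = str(data.get("search_query") or "")
    if action == "retrieve" and not query:
        query = fallback_query  # A retrieve action without a query reuses the question

    try:
        confidence = float(data.get("confidence", 0.0))
    except (TypeError, ValueError):
        confidence = 0.0
    return AgentDecision(action, query, confidence, str(data.get("reasoning", "")))

good = parse_decision('{"action": "answer", "confidence": 0.9}', "original question")
bad = parse_decision("Sure! Here is my evaluation...", "original question")
```

Logging the raw reply whenever the fallback fires makes it easier to tell prompt problems apart from retrieval problems later.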

Step 7: Evaluation with RAGAS

You can't improve what you can't measure. RAGAS provides metrics for RAG system quality.

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,      # Does the answer stick to retrieved context?
    answer_relevancy,  # Does the answer actually address the question?
    context_precision, # Are the retrieved chunks relevant?
    context_recall     # Did we retrieve all the necessary context?
)

# Test dataset. Column names follow the ragas 0.1.x schema; newer releases
# use a different schema, so check the docs for your installed version.
test_data = [
    {
        "question": "How does the middleware handle expired tokens?",
        "answer": "The middleware checks token expiry...",
        "contexts": ["The AuthMiddleware class...", "Token refresh logic..."],
        "ground_truth": "It calls refresh_token() and retries..."
    },
    # ... more test cases
]

results = evaluate(
    dataset=Dataset.from_list(test_data),  # evaluate expects a Dataset, not a list
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall]
)

print(results)
# Example output (illustrative):
# faithfulness: 0.89  answer_relevancy: 0.92
# context_precision: 0.85  context_recall: 0.78
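
RAGAS metrics are judged by an LLM, which costs time and money per run. If your test cases are labeled with the chunk IDs that should be retrieved, a cheap set-based check can serve as a retrieval regression test between full evaluations. The sketch below is only that kind of check, not how RAGAS computes context_precision or context_recall; precision_recall and the chunk IDs are illustrative.

```python
def precision_recall(retrieved_ids: list[str], relevant_ids: set[str]) -> tuple[float, float]:
    """Set-based precision and recall of retrieved chunk IDs against labeled ones."""
    unique_retrieved = set(retrieved_ids)
    if not unique_retrieved or not relevant_ids:
        return 0.0, 0.0
    hits = len(unique_retrieved & relevant_ids)
    return hits / len(unique_retrieved), hits / len(relevant_ids)

# Hypothetical case: 4 chunks retrieved, 3 labeled as relevant, 2 in common
precision, recall = precision_recall(["c1", "c2", "c3", "c4"], {"c1", "c3", "c9"})
```

Here precision is 2/4 and recall is 2/3. A drop in recall after a chunking or indexing change is an early signal to rerun the full evaluation.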

Putting It All Together

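A production Agentic RAG pipeline combines the techniques above. Semantic chunking (Step 2) runs at indexing time; at query time and afterwards, the stages are: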

  1. Self-query planning — LLM analyzes the question
  2. Hybrid retrieval — BM25 + dense vectors
  3. Graph expansion — Traverse knowledge graph for related nodes
  4. Cross-encoder reranking — Precise relevance scoring
  5. Agentic evaluation loop — Retrieve more if needed
  6. Generation — Answer with grounded context
  7. RAGAS evaluation — Measure and improve
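
The query-time stages in this list can be wired together as a small orchestration function. This is a sketch under assumptions: every stage is injected as a plain callable, run_pipeline and its parameter names are hypothetical, and the lambdas below are stubs standing in for the real components from Steps 1 to 6. Evaluation (stage 7) runs offline and is not part of the function.

```python
from typing import Callable, Optional

def run_pipeline(
    question: str,
    plan: Callable[[str], str],
    retrieve: Callable[[str], list[str]],
    expand: Callable[[list[str]], list[str]],
    rerank: Callable[[str, list[str]], list[str]],
    next_query: Callable[[str, list[str]], Optional[str]],
    generate: Callable[[str, list[str]], str],
    max_rounds: int = 3,
) -> str:
    """Plan, then loop retrieve -> expand -> rerank until context suffices."""
    context: list[str] = []
    query: Optional[str] = plan(question)
    for _ in range(max_rounds):
        candidates = expand(retrieve(query))
        for doc in rerank(query, candidates):
            if doc not in context:  # Avoid duplicate context across rounds
                context.append(doc)
        query = next_query(question, context)  # None means "enough context"
        if query is None:
            break
    return generate(question, context)

# Stub components (illustrative only)
corpus = {
    "token refresh": ["AuthMiddleware.refresh()"],
    "token expiry error": ["TokenExpiredError handler"],
}

answer = run_pipeline(
    "How are expired tokens handled?",
    plan=lambda q: "token refresh",
    retrieve=lambda q: corpus.get(q, []),
    expand=lambda docs: docs,
    rerank=lambda q, docs: docs,
    next_query=lambda q, ctx: "token expiry error" if len(ctx) < 2 else None,
    generate=lambda q, ctx: " | ".join(ctx),
)
```

Injecting the stages as callables keeps each one testable on its own and lets you add graph expansion or the agentic loop later without rewriting the orchestration.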

The jump from naive RAG to agentic Graph-RAG is more than incremental: it is the difference between a system that finds similar text and one that can follow the structure of your codebase. Start with hybrid search, structure-aware chunking, and reranking (Steps 1-3), add self-query retrieval (Step 4), then graduate to graph and agentic patterns as your use case demands.

Happy building. The future of code assistants isn't just smarter models — it's smarter retrieval.