
Multi-Document Q&A Agents

Indexing a document corpus and answering questions across all documents.

Multi-Document Q&A Overview

A multi-document Q&A agent answers questions by retrieving relevant content from a collection of N documents, synthesizing an answer, and attributing each claim to its source.
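The attribution step can be sketched as a prompt builder that numbers each retrieved chunk so the model can cite it. This is a minimal illustration, not the lesson's implementation: `build_cited_prompt`, the dictionary keys (`title`, `page`, `text`), and the sample chunks are all hypothetical.

```python
def build_cited_prompt(question, retrieved_chunks):
    """Format retrieved chunks as numbered sources and ask for cited answers."""
    source_lines = []
    for number, chunk in enumerate(retrieved_chunks, start=1):
        # Each source gets a stable number the model can cite as [1], [2], ...
        source_lines.append(
            f"[{number}] {chunk['title']} (page {chunk['page']}): {chunk['text']}"
        )
    sources = "\n".join(source_lines)
    return (
        "Answer the question using only the sources below. "
        "After each claim, cite the supporting source number in brackets.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}"
    )


# Made-up chunks for illustration only (hypothetical documents and figures).
chunks = [
    {"title": "Q3 Report", "page": 4, "text": "Revenue grew 12% year over year."},
    {"title": "Q4 Report", "page": 2, "text": "Revenue grew 9% year over year."},
]
prompt = build_cited_prompt("How did revenue growth change?", chunks)
print(prompt)
```

Numbering the sources in the prompt makes it possible to map each bracketed citation in the answer back to a document title and page.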

Unlike single-document Q&A, a multi-document agent must reconcile conflicting information across sources and decide which documents are most relevant to the question.
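One way to reason about relevance and conflicts is to group retrieved chunks by source document before synthesis, so that disagreeing passages sit side by side. The sketch below assumes each hit carries a `doc_id` and a `distance` where lower means a closer match; the function name, field names, and sample hits are hypothetical.

```python
from collections import defaultdict


def group_hits_by_document(hits):
    """Group retrieved chunks by source document, best-matching document first."""
    grouped = defaultdict(list)
    for hit in hits:
        grouped[hit["doc_id"]].append(hit)
    ranked = []
    for doc_id, doc_hits in grouped.items():
        # Within a document, put the closest chunk first.
        doc_hits.sort(key=lambda h: h["distance"])
        ranked.append({
            "doc_id": doc_id,
            "best_distance": doc_hits[0]["distance"],
            "hits": doc_hits,
        })
    # Lower distance means a closer match, so sort ascending.
    ranked.sort(key=lambda d: d["best_distance"])
    return ranked


# Made-up hits for illustration: two policy versions that disagree.
hits = [
    {"doc_id": "policy_2022", "distance": 0.31, "text": "Refunds are allowed within 30 days."},
    {"doc_id": "policy_2024", "distance": 0.18, "text": "Refunds are allowed within 14 days."},
    {"doc_id": "policy_2022", "distance": 0.42, "text": "Contact support to start a refund."},
]
ranked_docs = group_hits_by_document(hits)
for doc in ranked_docs:
    print(doc["doc_id"], [h["text"] for h in doc["hits"]])
```

With the hits grouped this way, the agent can present both refund policies with their sources instead of silently choosing one.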

Indexing Multiple Documents

Before answering any questions, all documents must be indexed: parsed, chunked, embedded, and stored in a vector database. Each chunk is stored with metadata linking it back to its source document.

import chromadb
from chromadb.utils import embedding_functions
import os

client = chromadb.PersistentClient(path='./doc_index')
ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.getenv('OPENAI_API_KEY'),
    model_name='text-embedding-3-small'
)
collection = client.get_or_create_collection('documents', embedding_function=ef)

def index_document(doc_id, doc_path, doc_title):
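    # Parse and chunk. pdf_to_chunks is not defined in this lesson; it is
    # assumed to be a helper that returns a list of dicts with 'text' and
    # 'page' keys, as the code below requires.
    chunks = pdf_to_chunks(doc_path, chunk_size=800, overlap=150)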
    for i, chunk in enumerate(chunks):
        chunk_id = f'{doc_id}_chunk_{i}'
        collection.add(
            ids=[chunk_id],
            documents=[chunk['text']],
            metadatas=[{
                'doc_id': doc_id,
                'title': doc_title,
                'page': chunk['page'],
                'source_file': doc_path
            }]
        )
    print(f'Indexed {len(chunks)} chunks from: {doc_title}')
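The `chunk_size=800, overlap=150` arguments above suggest overlapping windows. A minimal sketch of that idea follows, assuming the sizes are measured in characters (the lesson does not say whether `pdf_to_chunks` counts characters or tokens). `chunk_text` is a hypothetical stand-in for the text-splitting part only and does not track page numbers.

```python
def chunk_text(text, chunk_size=800, overlap=150):
    """Split text into character windows where each window repeats the
    last `overlap` characters of the previous one."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Synthetic 2000-character string, used only to show the window sizes.
sample = "".join(str(i % 10) for i in range(2000))
pieces = chunk_text(sample)
print(len(pieces), [len(p) for p in pieces])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.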
