
Indexing a Document Set

Embed every chunk and store the vectors in an index. This is the offline preparation step of any RAG system.

Ingestion Pipeline

The offline RAG pipeline has four steps:

  1. Load documents (PDF, HTML, MD)
  2. Chunk each document
  3. Embed each chunk
  4. Store the vectors + metadata in an index
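The four steps above can be sketched end to end with the standard library alone. This is a minimal illustration, not a production pipeline: `chunk_text`, `toy_embed`, and `build_index` are hypothetical names, the word-count chunker is one simple choice among many, and the hashed bag-of-words "embedding" is a stand-in assumption for a real embedding model. Loading (step 1) is assumed to have already happened, so the documents arrive as plain strings.

```python
import hashlib
import math


def chunk_text(text, max_words=50):
    """Step 2: split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [' '.join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


def toy_embed(text, dim=64):
    """Step 3 (stand-in): hashed bag-of-words vector, L2-normalized.

    A real system would call an embedding model here; this toy version
    only exists so the sketch runs without external dependencies.
    """
    vec = [0.0] * dim
    for word in text.lower().split():
        digest = hashlib.md5(word.encode('utf-8')).hexdigest()
        vec[int(digest, 16) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def build_index(documents, max_words=50):
    """Step 4: store each chunk's vector together with its metadata.

    documents maps a source name to its already-loaded text (step 1).
    The "index" is a plain list of dicts, one entry per chunk.
    """
    index = []
    for source, text in documents.items():
        for chunk_id, piece in enumerate(chunk_text(text, max_words)):
            index.append({
                'vector': toy_embed(piece),
                'text': piece,
                'source': source,
                'chunk_id': chunk_id,
            })
    return index
```

Keeping the source name and chunk position next to each vector is what later lets a retrieved chunk be traced back to the document it came from.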

Step 1: Load Documents

Use specialized loaders for different formats:

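# PDFs: join the text of every page (a page with no extractable text yields '')
from pypdf import PdfReader
reader = PdfReader('doc.pdf')
text = '\n'.join(page.extract_text() or '' for page in reader.pages)

# HTML: parse the markup and keep only the text ('doc.html' is a placeholder path)
from bs4 import BeautifulSoup
with open('doc.html', encoding='utf-8') as f:
    html = f.read()
text = BeautifulSoup(html, 'html.parser').get_text(separator='\n', strip=True)

# Markdown: just read the file
with open('doc.md', encoding='utf-8') as f:
    text = f.read()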
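When a document set mixes formats, the three loaders above can be wrapped in a single function that picks a loader from the file extension. The sketch below is one possible arrangement, not a prescribed API: `load_document` is a hypothetical helper name, the extension list is an assumption, and files are assumed to be UTF-8 encoded. The `pypdf` and `bs4` imports are deferred so that each library is needed only when a file of that type is actually loaded.

```python
from pathlib import Path


def load_document(path):
    """Return the plain text of a PDF, HTML, or Markdown file."""
    path = Path(path)
    suffix = path.suffix.lower()

    if suffix == '.pdf':
        from pypdf import PdfReader  # imported lazily: only needed for PDFs
        reader = PdfReader(str(path))
        # extract_text() can come back empty for scanned or image-only pages
        return '\n'.join(page.extract_text() or '' for page in reader.pages)

    if suffix in {'.html', '.htm'}:
        from bs4 import BeautifulSoup  # imported lazily: only needed for HTML
        html = path.read_text(encoding='utf-8')
        return BeautifulSoup(html, 'html.parser').get_text(separator='\n', strip=True)

    if suffix in {'.md', '.markdown'}:
        return path.read_text(encoding='utf-8')

    raise ValueError(f'Unsupported file type: {suffix!r}')
```

A call such as `load_document('doc.md')` returns the file's text as one string, ready to be passed to the chunking step. Raising on unknown extensions, rather than silently skipping them, makes gaps in the document set visible at indexing time.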
