Indexing a Document Set

Embed every chunk and store the vectors in an index — the offline preparation step of any RAG system.

Ingestion Pipeline

The offline RAG pipeline has four steps:

1. Load documents (PDF, HTML, MD)
2. Chunk each document
3. Embed each chunk
4. Store the vectors + metadata in an index
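
The four steps above can be sketched end to end in plain Python. This is only an illustration under stated assumptions: `chunk_text`, `toy_embed`, `InMemoryIndex`, and `ingest` are hypothetical names, the hashed bag-of-words "embedding" stands in for a real embedding model, and the list-backed index stands in for a real vector store.

```python
import hashlib
import math
from dataclasses import dataclass, field


def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Step 2: fixed-size character chunks with overlap (one simple strategy)."""
    step = size - overlap
    chunks = (text[i:i + size] for i in range(0, len(text), step))
    return [c for c in chunks if c.strip()]


def toy_embed(chunk: str, dim: int = 64) -> list[float]:
    """Step 3 placeholder: hashed bag of words, L2-normalized.

    Assumption: a real pipeline would call an embedding model here.
    """
    vec = [0.0] * dim
    for word in chunk.lower().split():
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


@dataclass
class InMemoryIndex:
    """Step 4 placeholder: parallel lists of vectors and metadata."""
    vectors: list[list[float]] = field(default_factory=list)
    metadata: list[dict] = field(default_factory=list)

    def add(self, vector: list[float], meta: dict) -> None:
        self.vectors.append(vector)
        self.metadata.append(meta)


def ingest(docs: dict[str, str], index: InMemoryIndex) -> InMemoryIndex:
    """Run chunk -> embed -> store over already-loaded documents (step 1 output)."""
    for source, text in docs.items():
        for n, chunk in enumerate(chunk_text(text)):
            meta = {"source": source, "chunk_id": n, "text": chunk}
            index.add(toy_embed(chunk), meta)
    return index
```

Keeping the chunk text and its source next to each vector is what lets a later retrieval step return readable passages rather than bare vectors.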

Step 1: Load Documents

Use a format-specific loader to turn each file into plain text:

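# PDFs: extract text page by page
from pypdf import PdfReader
reader = PdfReader('doc.pdf')
# extract_text() can yield nothing for image-only (scanned) pages, so fall back to ''
text = '\n'.join(page.extract_text() or '' for page in reader.pages)

# HTML: parse the markup and keep only the text content
from bs4 import BeautifulSoup
# 'doc.html' is a placeholder filename; html may equally come from an HTTP response
with open('doc.html', encoding='utf-8') as f:
    html = f.read()
text = BeautifulSoup(html, 'html.parser').get_text(separator=' ', strip=True)

# Markdown: just read the file
with open('doc.md', encoding='utf-8') as f:
    text = f.read()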
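
When a document set mixes formats, one option is to pick the loader from the file extension. The sketch below assumes that approach; `load_document`, `LOADERS`, and the per-format helpers are hypothetical names, not part of pypdf or Beautiful Soup. The third-party imports are done inside the helpers so that reading Markdown does not require the PDF or HTML libraries to be installed.

```python
from pathlib import Path


def load_pdf(path: Path) -> str:
    from pypdf import PdfReader  # lazy import: only needed for PDFs
    reader = PdfReader(str(path))
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def load_html(path: Path) -> str:
    from bs4 import BeautifulSoup  # lazy import: only needed for HTML
    html = path.read_text(encoding="utf-8")
    return BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)


def load_text(path: Path) -> str:
    # Markdown is already plain text, so read it as-is
    return path.read_text(encoding="utf-8")


# Hypothetical registry mapping file extensions to loader functions
LOADERS = {
    ".pdf": load_pdf,
    ".html": load_html,
    ".htm": load_html,
    ".md": load_text,
}


def load_document(path) -> dict:
    """Return the extracted text plus the source path as metadata."""
    path = Path(path)
    loader = LOADERS.get(path.suffix.lower())
    if loader is None:
        raise ValueError(f"No loader registered for {path.suffix!r}")
    return {"text": loader(path), "source": str(path)}
```

Returning the source path with the text means the later steps can attach it as metadata to every chunk from that file.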
