Indexing a Document Set
Embed every chunk and store the vectors in an index — the offline preparation step of any RAG system.
Ingestion Pipeline
The offline RAG pipeline has four steps:
- Load documents (PDF, HTML, MD)
- Chunk each document
- Embed each chunk
- Store the vectors + metadata in an index
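The four steps above can be sketched end to end with the standard library alone. This is only an illustration under stated assumptions, not a production pipeline: `chunk_text`, `toy_embed`, and `build_index` are hypothetical names, fixed-size character windows are just one possible chunking strategy, and the hashing-based `toy_embed` merely stands in for a real embedding model. Step 1 (loading) is assumed to have already produced plain text.

```python
import hashlib
import math

def chunk_text(text, size=200, overlap=50):
    """Step 2: split text into overlapping fixed-size character windows."""
    if not text:
        return []
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def toy_embed(chunk, dim=64):
    """Step 3 (placeholder): hash each token into a bucket, then L2-normalize.

    A real system would call an embedding model here instead.
    """
    vec = [0.0] * dim
    for token in chunk.lower().split():
        digest = hashlib.md5(token.encode('utf-8')).digest()
        vec[int.from_bytes(digest[:4], 'big') % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def build_index(docs):
    """Step 4: store each vector together with its text and metadata.

    `docs` maps a source name to its already-loaded text (step 1).
    The "index" here is a plain list; a vector database would replace it.
    """
    index = []
    for source, text in docs.items():
        for i, chunk in enumerate(chunk_text(text)):
            index.append({
                'vector': toy_embed(chunk),
                'text': chunk,
                'source': source,
                'chunk_id': i,
            })
    return index

docs = {'doc.md': 'RAG systems retrieve relevant chunks before generating an answer. ' * 10}
index = build_index(docs)
print(len(index), index[0]['source'], len(index[0]['vector']))
```

Keeping the source name and chunk position next to each vector is what later lets a retrieved chunk be traced back to its document.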
Step 1: Load Documents
Use specialized loaders for different formats:
# PDFs
from pypdf import PdfReader
text = ''
for page in PdfReader('doc.pdf').pages:
    # Separate pages with a newline so words at page boundaries do not merge
    text += page.extract_text() + '\n'
# HTML
from bs4 import BeautifulSoup
# html is assumed to already hold the page markup as a string
text = BeautifulSoup(html, 'html.parser').get_text()
# Markdown: just read the file
with open('doc.md', encoding='utf-8') as f:
    text = f.read()
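One way to tie the per-format loaders together is a small dispatcher keyed on the file extension. The sketch below is an assumption about how this might be organized, not part of any library: `load_pdf`, `load_html`, `load_text`, `LOADERS`, and `load_document` are hypothetical names. The third-party imports (`pypdf`, `bs4`) are done lazily so that only the formats actually used need to be installed.

```python
import tempfile
from pathlib import Path

def load_pdf(path):
    from pypdf import PdfReader  # third-party, imported only when needed
    # Image-only pages can yield empty text, hence the `or ''` guard
    return '\n'.join(page.extract_text() or '' for page in PdfReader(path).pages)

def load_html(path):
    from bs4 import BeautifulSoup  # third-party, imported only when needed
    html = Path(path).read_text(encoding='utf-8')
    return BeautifulSoup(html, 'html.parser').get_text()

def load_text(path):
    # Markdown is read as-is; UTF-8 is an assumed encoding
    return Path(path).read_text(encoding='utf-8')

# Hypothetical registry mapping file extensions to loader functions
LOADERS = {
    '.pdf': load_pdf,
    '.html': load_html,
    '.htm': load_html,
    '.md': load_text,
}

def load_document(path):
    """Pick a loader from the file extension and return the extracted text."""
    suffix = Path(path).suffix.lower()
    try:
        loader = LOADERS[suffix]
    except KeyError:
        raise ValueError(f'No loader registered for {suffix!r}') from None
    return loader(path)

# Usage: write a small Markdown file to a temporary directory and load it back
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / 'doc.md'
    sample.write_text('# Title\n\nBody text.', encoding='utf-8')
    loaded = load_document(sample)
print(loaded)
```

Raising on an unknown extension, rather than silently reading the file as text, keeps unsupported formats from entering the index unnoticed.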