Document Loading and Text Extraction
Load PDFs, Word documents, and plain text files using Python libraries, clean the extracted text, and prepare it for chunking and embedding.
Why Document Loading Is Non-Trivial
The first stage of any RAG pipeline is getting raw text out of your documents. This sounds simple, but each format has its own pitfalls. PDF files may store text as characters, as images, or as a mix of both. Word documents contain formatting markup you must strip. HTML pages include navigation menus and ads alongside the real content. A robust loader must handle all these cases and produce clean, coherent text for chunking.
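What counts as "clean, coherent text" depends on your corpus, but a minimal sketch using only the Python standard library might look like the following. The function name clean_text and the specific rules (Unicode normalization, rejoining words hyphenated across line breaks, collapsing whitespace) are illustrative assumptions rather than a fixed recipe.

```python
import re
import unicodedata


def clean_text(raw: str) -> str:
    """Normalize extracted text before chunking (illustrative rules only)."""
    # NFKC folds ligatures such as U+FB01 into plain "fi" and turns
    # non-breaking spaces into ordinary spaces.
    text = unicodedata.normalize("NFKC", raw)
    # Use a single newline convention.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Rejoin words hyphenated across a line break: "docu-\nment" -> "document".
    # Caveat: this also joins genuine compounds that happen to break at a hyphen.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into one space.
    text = re.sub(r"[ \t]+", " ", text)
    # Drop spaces that sit directly before or after a newline.
    text = re.sub(r" *\n *", "\n", text)
    # Keep at most one blank line between paragraphs.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


raw = "Docu-\nment  \ufb01le\r\n\n\n\nNext\tpara  "
print(repr(clean_text(raw)))
```

Paragraph breaks (blank lines) are kept on purpose, since later chunking steps often split on them. Tune or remove individual rules if they damage your documents, for example when hyphenated compounds are common.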
Loading PDFs with pypdf
pypdf is a widely used pure-Python library for reading PDF files that contain embedded text. It extracts text page by page, so the original page boundaries are preserved and can be stored as metadata. However, pypdf cannot read scanned PDFs (images of text); those require OCR. Always record the page number alongside each page's text so that your RAG answers can cite the exact page.
from pypdf import PdfReader

def load_pdf(file_path):
    """Return one record per non-empty page, with source and page metadata."""
    reader = PdfReader(file_path)
    pages = []
    # Number pages from 1 so citations use human-friendly page numbers
    for page_num, page in enumerate(reader.pages, start=1):
        text = page.extract_text()
        if text and text.strip():  # skip blank pages
            pages.append({
                'text': text,
                'metadata': {
                    'source': file_path,
                    'page': page_num,
                    'doc_type': 'pdf'
                }
            })
    return pages

docs = load_pdf('annual_report.pdf')
print(f'Loaded {len(docs)} non-empty pages')
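Because load_pdf skips pages with no text, a scanned PDF simply yields few or no records, with no warning. One way to surface this is to flag pages whose extracted text is empty or nearly so, as in the sketch below. The helper name pages_needing_ocr and the min_chars threshold are hypothetical, and the sample strings are invented for illustration.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return 1-based numbers of pages whose extracted text is (nearly) empty.

    page_texts: one entry per page, in order; entries may be None or "".
    min_chars:  hypothetical threshold; shorter text is treated as missing.
    """
    flagged = []
    for page_num, text in enumerate(page_texts, start=1):
        stripped = (text or "").strip()
        if len(stripped) < min_chars:
            flagged.append(page_num)
    return flagged


# Invented sample: page 1 has real text, pages 2-4 have little or none.
sample = [
    "Annual report: revenue grew across all regions this year.",
    "",
    None,
    "  7 ",
]
print(pages_needing_ocr(sample))  # [2, 3, 4]
```

With pypdf, the input list can be built as [page.extract_text() for page in PdfReader(file_path).pages]. A flagged page is only a candidate for OCR: genuinely blank pages and pages that hold just a page number are flagged too.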