AI Engineering Academy · Lesson

Document Loading and Text Extraction

Load PDFs, Word documents, and plain text files using Python libraries, clean the extracted text, and prepare it for chunking and embedding.

Why Document Loading Is Non-Trivial

The first stage of any retrieval-augmented generation (RAG) pipeline is getting raw text out of your documents. This sounds simple but is tricky in practice, because each format hides the text differently. A PDF may store text as characters, as page images, or as a mix of both. Word documents wrap the text in formatting markup that you must strip. HTML pages mix navigation menus and ads with the real content. A robust loader must handle all these cases and produce clean, coherent text for chunking.
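As a rough illustration of what "clean, coherent text" can mean, here is a minimal sketch that uses only the standard library. The helper names `clean_text` and `load_text_file` are hypothetical, and the specific cleaning rules are assumptions chosen for illustration, not a fixed recipe. In particular, the hyphen rule also joins genuine compounds that happen to break across a line, so adjust it for your corpus.

```python
import re
import unicodedata
from pathlib import Path


def clean_text(raw: str) -> str:
    """Normalize extracted text before chunking (hypothetical helper)."""
    # Fold compatibility characters (for example the "ﬁ" ligature) into plain forms.
    text = unicodedata.normalize("NFKC", raw)
    # Rejoin words hyphenated across a line break: "extrac-\ntion" -> "extraction".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs, then trim each line.
    text = re.sub(r"[ \t]+", " ", text)
    text = "\n".join(line.strip() for line in text.splitlines())
    # Keep paragraph breaks but squeeze three or more newlines down to two.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def load_text_file(file_path: str) -> dict:
    """Read a plain text file and return it in a text-plus-metadata shape."""
    # errors="replace" keeps loading even if a few bytes are not valid UTF-8.
    raw = Path(file_path).read_text(encoding="utf-8", errors="replace")
    return {
        "text": clean_text(raw),
        "metadata": {"source": file_path, "doc_type": "txt"},
    }
```

Returning text together with a small metadata dictionary keeps plain text files in the same shape as other loaders in a pipeline, so later stages do not need to know which format a chunk came from.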

Loading PDFs with pypdf

pypdf is a widely used pure-Python library for reading PDF files that contain embedded text. It extracts text page by page, which preserves the original page boundaries as useful metadata. pypdf does not perform OCR, so it cannot read scanned PDFs (images of text); those need an OCR step first. Always record the page number alongside the text so that your RAG answers can cite the exact page.

from pypdf import PdfReader

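def load_pdf(file_path):
    """Return one {'text', 'metadata'} dict per page that has extractable text."""
    reader = PdfReader(file_path)
    pages = []
    for page_num, page in enumerate(reader.pages, start=1):
        text = page.extract_text()
        if text and text.strip():  # skip pages with no extractable text (blank or scanned)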
            pages.append({
                'text': text,
                'metadata': {
                    'source': file_path,
                    'page': page_num,
                    'doc_type': 'pdf'
                }
            })
    return pages

docs = load_pdf('annual_report.pdf')
print(f'Loaded {len(docs)} non-empty pages')
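
Because the loader above silently skips pages with no extractable text, a fully scanned PDF would simply come back as an empty list. The sketch below shows one possible way to flag such pages so they can be routed to OCR. The function names and the `min_chars` threshold of 20 are assumptions made for illustration; the only pypdf calls used are `PdfReader`, `reader.pages`, and `page.extract_text()`, as in the loader above.

```python
def find_pages_without_text(page_texts, min_chars=20):
    """Return 1-based page numbers whose extracted text is shorter than min_chars."""
    suspect = []
    for page_num, text in enumerate(page_texts, start=1):
        # Treat None, empty strings, and whitespace-only text as "no text".
        if len((text or "").strip()) < min_chars:
            suspect.append(page_num)
    return suspect


def pages_needing_ocr(file_path, min_chars=20):
    """Run the check above on a PDF file (hypothetical wrapper around pypdf)."""
    # Imported here so find_pages_without_text stays usable without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(file_path)
    texts = [page.extract_text() for page in reader.pages]
    return find_pages_without_text(texts, min_chars)
```

If most or all pages of a file are flagged, the document is probably scanned and should go through OCR rather than this loader. A handful of flagged pages in an otherwise readable file are more likely blank pages or full-page figures.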
