AI Engineering Academy · Lesson

Document-Specific Strategies for Code and HTML

Apply specialized chunking for Python code using AST-based function splitters, for HTML using tag-aware parsers, and for Markdown using header hierarchy.

Why Generic Chunking Fails Specialized Docs

Text-based chunking was designed for prose, but real-world data includes source code, HTML pages, and Markdown documentation. Splitting code at a fixed character boundary can sever a function in the middle of its body, making the chunk useless for retrieval. Specialized documents need chunkers that understand their internal structure, not just their length.
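To make the failure concrete, here is a minimal sketch that splits a small Python source string every 60 characters and checks which chunks still parse. The sample source, the chunk size, and the helper names `fixed_size_chunks` and `is_valid_python` are illustrative assumptions, not part of any library.

```python
import ast

# Illustrative sample: two small, complete functions.
SOURCE = '''def add(a, b):
    """Return the sum of a and b."""
    total = a + b
    return total


def greet(name):
    """Return a greeting for name."""
    message = "Hello, " + name
    return message
'''


def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Naive splitter: cut every `size` characters, ignoring structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def is_valid_python(chunk: str) -> bool:
    """Return True if the chunk parses on its own as Python source."""
    try:
        ast.parse(chunk)
        return True
    except SyntaxError:  # IndentationError is a subclass of SyntaxError
        return False


chunks = fixed_size_chunks(SOURCE, 60)
broken = [c for c in chunks if not is_valid_python(c)]
print(f"{len(broken)} of {len(chunks)} chunks no longer parse as Python")
```

With this sample, the first cut lands inside the body of `add`, so the `def` line and the `return` statement end up in different chunks. A retriever that matches either chunk gets only half of the function.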

AST-Based Python Code Chunking

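The Abstract Syntax Tree (AST) of a Python file represents every function, class, and statement as a structured node with its source line range. By walking the AST you can extract each function or method as its own chunk, keeping the signature, docstring, and body together. Note that LangChain's PythonCodeTextSplitter does not parse an AST: it splits recursively on Python-specific separators such as class and def lines, which is cheaper but can still cut through a long function body.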

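import ast
import textwrap

def extract_functions(source_code: str) -> list[dict]:
    """Return one chunk per function or method (requires Python 3.8+)."""
    tree = ast.parse(source_code)
    lines = source_code.splitlines()  # split once, not once per node
    chunks = []
    # ast.walk visits every node, so methods and nested functions are included
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Start at the first decorator, if any, so it stays with its function
            start = min([node.lineno] + [d.lineno for d in node.decorator_list])
            code = '\n'.join(lines[start - 1:node.end_lineno])
            chunks.append({
                'name': node.name,
                'code': textwrap.dedent(code),  # strip method indentation
                'start_line': start,
            })
    return chunks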
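A chunk produced by `extract_functions` records only the bare function name, so two classes that each define a `run` method yield chunks that are hard to tell apart. The sketch below is one possible variation, not the lesson's own method: it walks the tree recursively, prefixes each method with its class name, and leaves nested helper functions inside their parent's chunk instead of emitting them twice. The name `extract_qualified_functions` and the sample source are hypothetical.

```python
import ast
import textwrap


def extract_qualified_functions(source_code: str) -> list[dict]:
    """Return one chunk per function or method, named like 'Class.method'."""
    lines = source_code.splitlines()
    chunks: list[dict] = []

    def visit(node: ast.AST, prefix: str) -> None:
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                # Include decorators so they stay with the function they modify.
                start = min([child.lineno] + [d.lineno for d in child.decorator_list])
                snippet = '\n'.join(lines[start - 1:child.end_lineno])
                chunks.append({
                    'name': prefix + child.name,
                    'code': textwrap.dedent(snippet),
                    'start_line': start,
                })
                # Do not descend: nested helpers remain part of this chunk.
            elif isinstance(child, ast.ClassDef):
                visit(child, prefix + child.name + '.')
            else:
                # Reach functions defined inside if/try/with blocks.
                visit(child, prefix)

    visit(ast.parse(source_code), '')
    return chunks


# Hypothetical sample input for illustration.
SAMPLE = '''
class Greeter:
    def hello(self, name):
        """Say hello."""
        return f"Hello, {name}"

def main():
    def helper():
        return 1
    return helper()
'''

for chunk in extract_qualified_functions(SAMPLE):
    print(chunk['name'], 'starts at line', chunk['start_line'])
```

Storing the qualified name alongside the code gives the retriever a small amount of context (which class a method belongs to) without embedding the whole class in every chunk.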
