Document-Specific Strategies for Code and HTML
Apply specialized chunking for Python code using AST-based function splitters, for HTML using tag-aware parsers, and for Markdown using header hierarchy.
Why Generic Chunking Fails Specialized Docs
Text-based chunking was designed for prose, but real-world data includes source code, HTML pages, and Markdown documentation. Splitting code at a fixed character boundary can sever a function in the middle of its body, making the chunk useless for retrieval. Specialized documents need chunkers that understand their internal structure, not just their length.
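To make the failure concrete, here is a minimal sketch that cuts a small Python snippet every 50 characters and checks whether each piece still parses on its own. The sample source, the window size, and the helper names `fixed_size_chunks` and `parses` are illustrative assumptions, not part of any library.

```python
import ast

# Hypothetical sample source used only for this illustration.
SOURCE = '''\
def load(path):
    with open(path) as handle:
        return handle.read()

def count_words(text):
    words = text.split()
    return len(words)
'''

def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Naive splitter: cut every `size` characters, ignoring structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def parses(chunk: str) -> bool:
    """Return True if the chunk is syntactically valid Python on its own."""
    try:
        ast.parse(chunk)
        return True
    except SyntaxError:  # IndentationError is a subclass of SyntaxError
        return False

pieces = fixed_size_chunks(SOURCE, size=50)
for piece in pieces:
    print(parses(piece), repr(piece[:30]))
```

With this sample and window size, the full source parses but none of the pieces does: the first cut lands inside the body of `load`, and the remaining pieces begin partway through a function. A retriever indexing these pieces would see fragments without the signature they belong to.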
AST-Based Python Code Chunking
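The Abstract Syntax Tree (AST) of a Python file represents the module as a tree of structured nodes, including one node for every function and class definition. By walking the AST you can extract each function or method as its own chunk, keeping the signature, docstring, and body together. Note that LangChain's PythonCodeTextSplitter does not parse the AST: it is a recursive character splitter configured with Python-specific separators such as class and def boundaries, so it approximates function-level splitting rather than guaranteeing it.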
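import ast

def extract_functions(source_code: str) -> list[dict]:
    """Return one chunk per function or method found in source_code."""
    tree = ast.parse(source_code)
    # Split once, outside the loop, instead of once per function.
    lines = source_code.splitlines()
    chunks = []
    # ast.walk visits every node, so methods and nested functions are included.
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            start = node.lineno - 1  # lineno is 1-based
            end = node.end_lineno    # end_lineno requires Python 3.8+
            chunks.append({
                'name': node.name,
                'code': '\n'.join(lines[start:end]),
                'start_line': node.lineno,
            })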
    return chunks
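The walker above visits every function node, so a nested function shows up both inside its parent's chunk and as a chunk of its own, and on Python 3.8+ decorator lines above a `def` are not captured because `lineno` points at the `def` line. One possible variant, sketched below under those assumptions, iterates only over top-level statements and starts each chunk at its first decorator. The function name `extract_top_level_chunks`, the `kind` field, and the sample source are hypothetical choices for illustration.

```python
import ast

# Hypothetical sample source used only for this illustration.
SAMPLE = '''\
import functools

@functools.lru_cache(maxsize=None)
def fib(n):
    """Return the n-th Fibonacci number."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)

class Greeter:
    def greet(self, name):
        return f"Hello, {name}"
'''

def extract_top_level_chunks(source_code: str) -> list[dict]:
    """Return one chunk per top-level function or class, decorators included."""
    tree = ast.parse(source_code)
    lines = source_code.splitlines()
    chunks = []
    # tree.body holds top-level statements only, so nested definitions
    # stay inside their parent's chunk instead of being duplicated.
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Start at the earliest decorator line if there is one.
            first_line = min([node.lineno] + [d.lineno for d in node.decorator_list])
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,
                "code": "\n".join(lines[first_line - 1:node.end_lineno]),
                "start_line": first_line,
            })
    return chunks

chunks = extract_top_level_chunks(SAMPLE)
for chunk in chunks:
    print(chunk["kind"], chunk["name"], chunk["start_line"])
```

Keeping a whole class in one chunk preserves context but can produce chunks that are too large to embed well; a reasonable follow-up would be to split oversized classes into per-method chunks while carrying the class name along as metadata.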