AI Engineering Academy · Lesson

Implementing BM25 Keyword Search

Set up BM25 using rank_bm25 in Python, index your document corpus, and run keyword searches that handle exact terms, technical jargon, and product names reliably.

Installing rank_bm25

rank_bm25 is a lightweight Python library that implements three variants of the BM25 ranking function: BM25Okapi, BM25L, and BM25Plus. It requires no external services, runs entirely in memory, and can index thousands of documents in seconds on commodity hardware. Install it with pip install rank-bm25 (the package name uses a hyphen, the import name an underscore) and you can build keyword search without any infrastructure setup.

# Install: pip install rank-bm25
from rank_bm25 import BM25Okapi

# BM25Okapi is the most common variant
# BM25L and BM25Plus adjust the length normalization so that very long
# documents are not over-penalized
# For most RAG use cases BM25Okapi is the right choice

corpus = [
    'Python decorator pattern explained with examples',
    'How to use context managers in Python',
    'JavaScript async await tutorial',
]
tokenized = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized)
print('Index built with', len(corpus), 'documents')

Tokenization: The Critical First Step

BM25 operates on token lists, not raw strings, so the quality of your tokenization directly affects retrieval quality. Plain whitespace splitting leaves punctuation attached to words, keeps stop words, and treats morphological variants as unrelated terms. For production systems, use a tokenizer that lowercases text, removes punctuation, strips stop words, and optionally applies stemming so that variants like 'run', 'runs', and 'running' match. Apply the same tokenizer to queries as to documents; otherwise query terms will not line up with the indexed tokens.

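# Install: pip install nltk
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# One-time download of the stop word list (skipped if already present)
nltk.download('stopwords', quiet=True)

STOP_WORDS = set(stopwords.words('english'))
stemmer = PorterStemmer()

def tokenize(text: str) -> list[str]:
    # Lowercase, then replace everything except letters, digits, and whitespace
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s]', ' ', text)
    tokens = text.split()
    # Drop stop words and single-character tokens
    tokens = [t for t in tokens if t not in STOP_WORDS and len(t) > 1]
    # Reduce each token to its stem so morphological variants match
    tokens = [stemmer.stem(t) for t in tokens]
    return tokens

print(tokenize('Running Python decorators efficiently in production!'))
# ['run', 'python', 'decor', 'effici', 'product']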
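
One caveat for technical jargon and product names: the tokenizer above replaces every non-alphanumeric character with a space, so 'C++' becomes 'c' and is then dropped by the length filter, and 'Node.js' splits into 'node' and 'js'. If exact matches on such terms matter, one option is a tokenizer that keeps a few symbol characters inside tokens. The following is a minimal sketch using only the standard library; the function name tokenize_technical and the TOKEN_RE pattern are hypothetical, and stop word removal and stemming are left out to keep the example short.

```python
import re

# Hypothetical pattern: a letter or digit, followed by any mix of letters,
# digits, and the characters + # . - so that tokens such as "c++",
# "node.js", and "gpt-4" survive as single tokens.
TOKEN_RE = re.compile(r'[a-z0-9][a-z0-9+#.\-]*')

def tokenize_technical(text: str) -> list[str]:
    tokens = TOKEN_RE.findall(text.lower())
    # Trim trailing sentence punctuation ("gpt-4." -> "gpt-4")
    # while keeping trailing "+" and "#" ("c++", "c#").
    tokens = [t.rstrip('.-') for t in tokens]
    return [t for t in tokens if t]

print(tokenize_technical('Migrating from Node.js to C++ and GPT-4.'))
# ['migrating', 'from', 'node.js', 'to', 'c++', 'and', 'gpt-4']
```

As with any tokenizer, the same function would need to be applied to both the corpus and the queries. Whether preserving these symbols helps depends on the corpus, so it is worth checking against real queries before adopting it.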
