Bag of Words: CountVectorizer and TfidfVectorizer
Learners will tokenise text, build a vocabulary, convert documents to count vectors, and apply TF-IDF weighting to downweight common words.
The Text-to-Numbers Problem
Machine learning models require numerical inputs, but raw text is a variable-length sequence of characters with no fixed numeric structure. Converting documents into something a model can consume therefore needs a systematic encoding. The Bag of Words (BoW) model is one of the simplest and most common encodings: it treats a document as an unordered collection of words, ignoring grammar and word order, and counts how many times each word appears. The result is a numeric vector with one entry per word in the vocabulary. Despite its simplicity, BoW is often a strong baseline for text classification tasks such as spam filtering and sentiment analysis.
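To make the "ignoring word order" point concrete, here is a minimal sketch using only the standard library. The helper name `bag_of_words` is hypothetical, and lowercasing plus whitespace splitting is an assumed, deliberately naive tokeniser.

```python
from collections import Counter

def bag_of_words(text):
    """Count words in a lowercased, whitespace-split string (naive tokeniser)."""
    return Counter(text.lower().split())

a = bag_of_words("dog bites man")
b = bag_of_words("man bites dog")

# Same words, different order: the two bags are identical,
# so a BoW model cannot tell these sentences apart.
print(a == b)
```

This is the trade-off BoW makes: a fixed-size numeric representation in exchange for losing who did what to whom.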
# Bag of Words: ignore order, just count words
doc1 = 'the cat sat on the mat'
doc2 = 'the cat ate the rat'
# Vocabulary: all unique words across documents
vocab = sorted(set(doc1.split() + doc2.split()))
print('Vocabulary:', vocab)
# Count vectors
vec1 = [doc1.split().count(w) for w in vocab]
vec2 = [doc2.split().count(w) for w in vocab]
print('doc1 vector:', vec1)
print('doc2 vector:', vec2)
CountVectorizer: Building the Vocabulary
Scikit-learn's CountVectorizer automates the Bag of Words process. With its default settings it: (1) lowercases each document and extracts tokens of two or more alphanumeric characters, treating punctuation and whitespace as separators (so single-character words such as "I" and "a" are dropped), (2) builds a vocabulary from all unique tokens seen during fit(), and (3) converts each document to a sparse vector of word counts. The result is a document-term matrix where rows are documents and columns are vocabulary words. A sparse format is used because each document contains only a small fraction of the vocabulary, so most entries are zero.
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
'I love Python and machine learning',
'Machine learning is awesome',
'Python is great for data science',
'I hate bugs in Python code'
]
vec = CountVectorizer()
X = vec.fit_transform(corpus) # Returns sparse matrix
print('Vocabulary size:', len(vec.vocabulary_))
print('Matrix shape:', X.shape) # (4 docs, N vocabulary words)
print('Vocabulary:', sorted(vec.vocabulary_.keys()))