Machine Learning Academy · Lesson

Bag of Words: CountVectorizer and TfidfVectorizer

Learners will tokenise text, build a vocabulary, convert documents to count vectors, and apply TF-IDF weighting to downweight common words.

The Text-to-Numbers Problem

Machine learning models require numerical inputs, but raw text is a variable-length sequence of characters with no fixed numeric structure. Converting documents into a format a model can use therefore requires a systematic approach. The Bag of Words (BoW) model is one of the simplest and most widely used: it treats a document as an unordered collection of words, ignoring grammar and word order, and counts how many times each word appears. The result is a numeric vector with one entry per word in the vocabulary, so every document maps to a vector of the same length. Despite its simplicity, BoW is often a strong baseline for text classification tasks such as spam filtering and sentiment analysis.

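# Bag of Words: ignore order, just count words
doc1 = 'the cat sat on the mat'
doc2 = 'the cat ate the rat'

# Tokenise once by splitting on whitespace
tokens1 = doc1.split()
tokens2 = doc2.split()

# Vocabulary: all unique words across documents, sorted for a stable column order
vocab = sorted(set(tokens1 + tokens2))
print('Vocabulary:', vocab)
# ['ate', 'cat', 'mat', 'on', 'rat', 'sat', 'the']

# Count vectors: one count per vocabulary word, in vocabulary order
vec1 = [tokens1.count(w) for w in vocab]
vec2 = [tokens2.count(w) for w in vocab]
print('doc1 vector:', vec1)  # [0, 1, 1, 1, 0, 1, 2]
print('doc2 vector:', vec2)  # [1, 1, 0, 0, 1, 0, 2]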
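To make the "ignores word order" property concrete, here is a minimal sketch using only the standard library. The helper name bag_of_words and the two example sentences are illustrative assumptions, not part of the lesson's code.

```python
from collections import Counter

def bag_of_words(text, vocab):
    """Return the count vector of `text` over a fixed, ordered vocabulary."""
    counts = Counter(text.lower().split())
    # Counter returns 0 for words that do not occur in the text
    return [counts[word] for word in vocab]

# Two sentences with the same words in a different order (hypothetical examples)
a = 'the dog bit the man'
b = 'the man bit the dog'

vocab = sorted(set(a.split()) | set(b.split()))
vec_a = bag_of_words(a, vocab)
vec_b = bag_of_words(b, vocab)

print('Vocabulary:', vocab)
print('a:', vec_a)
print('b:', vec_b)
print('Identical vectors:', vec_a == vec_b)
```

Both sentences map to the same vector even though they mean different things. That is the information a Bag of Words representation discards.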

CountVectorizer: Building the Vocabulary

Scikit-learn's CountVectorizer automates the Bag of Words process. With its default settings it: (1) lowercases each document and tokenises it by extracting runs of two or more alphanumeric characters, so whitespace and punctuation act as separators and single-character words such as 'I' are dropped, (2) builds a vocabulary from all unique tokens seen during fit(), and (3) converts each document to a sparse vector of word counts. The result is a document-term matrix where rows are documents and columns are vocabulary words. A sparse format is used because each document contains only a small fraction of the vocabulary, so most entries are zero.
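The tokenisation step can be inspected on its own. The sketch below assumes default CountVectorizer settings and uses a made-up example sentence. build_analyzer() returns the callable the vectorizer applies to each document.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the preprocessing + tokenising callable
# that CountVectorizer applies to each document (default settings here).
analyzer = CountVectorizer().build_analyzer()

# Hypothetical sentence chosen to show lowercasing, punctuation handling,
# and what happens to single-character words.
tokens = list(analyzer('I love Python, and I love machine-learning!'))
print(tokens)
```

With the defaults, the text is lowercased, punctuation and the hyphen act as separators, and the one-letter word 'I' does not appear in the output. Parameters such as lowercase and token_pattern change this behaviour.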

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'I love Python and machine learning',
    'Machine learning is awesome',
    'Python is great for data science',
    'I hate bugs in Python code'
]

vec = CountVectorizer()
X = vec.fit_transform(corpus)  # Returns sparse matrix

print('Vocabulary size:', len(vec.vocabulary_))  # 15: tokens are lowercased and 'I' is dropped
print('Matrix shape:', X.shape)  # (4, 15): 4 documents x 15 vocabulary words
print('Vocabulary:', sorted(vec.vocabulary_.keys()))
