Pandas & NumPy Academy · Lesson

Standardising Inconsistent Categories

Normalise free-text category columns by mapping aliases, correcting typos with fuzzy matching, and enforcing a canonical list.

The Problem of Inconsistent Categories

When category data is entered by humans, the same concept appears under many different spellings: Electronics, electronics, ELECTRONICS, Electronicss. Because groupby matches strings exactly, each variant becomes its own group, so what should be one meaningful group is split into several small ones with fragmented counts and totals. Standardising categories into a canonical list is therefore one of the most important cleaning steps before any aggregation or machine learning feature creation.

import pandas as pd

# Load the raw data and list the 20 most frequent raw category values
df = pd.read_csv('products.csv')
print(df['category'].value_counts().head(20))
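
To see the fragmentation without products.csv on hand, here is a minimal sketch on a small made-up DataFrame. The name df_demo, the price column and all values are assumptions chosen for illustration only.

```python
import pandas as pd

# Hypothetical sample data: five rows that all mean "Electronics"
df_demo = pd.DataFrame({
    'category': ['Electronics', 'electronics', 'ELECTRONICS',
                 'Electronicss', ' Electronics '],
    'price': [100, 200, 300, 400, 500],
})

# groupby matches strings exactly, so every spelling becomes its own group
totals = df_demo.groupby('category')['price'].sum()
print(totals)
print('Groups:', len(totals))
```

Each of the five spellings ends up in a separate group, so no single row of the result carries the full total for the category.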

Normalising Case and Whitespace

The first and easiest standardisation step is normalising case and whitespace. Apply .str.strip().str.lower() to remove leading and trailing whitespace and convert everything to lowercase before any other comparison. This single step collapses many variants: Electronics, electronics, and ' Electronics ' all become electronics after normalisation. It does not repair misspellings such as Electronicss, which still need alias mapping or fuzzy matching.

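# Strip surrounding whitespace, then convert to lowercase
df['category_clean'] = df['category'].str.strip().str.lower()

# Compare the number of distinct values before and after normalisation
print('Before:', df['category'].nunique())
print('After:', df['category_clean'].nunique())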
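
The same steps can be tried on in-memory data. This is a minimal sketch, assuming a hypothetical list of raw values; the extra str.replace call that collapses internal runs of whitespace is an addition beyond the strip and lower steps described above.

```python
import pandas as pd

# Hypothetical raw values with case, whitespace and typo variants
raw = pd.Series(['Electronics', 'electronics', ' Electronics ', 'ELECTRONICS',
                 'Home  Garden', 'home garden', 'Electronicss'])

clean = (
    raw.str.strip()                           # drop leading/trailing whitespace
       .str.lower()                           # fold everything to lowercase
       .str.replace(r'\s+', ' ', regex=True)  # assumed extra step: collapse internal whitespace
)

print('Before:', raw.nunique())
print('After:', clean.nunique())
print(sorted(clean.unique()))
```

The misspelt electronicss survives as its own value: case and whitespace normalisation cannot fix typos, which is where alias mapping or fuzzy matching against the canonical list comes in.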
