Standardising Inconsistent Categories
Normalise free-text category columns by mapping aliases, correcting typos with fuzzy matching, and enforcing a canonical list.
The Problem of Inconsistent Categories
When category data is entered by hand, the same concept appears under many spellings: Electronics, electronics, ELECTRONICS, Electronicss. A groupby on this column then splits what should be one group into several small ones, so counts and aggregates are spread across the variants. Standardising categories into a canonical list should therefore come before any aggregation or machine learning feature creation.
import pandas as pd
df = pd.read_csv('products.csv')
print(df['category'].value_counts().head(20))
Normalising Case and Whitespace
The first and easiest standardisation step is to normalise case and whitespace. Apply .str.strip().str.lower() to remove leading and trailing spaces and convert everything to lowercase before making any other comparison. This single step collapses many variants: Electronics, electronics, and ' Electronics ' all become electronics. It does not fix misspellings such as Electronicss, which need a separate correction step.
df['category_clean'] = df['category'].str.strip().str.lower()
print('Before:', df['category'].nunique())
print('After:', df['category_clean'].nunique())
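To try the case and whitespace step without a products.csv file, the following is a minimal sketch on a small in-memory Series. The sample values and the variable names (raw, clean, n_before, n_after, remaining) are made up for illustration and are not taken from the lesson's dataset.

```python
import pandas as pd

# Hypothetical sample data: illustrative variants of one category,
# not values read from products.csv.
raw = pd.Series(
    ["Electronics", "electronics", "ELECTRONICS", " Electronics ", "Electronicss"],
    name="category",
)

# Strip leading/trailing whitespace, then lowercase every value.
clean = raw.str.strip().str.lower()

# Count distinct values before and after normalisation.
n_before = raw.nunique()
n_after = clean.nunique()

# List what is left so any surviving variants are visible.
remaining = sorted(clean.unique())

print("Before:", n_before)
print("After:", n_after)
print("Remaining values:", remaining)
```

In this sample the case and whitespace variants merge into one value, but the misspelling survives as its own category. That leftover is what the later steps named in the lesson summary (alias mapping and fuzzy matching) are meant to handle.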