Encoding Categorical Variables: OrdinalEncoder and OneHotEncoder
Learners will convert ordinal categories to integers and nominal categories to one-hot vectors, handling unknown categories at inference time.
Why Categorical Encoding Is Required
Machine learning models operate on numbers, not strings. A column containing values like 'red', 'green', 'blue' cannot be passed directly to most scikit-learn estimators, so you must convert categories to numeric representations before training. The choice of encoding matters: a poor encoding can introduce spurious ordinal relationships or explode dimensionality. This lesson covers two core scikit-learn encoders: OrdinalEncoder for ordered categories and OneHotEncoder for unordered ones.
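As a quick illustration of the point above, the following sketch tries to fit a model on raw strings. The toy arrays and the choice of LinearRegression are assumptions made for this example only; the exact error message may differ between scikit-learn versions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: one string-valued feature and a numeric target.
X = np.array([["red"], ["blue"], ["green"], ["red"]], dtype=object)
y = np.array([10.5, 20.0, 15.0, 18.5])

try:
    LinearRegression().fit(X, y)
    fit_failed = False
except ValueError as exc:
    # scikit-learn attempts to convert the input to floats and cannot parse 'red'.
    fit_failed = True
    print(f"fit failed: {exc}")
```

The estimator rejects the string column at fit time, which is why an encoding step has to come first.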
import pandas as pd
df = pd.DataFrame({
    'size': ['small', 'medium', 'large', 'medium'],
    'color': ['red', 'blue', 'green', 'red'],
    'price': [10.5, 20.0, 15.0, 18.5]
})
print(df.dtypes)
# size and color are 'object' (string) and must be encoded before modelling
Ordinal vs Nominal Categories
Not all categorical variables are equal. Ordinal categories have a meaningful order: small < medium < large, or bad < fair < good < excellent. Nominal categories have no intrinsic order: red, blue, green are just labels. Choosing the wrong encoding creates false relationships — for example, integer-encoding unordered colors as 0, 1, 2 tells the model that blue (1) is somehow between red (0) and green (2), which is meaningless. Always identify whether a variable is ordinal or nominal before encoding.
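For the ordinal case, a minimal sketch with OrdinalEncoder might look like the following. The toy size values and the choice of -1 as the code for unseen categories are assumptions for illustration. The sketch also shows that leaving the category order to the encoder sorts the labels alphabetically, which does not match the real size order.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical training column with an ordinal feature.
X_train = np.array([["small"], ["medium"], ["large"], ["medium"]], dtype=object)

# Default behaviour: categories are sorted alphabetically
# (large=0, medium=1, small=2), so the codes ignore the real order.
default_enc = OrdinalEncoder()
default_codes = default_enc.fit_transform(X_train).ravel()

# Passing the order explicitly gives small=0, medium=1, large=2.
# Unseen categories at inference time are mapped to -1 (an assumed sentinel).
size_order = ["small", "medium", "large"]
ordered_enc = OrdinalEncoder(
    categories=[size_order],
    handle_unknown="use_encoded_value",
    unknown_value=-1,
)
ordered_codes = ordered_enc.fit_transform(X_train).ravel()

# 'extra large' was never seen during fit, so it receives the sentinel code.
X_new = np.array([["large"], ["extra large"]], dtype=object)
new_codes = ordered_enc.transform(X_new).ravel()

print(default_codes)  # codes from alphabetical ordering
print(ordered_codes)  # codes from the explicit ordering
print(new_codes)      # known value followed by the unknown sentinel
```

Whether -1 is a sensible sentinel depends on the downstream model; a linear model would treat it as "smaller than small", so this choice deserves some thought in practice.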
# Ordinal: clear ordering
ordinal_example = ['low', 'medium', 'high', 'very high']
# Nominal: no meaningful ordering
nominal_example = ['cat', 'dog', 'bird']
# Wrong approach for nominal: integer encoding implies order
# 0=cat, 1=dog, 2=bird -- model thinks dog is between cat and bird
# Correct approach for nominal: one-hot encoding
# cat -> [1, 0, 0]
# dog -> [0, 1, 0]
# bird -> [0, 0, 1]
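The one-hot idea sketched in the comments above can be tried with OneHotEncoder. This is a minimal sketch on assumed toy color data, using handle_unknown="ignore" so that a category unseen during fitting does not raise an error at inference time. Note that the encoder orders its columns alphabetically by category, so the column positions follow the sorted labels rather than the order in which they first appear.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training column with a nominal feature.
X_train = np.array([["red"], ["blue"], ["green"], ["red"]], dtype=object)

# handle_unknown="ignore" encodes unseen categories as an all-zero row
# instead of raising an error during transform.
encoder = OneHotEncoder(handle_unknown="ignore")

# The default output is a sparse matrix; convert it to a dense array for display.
onehot_train = encoder.fit_transform(X_train).toarray()
learned_categories = encoder.categories_[0].tolist()

# 'purple' was not present during fit.
X_new = np.array([["green"], ["purple"]], dtype=object)
onehot_new = encoder.transform(X_new).toarray()

print(learned_categories)  # column order used by the encoder
print(onehot_train)
print(onehot_new)          # second row is all zeros for the unknown category
```

An all-zero row is indistinguishable from "none of the known categories", which is usually acceptable for rare unseen values but worth monitoring if unknowns become frequent.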