Machine Learning Academy · Lesson

ColumnTransformer Inside a Pipeline

Learners will nest a ColumnTransformer inside a Pipeline to preprocess mixed numeric and categorical columns, so that heterogeneous raw data can be passed in directly without splitting it by hand.

The Mixed-Data Problem

Real-world tabular datasets almost always contain a mix of numeric columns (age, salary, temperature) and categorical columns (city, product category, gender). Each type needs different preprocessing: numeric columns typically need imputation or scaling, while categorical columns need encoding. ColumnTransformer lets you apply a different transformer to each subset of columns and combines the results into a single feature matrix.

ColumnTransformer: The Basic Structure

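ColumnTransformer takes a list of (name, transformer, columns) triples. The columns entry can be a list of column names (which requires a pandas DataFrame as input), a list of integer indices, a slice, a boolean mask, or a callable such as the one returned by make_column_selector. After transformation, the outputs of all transformers are concatenated horizontally, in the order the triples are listed. Columns not assigned to any transformer are dropped by default (remainder='drop'); pass remainder='passthrough' to keep them unchanged.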
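The callable form of column selection can be sketched as follows. This is a minimal illustration, not the lesson's own code: the DataFrame `df` and its values are invented for the example, and the name `ct_by_dtype` is hypothetical. It selects columns by dtype with make_column_selector rather than by name, then checks the shape of the concatenated output.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data, invented for illustration only.
df = pd.DataFrame({
    'age': [25, 32, 47, 51],
    'salary': [40000.0, 52000.0, 88000.0, 61000.0],
    'city': ['Paris', 'Lyon', 'Paris', 'Nice'],
})

# Each selector is a callable that receives the DataFrame at fit time
# and returns the names of the matching columns.
numeric_columns = make_column_selector(dtype_include=np.number)
other_columns = make_column_selector(dtype_exclude=np.number)

ct_by_dtype = ColumnTransformer([
    ('num', StandardScaler(), numeric_columns),
    ('cat', OneHotEncoder(handle_unknown='ignore'), other_columns),
])

# 2 scaled numeric columns + 3 one-hot columns (one per distinct city).
features = ct_by_dtype.fit_transform(df)
print(features.shape)  # (4, 5)
```

Selecting by dtype avoids hard-coding column names, at the cost of being less explicit: a numeric column that is really an identifier or a category code would be scaled along with the genuine measurements.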

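from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Each triple is (name, transformer, columns).
ct = ColumnTransformer([
    # Standardize the numeric columns to zero mean and unit variance.
    ('num', StandardScaler(), ['age', 'salary']),
    # One-hot encode the categorical columns; categories unseen during
    # fit are encoded as all zeros instead of raising an error.
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['city', 'education'])
])

# The transformer names are available before fitting.
print('Transformers:', [name for name, _, _ in ct.transformers])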
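To connect this with the lesson objective, the sketch below nests the same ColumnTransformer as the first step of a Pipeline. Several things here are assumptions made for illustration: the toy DataFrame `raw`, the labels in `target`, the step names 'preprocess' and 'clf', and the choice of LogisticRegression as the final estimator. Any scikit-learn estimator could take its place.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data and labels, invented for illustration only.
raw = pd.DataFrame({
    'age': [25, 32, 47, 51, 38, 29],
    'salary': [40000, 52000, 88000, 61000, 75000, 43000],
    'city': ['Paris', 'Lyon', 'Paris', 'Nice', 'Lyon', 'Nice'],
    'education': ['BSc', 'MSc', 'PhD', 'BSc', 'MSc', 'BSc'],
})
target = [0, 0, 1, 1, 1, 0]

preprocess = ColumnTransformer([
    ('num', StandardScaler(), ['age', 'salary']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['city', 'education']),
])

# The ColumnTransformer is an ordinary pipeline step: fit() learns the
# scaling statistics and categories, then trains the classifier on the
# concatenated feature matrix.
model = Pipeline([
    ('preprocess', preprocess),
    ('clf', LogisticRegression()),  # assumed estimator, swap in any other
])
model.fit(raw, target)

# New raw rows go straight in. 'Berlin' was not seen during fit, so
# handle_unknown='ignore' encodes its city columns as all zeros.
new_rows = pd.DataFrame({
    'age': [41], 'salary': [67000], 'city': ['Berlin'], 'education': ['MSc'],
})
predictions = model.predict(new_rows)
print(predictions)
```

Because preprocessing lives inside the pipeline, the scaler and encoder are fitted only on the data passed to fit(), which keeps statistics from a validation or test split from leaking into training.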
