Combining Steps with ColumnTransformer
Learners will apply different preprocessing steps to different column types simultaneously using ColumnTransformer, producing a single clean feature matrix.
The Problem ColumnTransformer Solves
Real datasets contain a mix of column types: some are numeric, some are ordinal, and some are nominal categorical. Applying the same transformer to every column would be incorrect: standardizing a one-hot column with StandardScaler discards its 0/1 interpretation, and passing a continuous number through OrdinalEncoder treats every distinct value as its own category. Before ColumnTransformer, data scientists had to manually slice DataFrames, apply different transformers, and re-concatenate the results, a fragile process that invites leakage when a transformer is accidentally fit on the full dataset instead of the training split. ColumnTransformer solves this by applying different transformations to different column subsets inside a single, pipeline-compatible object.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'age': [25, 35, 45, 28],
    'income': [40000, 80000, 120000, 55000],
    'education': ['bachelor', 'master', 'PhD', 'high school'],
    'city': ['NYC', 'LA', 'NYC', 'Chicago']
})
# Need: StandardScaler for age + income
#       OrdinalEncoder for education
#       OneHotEncoder for city
# ColumnTransformer applies all three at once
ColumnTransformer Syntax and Structure
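A ColumnTransformer is constructed from a list of (name, transformer, columns) tuples. The name is a string identifier used to look up the fitted transformer (through named_transformers_) and to set its parameters with the name__parameter syntax. The transformer is any scikit-learn estimator with fit and transform, or the string 'drop' or 'passthrough'. The columns can be a list of column names (for DataFrames), integer indices (for arrays), a slice, a boolean mask, or a callable that selects columns. Each transformer is fit independently on its own column subset (set n_jobs to actually run them in parallel), and the outputs are concatenated column-wise, in the order the tuples are listed, into a single matrix. Columns not named in any tuple are dropped by default (remainder='drop').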
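The non-name column specifiers can be sketched on a plain NumPy array. This is a minimal illustration, not part of the lesson's dataset: the array X_arr, the step names, and the choice of MinMaxScaler are assumptions made only to show an integer index, a boolean mask, and the effect of remainder.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Hypothetical array with three columns: age, income, and a 0/1 flag.
X_arr = np.array([
    [25.0, 40000.0, 1.0],
    [35.0, 80000.0, 0.0],
    [45.0, 120000.0, 1.0],
])

transformers = [
    # Integer index: standardize column 0.
    ('scale', StandardScaler(), [0]),
    # Boolean mask: min-max scale column 1 only.
    ('minmax', MinMaxScaler(), np.array([False, True, False])),
]

# Default remainder='drop': column 2 is not listed, so it is discarded.
dropped = ColumnTransformer(transformers).fit_transform(X_arr)

# remainder='passthrough': column 2 is appended unchanged after the
# transformed columns.
kept = ColumnTransformer(transformers, remainder='passthrough').fit_transform(X_arr)

print(dropped.shape)  # (3, 2)
print(kept.shape)     # (3, 3)
```

Integer indices and masks are the only option when the input is an array rather than a DataFrame, since there are no column names to refer to.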
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
ct = ColumnTransformer([
    ('scale', StandardScaler(), ['age', 'income']),
    ('ordinal', OrdinalEncoder(
        categories=[['high school', 'bachelor', 'master', 'PhD']]
    ), ['education']),
    ('ohe', OneHotEncoder(drop='first', sparse_output=False), ['city'])
])
# Output: scaled numeric columns + ordinal codes (floats) + one-hot columns,
# concatenated in that order