Machine Learning Academy · Lesson

Combining Steps with ColumnTransformer

Learners will apply different preprocessing steps to different column types simultaneously using ColumnTransformer, producing a single clean feature matrix.

The Problem ColumnTransformer Solves

Real datasets contain a mix of column types: some are numeric, some are ordinal, and some are nominal categorical. Applying the same transformer to every column would be incorrect: standard scaling is not meaningful for a one-hot column, and ordinal encoding is not meaningful for a continuous number. Before ColumnTransformer, data scientists had to slice DataFrames by hand, fit a separate transformer on each slice, and re-concatenate the results. That process is fragile, and it invites data leakage when a transformer is accidentally fitted on data outside the training set. ColumnTransformer solves this by applying different transformations to different column subsets inside a single, pipeline-compatible object.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'age':       [25, 35, 45, 28],
    'income':    [40000, 80000, 120000, 55000],
    'education': ['bachelor', 'master', 'PhD', 'high school'],
    'city':      ['NYC', 'LA', 'NYC', 'Chicago']
})

# Need: StandardScaler for age+income
#       OrdinalEncoder for education
#       OneHotEncoder for city
# ColumnTransformer applies all three at once

ColumnTransformer Syntax and Structure

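A ColumnTransformer is constructed from a list of (name, transformer, columns) tuples. The name is a string identifier used for parameter access (for example, scale__with_mean in set_params) and as a prefix in output feature names. The transformer is any scikit-learn estimator that implements fit and transform, or one of the strings 'passthrough' and 'drop'. The columns can be a list of column names (for DataFrames), integer indices (for arrays), a slice, a boolean mask, or a callable that selects columns. Each transformer is fitted independently on its own column subset (optionally in parallel via n_jobs), and the outputs are concatenated column-wise into a single matrix in the order the transformers are listed. Columns not assigned to any transformer are dropped by default; pass remainder='passthrough' to keep them.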

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder

ct = ColumnTransformer([
    ('scale', StandardScaler(), ['age', 'income']),
    ('ordinal', OrdinalEncoder(
        categories=[['high school', 'bachelor', 'master', 'PhD']]
    ), ['education']),
    ('ohe', OneHotEncoder(drop='first', sparse_output=False), ['city'])
])

# Output of ct.fit_transform(df): scaled numeric columns, ordinal codes
# (floats by default), and one-hot columns, concatenated in that order
