Machine Learning Academy · Lesson

Cross-Validating and Grid-Searching a Full Pipeline

Learners will pass a Pipeline to cross_val_score and GridSearchCV, using double-underscore notation to specify hyperparameters of individual steps.

Why Grid-Search a Full Pipeline?

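Hyperparameter tuning should be combined with cross-validation so that settings are not chosen on the strength of a single, possibly lucky, validation split. When preprocessing steps have their own parameters (e.g., PCA's n_components, OneHotEncoder's drop), those should be tuned together with the model's parameters, because the best value for one often depends on the other. Wrapping everything in a Pipeline and passing it to GridSearchCV achieves this: every candidate combination is scored by cross-validation, and because the whole pipeline is re-fit on the training portion of each fold, no statistics from the held-out data leak into the preprocessing.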
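The idea can be sketched as follows. This is a minimal illustration, not a recommended configuration: the 'pca' step name, the wine dataset, and the grid values are assumptions chosen for the example. Each key in the grid is written as the step name, two underscores, then the parameter name.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Step names ('sc', 'pca', 'lr') are arbitrary labels; they become the
# prefixes used in the parameter grid below.
pipe = Pipeline([
    ('sc', StandardScaler()),
    ('pca', PCA()),
    ('lr', LogisticRegression(max_iter=1000)),
])

# Illustrative grid (assumed values): <step name>__<parameter name>
param_grid = {
    'pca__n_components': [2, 5, 8],
    'lr__C': [0.1, 1.0, 10.0],
}

# Every one of the 3 x 3 combinations is scored with 5-fold cross-validation;
# the whole pipeline is re-fit on the training part of each fold.
search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
search.fit(X, y)

print('Best parameters:', search.best_params_)
print(f'Best mean CV accuracy: {search.best_score_:.4f}')
```

With the default refit=True, search.best_estimator_ holds the winning pipeline re-fit on all of X and y, so it can be used directly for prediction.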

Passing a Pipeline to cross_val_score

The simplest way to evaluate a Pipeline fairly is cross_val_score(pipeline, X, y, cv=5). In each fold the entire pipeline, scaler and classifier alike, is re-fit on the training portion and then evaluated on the held-out fold, so the scaler never sees the data it is scored on. The mean of the returned array estimates generalisation performance, and the standard deviation indicates how much that estimate varies from fold to fold.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_wine
import numpy as np

# Load features and class labels as NumPy arrays
X, y = load_wine(return_X_y=True)

# Scaling and classification are bundled so they are always fit together
pipe = Pipeline([
    ('sc', StandardScaler()),
    ('lr', LogisticRegression(C=1.0, max_iter=300))
])

# Returns one accuracy score per fold (5 in total)
scores = cross_val_score(pipe, X, y, cv=5, scoring='accuracy')
print(f'CV accuracy: {np.mean(scores):.4f} +/- {np.std(scores):.4f}')
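The double-underscore notation named in the lesson objective can be tried on this same pipeline before moving on to GridSearchCV. The sketch below is a hand-rolled loop for illustration only; the candidate values for C and the results dictionary are assumptions, and GridSearchCV automates this kind of search.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

pipe = Pipeline([
    ('sc', StandardScaler()),
    ('lr', LogisticRegression(C=1.0, max_iter=300)),
])

# get_params() lists every tunable name; nested ones follow
# the pattern <step name>__<parameter name>.
print([name for name in pipe.get_params() if name.startswith('lr__')][:5])

# Assumed candidate values, purely for illustration
results = {}
for C in [0.1, 1.0, 10.0]:
    # 'lr__C' addresses the C parameter of the step named 'lr'
    pipe.set_params(lr__C=C)
    scores = cross_val_score(pipe, X, y, cv=5, scoring='accuracy')
    results[C] = scores.mean()
    print(f'C={C}: mean CV accuracy {results[C]:.4f}')
```

The same 'lr__C' string is what would appear as a key in a GridSearchCV parameter grid.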
