Cross-Validating and Grid-Searching a Full Pipeline
Learners will pass a Pipeline to cross_val_score and GridSearchCV, using double-underscore notation to specify hyperparameters of individual steps.
Why Grid-Search a Full Pipeline?
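Hyperparameter tuning should be combined with cross-validation so that the chosen settings are not overfitted to a single validation split. When your preprocessing steps have their own parameters (e.g., PCA's n_components, OneHotEncoder's drop), those should be tuned jointly with the model's parameters, because the best model settings can depend on how the data was preprocessed. Wrapping everything in a Pipeline and passing it to GridSearchCV does this without leakage: for every parameter combination and every fold, the preprocessing steps are fitted on the training portion only. Each parameter is addressed as the step name, a double underscore, and the parameter name (for example, lr__C for the C of a step named 'lr').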
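As a minimal sketch of the idea above, the following grid-searches a preprocessing parameter and a model parameter together. The wine dataset, the step names ('sc', 'pca', 'lr'), and the candidate values in the grid are assumptions chosen for illustration, not recommended settings.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Step names are illustrative; they become the prefixes used in the grid.
pipe = Pipeline([
    ('sc', StandardScaler()),
    ('pca', PCA()),
    ('lr', LogisticRegression(max_iter=1000)),
])

# '<step name>__<parameter name>' addresses a parameter of one step.
# The candidate values are arbitrary examples.
param_grid = {
    'pca__n_components': [2, 5, 10],
    'lr__C': [0.1, 1.0, 10.0],
}

# Every combination (3 x 3 = 9) is cross-validated with 5 folds; within each
# fold the scaler and PCA are fitted on the training portion only.
search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
search.fit(X, y)

print(search.best_params_)
print(f'Best mean CV accuracy: {search.best_score_:.4f}')
```

With the default refit=True, search.best_estimator_ is the whole pipeline refitted on all of X and y using the best combination, so it can be used directly for prediction.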
Passing a Pipeline to cross_val_score
The simplest way to evaluate a Pipeline fairly is cross_val_score(pipeline, X, y, cv=5). For each of the five folds, a fresh copy of the entire pipeline (scaler and classifier included) is fitted on the training portion and then scored on the held-out fold. The function returns one score per fold; the mean estimates generalisation performance, and the standard deviation shows how much that estimate varies from fold to fold.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_wine
import numpy as np
X, y = load_wine(return_X_y=True)
pipe = Pipeline([
    ('sc', StandardScaler()),
    ('lr', LogisticRegression(C=1.0, max_iter=300))
])
scores = cross_val_score(pipe, X, y, cv=5, scoring='accuracy')
print(f'CV accuracy: {np.mean(scores):.4f} +/- {np.std(scores):.4f}')
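To make concrete what cross_val_score does on each fold, the loop below reproduces it by hand for the same scaler-plus-classifier pipeline. It is a sketch for illustration only: it assumes the splitter is an unshuffled StratifiedKFold, which is what scikit-learn documents for an integer cv with a classifier, and the variable names are made up for this example.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
pipe = Pipeline([
    ('sc', StandardScaler()),
    ('lr', LogisticRegression(C=1.0, max_iter=300)),
])

# Assumed to match what cv=5 resolves to for a classifier.
cv = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, test_idx in cv.split(X, y):
    # Fresh, unfitted copy so nothing learned on one fold carries over.
    fold_pipe = clone(pipe)
    # The scaler's mean and variance come from the training rows only.
    fold_pipe.fit(X[train_idx], y[train_idx])
    # Held-out rows are transformed with those training statistics, then scored.
    manual_scores.append(fold_pipe.score(X[test_idx], y[test_idx]))
manual_scores = np.array(manual_scores)

# The built-in helper, for comparison with the manual loop.
builtin_scores = cross_val_score(pipe, X, y, cv=5, scoring='accuracy')

print('manual :', np.round(manual_scores, 4))
print('builtin:', np.round(builtin_scores, 4))
```

The point of the loop is that the scaler never sees the held-out rows during fitting, which is exactly the leakage that scaling the full dataset before splitting would introduce.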