Baseline Models: Always Beat the Dummy Classifier
Learners will build a DummyClassifier as a minimum baseline and confirm that every real model must outperform it before being considered useful.
Why You Need a Baseline
When you report that your model achieves 92% accuracy, that number is meaningless without context. Is 92% impressive or disappointing? It depends entirely on what the simplest possible strategy would achieve on the same problem.
A baseline model sets the floor that every real model must beat to be considered useful. Without one, you might celebrate 92% accuracy on a fraud dataset where a rule that always predicts 'not fraud' would achieve 95% accuracy. Measured by accuracy, your sophisticated ML model would then be worse than a rule that never looks at the data.
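The arithmetic behind that comparison can be checked in a few lines. This is a minimal sketch with made-up numbers: the 1,000 transactions and the 5% fraud rate are assumptions chosen to match the 95% figure above, not data from a real dataset.

```python
# Hypothetical label set (assumed for illustration): 1,000 transactions, 5% fraud
n_total = 1000
n_fraud = 50
y_true = [1] * n_fraud + [0] * (n_total - n_fraud)  # 1 = fraud, 0 = not fraud

# "Do nothing" strategy: always predict the majority class, 'not fraud'
y_pred = [0] * n_total

# Accuracy = fraction of predictions that match the true labels
baseline_acc = sum(t == p for t, p in zip(y_true, y_pred)) / n_total

# Accuracy reported for the hypothetical real model in the text
model_acc = 0.92

print(f"Always-'not fraud' accuracy: {baseline_acc:.2f}")
print(f"Model beats baseline: {model_acc > baseline_acc}")
```

Note that the always-'not fraud' rule reaches its accuracy without catching a single fraud case, which is why the baseline matters most on imbalanced data.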
The DummyClassifier: scikit-learn's Baseline Tool
Scikit-learn provides DummyClassifier, a minimal classifier that makes predictions using simple rules that ignore the input features entirely, such as always predicting the most frequent class in the training labels. It is the standard tool for establishing a baseline before any real modelling begins.
DummyClassifier is not a joke or a placeholder; it is a rigorous sanity check. If your real model cannot beat the DummyClassifier, something is fundamentally wrong: your features may not be predictive, your pipeline may have a bug, or the problem may be harder than expected.
from sklearn.dummy import DummyClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Most frequent baseline: always predicts the majority class
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
baseline_acc = dummy.score(X_test, y_test)
print(f'Baseline accuracy (always predict majority class): {baseline_acc:.3f}')
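To confirm that a real model clears the baseline, one option is to fit both on the same split and compare their test accuracy. The sketch below reuses the breast cancer data and split from the example above. The choice of a scaled LogisticRegression as the "real" model is an assumption made for illustration; any classifier you are evaluating can take its place.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same data and split as the baseline example
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline: always predicts the majority class seen in y_train
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
baseline_acc = dummy.score(X_test, y_test)

# Candidate model (assumed choice): standardised features + logistic regression
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
model_acc = model.score(X_test, y_test)

print(f'Baseline accuracy: {baseline_acc:.3f}')
print(f'Model accuracy:    {model_acc:.3f}')
print(f'Improvement over baseline: {model_acc - baseline_acc:+.3f}')

# A model that fails this check should not be considered useful yet
if model_acc <= baseline_acc:
    print('Model does not beat the baseline: check features and pipeline.')
```

Both classifiers are scored on the same held-out test set, so the difference reflects what the model learned from the features rather than a difference in the data.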