train_test_split: Ratios, Seeds, and Stratification
Learners will use sklearn's train_test_split with various test sizes, set random seeds for reproducibility, and apply stratification so that class proportions stay balanced across the splits.
The train_test_split Function
Scikit-learn's train_test_split() function is the standard tool for partitioning datasets into training and test subsets. By default it shuffles the data and allocates the specified proportion to the test set. Given a feature matrix X and a label vector y, it returns four arrays: X_train, X_test, y_train, y_test.
Understanding its parameters fully is important because poor splitting choices can compromise the validity of all subsequent evaluation. A too-small test set gives noisy estimates; a too-large test set wastes training data. The right choice depends on your dataset size and the variance you can afford in performance estimates.
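Two of those parameters, random_state and stratify, are easiest to appreciate by watching what they do. The snippet below is a minimal sketch rather than a definitive recipe: the seed value 0 and the variable names are arbitrary choices made for illustration. It checks that a fixed seed reproduces the same split, and compares the share of class 1 in the test set with and without stratification.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Same seed on two separate calls: the resulting test sets should be identical.
_, X_test_a, _, y_test_a = train_test_split(X, y, test_size=0.2, random_state=0)
_, X_test_b, _, y_test_b = train_test_split(X, y, test_size=0.2, random_state=0)
same_split = np.array_equal(X_test_a, X_test_b) and np.array_equal(y_test_a, y_test_b)

# Stratified split: class proportions in the test set should track the full dataset.
_, _, _, y_test_s = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

full_rate = y.mean()          # fraction of class 1 in the full dataset
plain_rate = y_test_a.mean()  # fraction of class 1 in the unstratified test set
strat_rate = y_test_s.mean()  # fraction of class 1 in the stratified test set

print(f'same seed gives same split: {same_split}')
print(f'class-1 share: full={full_rate:.3f}, plain={plain_rate:.3f}, stratified={strat_rate:.3f}')
```

On a dataset this size the unstratified share may already land close to the full-data share. The gap tends to matter more when classes are heavily imbalanced or the test set is small.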
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
import numpy as np
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.20,   # 20% of data goes to test set
    random_state=42,  # reproducible random shuffle
    shuffle=True,     # default: shuffle before splitting
    stratify=y,       # preserve class proportions
)
print(f'Training size: {X_train.shape}')
print(f'Test size: {X_test.shape}')

Choosing the Test Size Ratio
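The test_size parameter accepts either a float (the proportion of samples to hold out, between 0 and 1) or an integer (the absolute number of test samples, for example test_size=100). Rough guidelines by dataset size:
- Large datasets (>100k examples): 10-15% test is usually enough. Plenty of data remains for training, and even a 10% slice holds more than 10,000 examples, which gives very stable estimates.
- Medium datasets (1k–100k): 20-30% test is standard (80/20 or 70/30 splits).
- Small datasets (<1k): consider cross-validation instead of a single split, because any one fixed split may turn out unusually easy or hard by chance.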
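The small-dataset advice can be sketched as follows. This is an illustration under stated assumptions, not the course's own method: the 150-row subsample stands in for a small dataset, the ten seeds are arbitrary, and the StandardScaler pipeline is added here only to help the solver converge. It scores several single 80/20 splits that differ only in their seed, then runs 5-fold cross-validation on the same data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Assumption for illustration: keep 150 rows to mimic a small dataset.
X_small, _, y_small, _ = train_test_split(
    X, y, train_size=150, random_state=0, stratify=y
)

# Scaling is an addition made here so LogisticRegression converges cleanly.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Score single 80/20 splits that differ only in the random seed.
single_scores = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_small, y_small, test_size=0.2, random_state=seed, stratify=y_small
    )
    single_scores.append(model.fit(X_tr, y_tr).score(X_te, y_te))

# 5-fold cross-validation: every row is used for testing exactly once.
cv_scores = cross_val_score(model, X_small, y_small, cv=5)

print(f'single splits: min={min(single_scores):.3f}, max={max(single_scores):.3f}')
print(f'5-fold CV: mean={cv_scores.mean():.3f}, std={cv_scores.std():.3f}')
```

The spread between the minimum and maximum single-split scores shows how much one seed can flatter or penalize a model. The cross-validated mean averages over several such splits.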
There is no universally correct ratio. The goal is enough test examples for a statistically stable performance estimate, while keeping enough training data for the model to learn well.
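One rough way to put a number on "statistically stable" is the binomial standard error of an accuracy estimate, sqrt(p(1 - p) / n), where p is the true accuracy and n is the number of test examples. The sketch below assumes a hypothetical accuracy of 0.95 and uses a normal approximation, which is crude for very small n or accuracies near 1. The helper name accuracy_standard_error is made up for this example.

```python
import math


def accuracy_standard_error(accuracy: float, n_test: int) -> float:
    """Binomial standard error of an accuracy measured on n_test examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_test)


assumed_accuracy = 0.95  # hypothetical value, chosen only for illustration

for n_test in [50, 100, 1000, 10000]:
    se = accuracy_standard_error(assumed_accuracy, n_test)
    # 1.96 standard errors gives an approximate 95% interval half-width.
    print(f'n_test={n_test:>6}: {assumed_accuracy:.2f} +/- {1.96 * se:.3f}')
```

Because the error shrinks with the square root of n, quadrupling the test set only halves the uncertainty. This is why very large datasets can afford a smaller test fraction.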
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
import numpy as np
X, y = load_breast_cancer(return_X_y=True)
# Compare different test sizes
for test_size in [0.1, 0.2, 0.3, 0.4]:
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=42
    )
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    acc = model.score(X_test, y_test)
    print(f'test_size={test_size}: train={len(X_train)}, test={len(X_test)}, acc={acc:.3f}')