Machine Learning Academy · Lesson

train_test_split: Ratios, Seeds, and Stratification

Learners will use sklearn's train_test_split with various test sizes, set random seeds for reproducibility, and apply stratification so that both splits preserve the original class proportions.

The train_test_split Function

Scikit-learn's train_test_split() function is the standard tool for partitioning a dataset into training and test subsets. By default it shuffles the data and assigns the specified proportion to the test set. Given a feature array X and a label array y, it returns four arrays: X_train, X_test, y_train, y_test.

Understanding its parameters matters because a poor split undermines every evaluation that follows. A test set that is too small gives noisy estimates; one that is too large leaves less data for training. The right choice depends on your dataset size and on how much variance you can tolerate in the performance estimate.

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.20,     # 20% of data goes to test set
    random_state=42,    # reproducible random shuffle
    shuffle=True,       # default: shuffle before splitting
    stratify=y          # preserve class proportions
)

print(f'Training size: {X_train.shape}')
print(f'Test size:     {X_test.shape}')
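
The comments in the call above say that random_state makes the shuffle reproducible and that stratify=y preserves class proportions. The following is a minimal sketch for checking both claims on the same dataset; the variable names (same_split, full_rate, and so on) are illustrative and not part of scikit-learn.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Two calls with the same seed should produce exactly the same split.
_, X_test_a, _, y_test_a = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)
_, X_test_b, _, y_test_b = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)
same_split = np.array_equal(X_test_a, X_test_b) and np.array_equal(y_test_a, y_test_b)
print(f'Same seed gives the same test set: {same_split}')

# With stratify=y, the share of class 1 should be nearly identical
# in the full data, the training set, and the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)
full_rate = y.mean()
train_rate = y_train.mean()
test_rate = y_test.mean()
print(f'Class 1 share: full={full_rate:.3f}, train={train_rate:.3f}, test={test_rate:.3f}')
```

Changing random_state to a different integer gives a different (but equally reproducible) split, while the class shares should stay close to the full-data value as long as stratify=y is kept.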

Choosing the Test Size Ratio

The test_size parameter accepts either a float (the proportion of the data to hold out) or an integer (an absolute number of test examples). As rough guidelines by dataset size:

Large datasets (>100k examples): use 10-15% test (even a small fraction contains thousands of examples, which is enough for stable estimates, and the rest stays available for training).
Medium datasets (1k–100k): 20-30% test is standard (80/20 or 70/30 splits).
Small datasets (<1k): consider cross-validation instead of a single split, because any single fixed split may be unrepresentative by chance.
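
As a quick check of the two forms of test_size, the sketch below passes a float and then an integer to the same call and prints the resulting test-set sizes. The variable names are illustrative only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
n_samples = len(X)

# Float: interpreted as a fraction of the dataset.
_, X_test_frac, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)

# Integer: interpreted as an absolute number of test examples.
_, X_test_abs, _, _ = train_test_split(X, y, test_size=100, random_state=42)

print(f'Total examples:        {n_samples}')
print(f'test_size=0.2 (float): {len(X_test_frac)} test examples')
print(f'test_size=100 (int):   {len(X_test_abs)} test examples')
```

The integer form can be convenient when you want the same test-set size across datasets of different lengths.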

There is no universally correct ratio. The goal is enough test examples for a statistically stable performance estimate, while keeping enough training data for the model to learn well.
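
To put a rough number on "statistically stable", one common approximation treats test accuracy as a binomial proportion, which gives a standard error of sqrt(p(1-p)/n) for accuracy p on n test examples. The sketch below assumes independent test examples and uses a hypothetical accuracy of 0.95; the helper accuracy_standard_error is illustrative, not a scikit-learn function.

```python
import math


def accuracy_standard_error(accuracy: float, n_test: int) -> float:
    """Binomial standard error of an accuracy estimate on n_test examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_test)


assumed_accuracy = 0.95  # hypothetical value, chosen only for illustration

standard_errors = {}
for n_test in [50, 100, 1_000, 10_000]:
    se = accuracy_standard_error(assumed_accuracy, n_test)
    standard_errors[n_test] = se
    # 1.96 * se is the half-width of an approximate 95% interval.
    print(f'n_test={n_test:>6}: standard error ~ {se:.4f}, 95% half-width ~ {1.96 * se:.4f}')
```

Because the error shrinks with the square root of n, quadrupling the test set only halves the uncertainty, which is why very large datasets can afford a smaller test fraction.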

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Compare different test sizes
for test_size in [0.1, 0.2, 0.3, 0.4]:
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=42)
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    acc = model.score(X_test, y_test)
    print(f'test_size={test_size}: train={len(X_train)}, test={len(X_test)}, acc={acc:.3f}')
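
The loop above uses a single seed, so each accuracy reflects one particular shuffle. A rough way to see how much the estimate depends on the split is to repeat it over several seeds and look at the spread. The sketch below is one way to do that; the helper accuracy_over_seeds is hypothetical, and the StandardScaler step is an added assumption (not part of the loop above) intended to help the solver converge.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)


def accuracy_over_seeds(test_size, seeds):
    """Return test accuracies for one test_size across several random seeds."""
    scores = []
    for seed in seeds:
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=seed, stratify=y
        )
        # Scaling is fitted on the training split only, via the pipeline.
        model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
        model.fit(X_train, y_train)
        scores.append(model.score(X_test, y_test))
    return np.array(scores)


seeds = range(10)
for test_size in [0.1, 0.2, 0.3, 0.4]:
    scores = accuracy_over_seeds(test_size, seeds)
    print(
        f'test_size={test_size}: mean={scores.mean():.3f}, '
        f'std={scores.std():.3f}, min={scores.min():.3f}, max={scores.max():.3f}'
    )
```

If the gap between min and max is large relative to the differences you care about between models, a single split is probably too noisy and cross-validation is worth considering.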
