Choosing k: The Elbow Method and Validation Curves
Learners will sweep k from 1 to 30, plot validation accuracy, and identify the sweet spot that balances bias and variance.
Why Choosing k Matters
The value of k is usually the most influential hyperparameter in KNN. Too small a k (e.g., k=1) makes the model highly sensitive to noise: each training point defines its own prediction region, so the model memorises noise rather than learning patterns. Too large a k over-smooths the decision boundary and may merge genuinely distinct classes. Choosing k is therefore a bias-variance trade-off: small k gives low bias and high variance; large k gives high bias and low variance. The elbow method and validation curves help identify a good k empirically.
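The two extremes can be checked directly. The sketch below is a minimal illustration, assuming the Iris dataset and an 80/20 split with `random_state=42` (both chosen here for illustration only). It fits KNN with k=1 and with k equal to the number of training points, then compares training and validation accuracy.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Assumption: Iris and this particular split are illustrative choices
X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale using statistics from the training split only
scaler = StandardScaler()
X_tr = scaler.fit_transform(X_train)
X_v = scaler.transform(X_val)

n_train = len(X_tr)
results = {}
for k in (1, n_train):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_train)
    # Store (training accuracy, validation accuracy) for this k
    results[k] = (knn.score(X_tr, y_train), knn.score(X_v, y_val))
    print(f"k={k}: train={results[k][0]:.3f}, val={results[k][1]:.3f}")

# With k = n_train every query sees the same neighbors (the whole training
# set), so the training accuracy equals the largest class fraction
majority_fraction = np.bincount(y_train).max() / n_train
print(f"majority-class fraction in training set: {majority_fraction:.3f}")
```

With k=1 each training point is its own nearest neighbor, so training accuracy says nothing about generalisation; with k equal to the training-set size the prediction no longer depends on the input at all.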
# k=1: memorises the training set
# Training accuracy = 100% (barring duplicate points with conflicting labels); validation accuracy is usually lower (overfitting)
# k=N (all training points are neighbors): always predicts the majority class
# Training accuracy = majority-class fraction (underfitting)
# Best k: somewhere in between
# Maximises validation (or cross-validated) accuracy
print('k=1 overfits (memorises noise)')
print('k=N underfits (ignores all variation)')
print('Best k: maximises cross-validated accuracy')

Sweeping k Values: The Basic Loop
The simplest approach is to train KNN for a range of k values, evaluate each on a validation set, and pick the k with the highest validation accuracy. Scikit-learn makes this straightforward: loop over k from 1 to some maximum, then fit and score each model. Always evaluate on a held-out validation set or use cross-validation. Evaluating on the training set would always select k=1, because with k=1 each training point is its own nearest neighbor and is predicted correctly, giving 100% training accuracy (barring duplicate points with conflicting labels).
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler()
X_tr = scaler.fit_transform(X_train)
X_v = scaler.transform(X_val)

# Sweep k from 1 to 30 and record validation accuracy for each value
val_scores = []
for k in range(1, 31):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_tr, y_train)
    val_scores.append(knn.score(X_v, y_val))

# index() returns the first maximum, so ties resolve to the smallest k
best_k = val_scores.index(max(val_scores)) + 1
print('Best k:', best_k, 'with accuracy:', round(max(val_scores), 3))
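A single validation split of a small dataset gives noisy scores, so the selected k can change with the random seed. As a hedged sketch of the cross-validation alternative, the block below uses scikit-learn's `validation_curve` with a pipeline so that scaling is refitted inside each fold. The 5-fold setting and the variable names (`ks`, `cv_scores`, `best_k_cv`) are assumptions made for this example, not part of the lesson's code.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import validation_curve
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Putting the scaler in the pipeline means each fold fits its own scaler
# on that fold's training portion only
model = make_pipeline(StandardScaler(), KNeighborsClassifier())

ks = list(range(1, 31))

# Assumption: 5-fold cross-validation (stratified by default for classifiers)
train_scores, cv_scores = validation_curve(
    model,
    X,
    y,
    param_name="kneighborsclassifier__n_neighbors",
    param_range=ks,
    cv=5,
    scoring="accuracy",
)

# Each row corresponds to one k, each column to one fold
mean_train = train_scores.mean(axis=1)
mean_cv = cv_scores.mean(axis=1)

# argmax returns the first maximum, so ties resolve to the smallest k
best_k_cv = ks[int(np.argmax(mean_cv))]
print("Best k by 5-fold CV:", best_k_cv)
print("Mean CV accuracy at best k:", round(float(mean_cv.max()), 3))
```

Plotting `mean_train` and `mean_cv` against `ks` gives the validation curve: the training curve typically starts at 1.0 for k=1 and falls as k grows, while the cross-validated curve tends to rise and then flatten or decline. Differences between nearby k values are often within fold-to-fold noise on a dataset this small, so the exact maximum should be read with some caution.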