Machine Learning Academy · Lesson

Choosing k: The Elbow Method and Validation Curves

Learners will sweep k from 1 to 30, plot validation accuracy, and identify the sweet spot that balances bias and variance.

Why Choosing k Matters

The value of k is the main hyperparameter in KNN. Too small a k (e.g., k=1) makes the model highly sensitive to noise: each prediction depends on a single training point, so the model memorises noise rather than learning patterns. Too large a k over-smooths the decision boundary and can blur truly distinct classes together. Choosing k is therefore a bias-variance trade-off: small k gives low bias and high variance; large k gives high bias and low variance. The elbow method and validation curves help identify a good k empirically.

# k=1: memorises the training set
# Training accuracy = 100% (barring duplicate points with conflicting labels),
# test accuracy typically lower (overfitting)

# k=N (every training point is a neighbor): always predicts the majority class
# Accuracy = fraction of samples in the majority class (underfitting)

# Optimal k: somewhere in between
# Maximises validation / cross-validated accuracy

print('k=1  overfits (memorises noise)')
print('k=N  underfits (ignores all variation)')
print('Best k: maximises cross-validated accuracy')
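
The three regimes described above can be checked on real data. The following is a minimal sketch, not a definitive experiment: the Iris dataset, the stratified 80/20 split, and k=7 as an arbitrary "middle" value are all assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Assumption: a stratified split, so each class is equally represented
# in the training set (40 samples per class out of 120).
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on the training split only.
scaler = StandardScaler()
X_tr = scaler.fit_transform(X_train)
X_v = scaler.transform(X_val)

n_train = X_tr.shape[0]
results = {}  # k -> (training accuracy, validation accuracy)

# k=1 (memorise), k=7 (arbitrary middle value), k=n_train (every point votes)
for k in (1, 7, n_train):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_train)
    results[k] = (knn.score(X_tr, y_train), knn.score(X_v, y_val))
    print(f"k={k:>3}  train={results[k][0]:.3f}  val={results[k][1]:.3f}")
```

With k equal to the training-set size, every query sees the same neighbors, so the model predicts one class for everything. Because the classes here are balanced, that gives an accuracy of about one third on both splits.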

Sweeping k Values: The Basic Loop

The simplest approach is to train KNN for a range of k values, evaluate each on a validation set, and pick the k with the highest validation accuracy. Scikit-learn makes this straightforward: loop over k from 1 to some maximum, then fit and score each model. Always evaluate on a held-out validation set or use cross-validation. Evaluating on the training set would almost always select k=1, because with k=1 each training point is its own nearest neighbor and is therefore predicted with 100% accuracy.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

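# Fit the scaler on the training split only, then reuse it for validation
scaler = StandardScaler()
X_tr = scaler.fit_transform(X_train)
X_v  = scaler.transform(X_val)

# Sweep k = 1..30 and record the validation accuracy of each model
val_scores = []
for k in range(1, 31):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_tr, y_train)
    val_scores.append(knn.score(X_v, y_val))

# index() returns the first maximum, so ties resolve to the smallest k
best_k = val_scores.index(max(val_scores)) + 1
# Built-in round() works whether score() returns a Python float or a NumPy float
print('Best k:', best_k, 'with accuracy:', round(max(val_scores), 3))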
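
A single 30-sample validation split gives a noisy estimate, so several values of k often tie. The cross-validation alternative mentioned above can be sketched with scikit-learn's `validation_curve`. The 5-fold stratified setup, the pipeline, and the variable names below are assumptions for illustration, not part of the lesson's original code.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, validation_curve
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scaling lives inside the pipeline so it is refit on each training fold,
# which keeps validation-fold statistics out of the scaler.
model = make_pipeline(StandardScaler(), KNeighborsClassifier())

k_values = np.arange(1, 31)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Both returned arrays have shape (len(k_values), n_splits).
train_scores, cv_scores = validation_curve(
    model,
    X,
    y,
    param_name="kneighborsclassifier__n_neighbors",  # step name set by make_pipeline
    param_range=k_values,
    cv=cv,
    scoring="accuracy",
)

mean_train = train_scores.mean(axis=1)  # training accuracy per k
mean_cv = cv_scores.mean(axis=1)        # cross-validated accuracy per k

# argmax returns the first maximum, so ties resolve to the smallest k
best_k_cv = int(k_values[np.argmax(mean_cv)])
print("Best k by 5-fold CV:", best_k_cv,
      "mean accuracy:", round(float(mean_cv.max()), 3))
```

Plotting `mean_train` and `mean_cv` against `k_values` gives the validation curve. Training accuracy starts at its highest for k=1 and tends to fall as k grows, while cross-validated accuracy typically rises, flattens, and eventually declines. The flat region is where the "elbow" is usually read off.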
