XGBoost: Regularisation, Early Stopping, and Feature Importance
Learners will train an XGBClassifier, enable early stopping on a validation set, and plot feature importance scores to identify the most predictive columns.
What Is XGBoost?
XGBoost (eXtreme Gradient Boosting) is a highly optimised gradient boosting library that has featured in many winning Kaggle solutions since its release in 2014. It differs from scikit-learn's GradientBoostingClassifier in three major ways: (1) built-in L1 and L2 regularisation on the leaf weights of each tree to reduce overfitting, (2) a second-order Taylor expansion of the loss, so each tree is fitted using both the gradient and the curvature (Hessian) of the loss rather than the gradient alone, and (3) efficient approximate, histogram-based split finding that scales to datasets with millions of rows.
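Points (1) and (2) can be made concrete with the objective XGBoost minimises at boosting round t. The notation below follows the original XGBoost paper and is not defined elsewhere in this lesson: g_i and h_i are the first and second derivatives of the loss with respect to the previous prediction for row i, f_t is the new tree, T is its number of leaves, and w_j is the weight of leaf j. The constants γ, λ, and α correspond to the gamma, reg_lambda, and reg_alpha parameters.

```latex
\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^2 \right] + \Omega(f_t),
\qquad
\Omega(f_t) = \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^2 + \alpha \sum_{j=1}^{T} |w_j|
```

With α = 0, the optimal weight of leaf j is w_j = -G_j / (H_j + λ), where G_j and H_j are the sums of g_i and h_i over the rows falling in that leaf. A larger λ therefore shrinks every leaf weight toward zero, which is how the L2 penalty limits overfitting.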
Installing and Importing XGBoost
XGBoost is a standalone library that is installed separately from scikit-learn. It provides a scikit-learn-compatible API through XGBClassifier and XGBRegressor, so you can use it with cross_val_score, GridSearchCV, and Pipeline just like any scikit-learn estimator. The native XGBoost API instead uses xgb.DMatrix and xgb.train(), which offer more fine-grained control over early stopping and custom objectives.
# Install: pip install xgboost
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load the data and hold out 20% of the rows for testing.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train 200 shallow trees and report accuracy on the held-out rows.
model = xgb.XGBClassifier(n_estimators=200, learning_rate=0.1, max_depth=3,
                          eval_metric='logloss', random_state=42)
model.fit(X_train, y_train)
print('XGBoost test accuracy:', model.score(X_test, y_test))