Gradient Boosting
Gradient boosting is the most powerful algorithm family for tabular data. It builds an ensemble of weak learners (usually shallow trees) sequentially, where each new tree corrects the errors of the previous ones. This page covers the theory from AdaBoost to XGBoost/LightGBM/CatBoost with full derivations.
Boosting vs Bagging
| Aspect | Bagging (Random Forest) | Boosting (GBM) |
|---|---|---|
| Training | Parallel — trees are independent | Sequential — each tree depends on previous |
| Goal | Reduce variance | Reduce bias |
| Weighting | Equal weight for all trees | Later trees weighted by learning rate |
| Tree depth | Deep trees (low bias) | Shallow trees (high bias, low variance) |
| Overfitting risk | Low | Higher — need regularization |
AdaBoost
The Algorithm
AdaBoost (Adaptive Boosting) was the first successful boosting algorithm. It re-weights misclassified samples so subsequent classifiers focus on hard examples.
Algorithm:
- Initialize sample weights:
for all - For
: - Train weak classifier
on weighted data - Compute weighted error:
- Compute classifier weight:
- Update sample weights:
- Normalize weights
- Train weak classifier
- Final prediction:
Worked Example — AdaBoost Weight Updates
Mini Dataset (6 samples, labels in {-1, +1}):
| Sample | y | h1(x) | Correct? |
|---|---|---|---|
| 1 | +1 | +1 | Yes |
| 2 | +1 | +1 | Yes |
| 3 | -1 | -1 | Yes |
| 4 | -1 | +1 | No |
| 5 | +1 | -1 | No |
| 6 | -1 | -1 | Yes |
Step 1: Initialize weights: w_i = 1/6 for all i
Step 2: Compute weighted error (samples 4 and 5 are wrong) epsilon_1 = w4 + w5 = 1/6 + 1/6 = 2/6 = 0.333
Step 3: Compute classifier weight alpha_1 = 0.5 * ln((1 - 0.333) / 0.333) = 0.5 * ln(0.667 / 0.333) = 0.5 * ln(2.0) = 0.5 * 0.693 = 0.347
Step 4: Update sample weights Correct predictions (y_i * h1(x_i) = +1): multiply by exp(-0.347) = 0.707 Wrong predictions (y_i * h1(x_i) = -1): multiply by exp(0.347) = 1.414
w1 = (1/6)(0.707) = 0.118 w2 = (1/6)(0.707) = 0.118 w3 = (1/6)(0.707) = 0.118 w4 = (1/6)(1.414) = 0.236 (misclassified -> weight doubled) w5 = (1/6)(1.414) = 0.236 (misclassified -> weight doubled) w6 = (1/6)(0.707) = 0.118
Step 5: Normalize (sum = 0.944) w1=w2=w3=w6 = 0.118/0.944 = 0.125 w4=w5 = 0.236/0.944 = 0.250
Interpret: "Misclassified samples (4 and 5) now have double the weight (0.250 vs 0.125). The next weak learner will focus more on getting these hard examples right."
# adaboost_scratch.py — AdaBoost from scratch
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
class AdaBoostScratch:
"""AdaBoost classifier from scratch."""
def __init__(self, n_estimators=50):
self.n_estimators = n_estimators
self.alphas = []
self.stumps = []
def fit(self, X, y):
n_samples = len(y)
# Convert labels to {-1, +1}
y_signed = np.where(y == 1, 1, -1)
# Initialize uniform weights
weights = np.ones(n_samples) / n_samples
for t in range(self.n_estimators):
# Train weak learner (decision stump)
stump = DecisionTreeClassifier(max_depth=1)
stump.fit(X, y_signed, sample_weight=weights)
pred = stump.predict(X)
# Weighted error
incorrect = (pred != y_signed)
epsilon = np.sum(weights * incorrect) / np.sum(weights)
# Avoid division by zero
epsilon = np.clip(epsilon, 1e-10, 1 - 1e-10)
# Classifier weight
alpha = 0.5 * np.log((1 - epsilon) / epsilon)
# Update sample weights
weights *= np.exp(-alpha * y_signed * pred)
weights /= weights.sum() # normalize
self.alphas.append(alpha)
self.stumps.append(stump)
return self
def predict(self, X):
# Weighted sum of predictions
pred_sum = sum(
alpha * stump.predict(X)
for alpha, stump in zip(self.alphas, self.stumps)
)
return np.where(pred_sum >= 0, 1, 0)
# Test
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
ada = AdaBoostScratch(n_estimators=100)
ada.fit(X_train, y_train)
y_pred = ada.predict(X_test)
print(f"AdaBoost (scratch) accuracy: {accuracy_score(y_test, y_pred):.4f}")
# Compare with sklearn
from sklearn.ensemble import AdaBoostClassifier
ada_sk = AdaBoostClassifier(n_estimators=100, random_state=42)
ada_sk.fit(X_train, y_train)
print(f"AdaBoost (sklearn) accuracy: {ada_sk.score(X_test, y_test):.4f}")Gradient Boosting Machines (GBM)
The Key Insight
AdaBoost re-weights samples. Gradient boosting generalizes this by fitting each new tree to the negative gradient of the loss function — the pseudo-residuals.
For squared error loss
The negative gradient is just the residual. So each new tree fits the residuals of the previous ensemble.
Worked Example — Gradient Boosting Residuals (3 Iterations)
Dataset: predict house prices. Learning rate eta = 0.5.
| House | y (true) | F0 (mean) | Residual 1 | h1 pred | F1 | Residual 2 | h2 pred | F2 |
|---|---|---|---|---|---|---|---|---|
| A | 200 | 300 | -100 | -100 | 250 | -50 | -50 | 275 |
| B | 300 | 300 | 0 | 0 | 300 | 0 | 0 | 300 |
| C | 400 | 300 | 100 | 100 | 350 | 50 | 50 | 325 |
Step 1: Initialize F0 = mean(y) = (200+300+400)/3 = 300
Step 2: Iteration 1 — compute residuals and fit tree h1 Residuals = y - F0 = [-100, 0, 100] Tree h1 perfectly fits residuals: h1 = [-100, 0, 100] F1 = F0 + eta * h1 = [300+0.5*(-100), 300+0, 300+0.5*100] = [250, 300, 350]
Step 3: Iteration 2 — compute new residuals Residuals = y - F1 = [200-250, 300-300, 400-350] = [-50, 0, 50] Tree h2 fits: [-50, 0, 50] F2 = F1 + eta * h2 = [250-25, 300, 350+25] = [225, 300, 375]
Step 4: After many iterations, F converges to y
Interpret: "Each iteration halves the remaining error (because eta=0.5). After 2 rounds: House A error went from 100 to 75 to ~56. The learning rate prevents overfitting by taking small steps. Without it (eta=1), we'd overfit in one step."
GBM Derivation
Given a differentiable loss function
Initialize:
For
: - Compute pseudo-residuals:
- Fit a tree
to pseudo-residuals: - Find optimal step size:
- Update:
- Compute pseudo-residuals:
where
# gbm_scratch.py — Gradient Boosting from scratch for regression
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
class GradientBoostingRegressorScratch:
"""Gradient boosting regressor from scratch (squared error loss)."""
def __init__(self, n_estimators=100, learning_rate=0.1, max_depth=3):
self.n_estimators = n_estimators
self.lr = learning_rate
self.max_depth = max_depth
self.trees = []
self.initial_prediction = None
def fit(self, X, y):
# Step 1: Initialize with mean
self.initial_prediction = np.mean(y)
F = np.full(len(y), self.initial_prediction)
self.trees = []
for t in range(self.n_estimators):
# Step 2: Compute pseudo-residuals (negative gradient of MSE)
residuals = y - F
# Step 3: Fit tree to residuals
tree = DecisionTreeRegressor(max_depth=self.max_depth)
tree.fit(X, residuals)
# Step 4: Update predictions
F += self.lr * tree.predict(X)
self.trees.append(tree)
if (t + 1) % 20 == 0:
mse = mean_squared_error(y, F)
print(f" Iteration {t+1}: train MSE = {mse:.4f}")
return self
def predict(self, X):
F = np.full(X.shape[0], self.initial_prediction)
for tree in self.trees:
F += self.lr * tree.predict(X)
return F
# Test on California Housing
housing = fetch_california_housing()
X, y = housing.data, housing.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
print("Training from-scratch GBM:")
gbm = GradientBoostingRegressorScratch(n_estimators=100, learning_rate=0.1, max_depth=4)
gbm.fit(X_train, y_train)
y_pred = gbm.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"\nFrom-scratch RMSE: {rmse:.4f}")
# Compare with sklearn
from sklearn.ensemble import GradientBoostingRegressor
gbm_sk = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
max_depth=4, random_state=42)
gbm_sk.fit(X_train, y_train)
y_pred_sk = gbm_sk.predict(X_test)
rmse_sk = np.sqrt(mean_squared_error(y_test, y_pred_sk))
print(f"sklearn RMSE: {rmse_sk:.4f}")XGBoost Internals
What XGBoost Adds to GBM
- Regularized objective: Adds L1 and L2 penalties on leaf weights
- Second-order approximation: Uses both gradient and Hessian
- Histogram-based splitting: Bins continuous features for speed
- Column subsampling: Like random forests, samples features
- Sparsity-aware: Handles missing values natively
- Parallel tree construction: Parallelizes feature-level splitting
Regularized Objective
XGBoost minimizes:
where the regularization term is:
= number of leaves = weight (prediction) in leaf = penalty per leaf (controls tree complexity) = L2 regularization on leaf weights
Second-Order Taylor Expansion
At iteration
where:
(gradient) (Hessian)
The optimal weight for leaf
The optimal split gain is:
Worked Example — XGBoost Split Gain
6 samples with MSE loss (g_i = -(y_i - y_hat), h_i = 1 for all), lambda=1, gamma=0.5:
| Sample | y | y_hat | g_i (gradient) | h_i (hessian) |
|---|---|---|---|---|
| 1 | 3 | 2.5 | -0.5 | 1 |
| 2 | 5 | 2.5 | -2.5 | 1 |
| 3 | 4 | 2.5 | -1.5 | 1 |
| 4 | 1 | 2.5 | 1.5 | 1 |
| 5 | 2 | 2.5 | 0.5 | 1 |
| 6 | 0 | 2.5 | 2.5 | 1 |
Split: samples {1,2,3} go left, {4,5,6} go right.
Step 1: Compute sums Left: sum(g_L) = -0.5 + (-2.5) + (-1.5) = -4.5, sum(h_L) = 3 Right: sum(g_R) = 1.5 + 0.5 + 2.5 = 4.5, sum(h_R) = 3 All: sum(g) = -4.5 + 4.5 = 0.0, sum(h) = 6
Step 2: Compute each term Left score = (-4.5)^2 / (3 + 1) = 20.25 / 4 = 5.0625 Right score = (4.5)^2 / (3 + 1) = 20.25 / 4 = 5.0625 Parent score = (0.0)^2 / (6 + 1) = 0 / 7 = 0.0
Step 3: Compute gain Gain = 0.5 * (5.0625 + 5.0625 - 0.0) - 0.5 = 0.5 * 10.125 - 0.5 = 5.0625 - 0.5 = 4.5625
Step 4: Optimal leaf weights w_L = -(-4.5)/(3+1) = 4.5/4 = 1.125 w_R = -(4.5)/(3+1) = -4.5/4 = -1.125
Interpret: "The gain of 4.56 exceeds gamma=0.5, so this split is worthwhile. Left leaf predicts +1.125 (high values), right leaf predicts -1.125 (low values)."
# xgboost_detailed.py — XGBoost with detailed configuration
import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np
housing = fetch_california_housing()
X, y = housing.data, housing.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)
# Detailed parameter configuration
params = {
# Booster parameters
'booster': 'gbtree', # tree-based boosting
'eta': 0.1, # learning rate
'max_depth': 6, # tree depth
'min_child_weight': 1, # min sum of hessian in child
# Regularization
'gamma': 0.1, # min loss reduction for split
'lambda': 1.0, # L2 regularization
'alpha': 0.0, # L1 regularization
# Sampling
'subsample': 0.8, # row sampling
'colsample_bytree': 0.8, # column sampling per tree
'colsample_bylevel': 1.0, # column sampling per level
# Objective
'objective': 'reg:squarederror',
'eval_metric': 'rmse',
# Performance
'tree_method': 'hist', # histogram-based (fast)
'seed': 42,
}
evals = [(dtrain, 'train'), (dtest, 'test')]
model = xgb.train(
params, dtrain,
num_boost_round=500,
evals=evals,
early_stopping_rounds=20,
verbose_eval=50
)
y_pred = model.predict(dtest)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"\nXGBoost RMSE: {rmse:.4f}")
print(f"Best iteration: {model.best_iteration}")LightGBM Innovations
GOSS (Gradient-based One-Side Sampling)
Instead of using all data, GOSS keeps all instances with large gradients (which contribute most to information gain) and randomly samples instances with small gradients. This speeds up training while maintaining accuracy.
EFB (Exclusive Feature Bundling)
Many features are mutually exclusive (never non-zero simultaneously). EFB bundles these features together, reducing the effective number of features.
Leaf-Wise vs Level-Wise Growth
- Level-wise (XGBoost default): Grows tree layer by layer
- Leaf-wise (LightGBM): Grows the leaf with maximum delta loss, creating asymmetric trees that can be more accurate
# lightgbm_detailed.py — LightGBM with tuning
import lightgbm as lgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error
import numpy as np
housing = fetch_california_housing()
X, y = housing.data, housing.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# LightGBM native API
dtrain = lgb.Dataset(X_train, label=y_train)
dtest = lgb.Dataset(X_test, label=y_test, reference=dtrain)
params = {
'objective': 'regression',
'metric': 'rmse',
'boosting_type': 'gbdt', # or 'dart', 'goss'
'num_leaves': 31, # main complexity control
'learning_rate': 0.1,
'feature_fraction': 0.8, # colsample_bytree
'bagging_fraction': 0.8, # subsample
'bagging_freq': 5, # bagging every 5 iterations
'min_child_samples': 20,
'lambda_l1': 0.1,
'lambda_l2': 1.0,
'verbose': -1,
'seed': 42,
}
callbacks = [
lgb.early_stopping(20),
lgb.log_evaluation(50),
]
model = lgb.train(
params, dtrain,
num_boost_round=500,
valid_sets=[dtrain, dtest],
valid_names=['train', 'test'],
callbacks=callbacks,
)
y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"\nLightGBM RMSE: {rmse:.4f}")
print(f"Best iteration: {model.best_iteration}")
# Feature importance
importance = model.feature_importance(importance_type='gain')
for name, imp in sorted(zip(housing.feature_names, importance), key=lambda x: -x[1])[:5]:
print(f" {name}: {imp:.1f}")CatBoost: Ordered Boosting
Prediction Shift Problem
Standard boosting has a subtle bias: the model used to compute residuals was trained on the same data. CatBoost's ordered boosting solves this by maintaining a separate model for computing residuals for each training example, using only the examples that came before it in a random permutation.
Native Categorical Handling
CatBoost computes ordered target statistics for categorical features:
where
# catboost_detailed.py — CatBoost with categorical features
from catboost import CatBoostRegressor, Pool
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np
import pandas as pd
housing = fetch_california_housing(as_frame=True)
df = housing.frame
# Add categorical features for demonstration
df['income_bin'] = pd.cut(df['MedInc'], bins=5, labels=['very_low', 'low', 'mid', 'high', 'very_high'])
df['age_bin'] = pd.cut(df['HouseAge'], bins=4, labels=['new', 'recent', 'old', 'very_old'])
X = df.drop('MedHouseVal', axis=1)
y = df['MedHouseVal']
cat_features = ['income_bin', 'age_bin']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
model = CatBoostRegressor(
iterations=500,
depth=6,
learning_rate=0.1,
cat_features=cat_features,
l2_leaf_reg=3.0,
random_seed=42,
verbose=100,
early_stopping_rounds=20,
)
model.fit(X_train, y_train, eval_set=(X_test, y_test))
y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"\nCatBoost RMSE: {rmse:.4f}")Head-to-Head Benchmark
# benchmark.py — Comprehensive comparison
import numpy as np
import time
from sklearn.datasets import fetch_california_housing, load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier
from sklearn.metrics import mean_squared_error, accuracy_score
import xgboost as xgb
import lightgbm as lgb
from catboost import CatBoostRegressor, CatBoostClassifier
# --- Regression benchmark ---
housing = fetch_california_housing()
X, y = housing.data, housing.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
reg_models = {
'sklearn GBM': GradientBoostingRegressor(
n_estimators=200, max_depth=5, learning_rate=0.1, random_state=42),
'XGBoost': xgb.XGBRegressor(
n_estimators=200, max_depth=5, learning_rate=0.1,
random_state=42, eval_metric='rmse'),
'LightGBM': lgb.LGBMRegressor(
n_estimators=200, max_depth=5, learning_rate=0.1,
random_state=42, verbose=-1),
'CatBoost': CatBoostRegressor(
iterations=200, depth=5, learning_rate=0.1,
random_seed=42, verbose=0),
}
print("=== Regression: California Housing ===")
print(f"{'Model':<20} {'RMSE':>10} {'Train Time':>12}")
print("-" * 44)
for name, model in reg_models.items():
start = time.time()
model.fit(X_train, y_train)
train_time = time.time() - start
y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"{name:<20} {rmse:>10.4f} {train_time:>12.3f}s")
# --- Classification benchmark ---
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
clf_models = {
'sklearn GBM': GradientBoostingClassifier(
n_estimators=200, max_depth=5, learning_rate=0.1, random_state=42),
'XGBoost': xgb.XGBClassifier(
n_estimators=200, max_depth=5, learning_rate=0.1,
random_state=42, eval_metric='logloss'),
'LightGBM': lgb.LGBMClassifier(
n_estimators=200, max_depth=5, learning_rate=0.1,
random_state=42, verbose=-1),
'CatBoost': CatBoostClassifier(
iterations=200, depth=5, learning_rate=0.1,
random_seed=42, verbose=0),
}
print("\n=== Classification: Breast Cancer ===")
print(f"{'Model':<20} {'Accuracy':>10} {'Train Time':>12}")
print("-" * 44)
for name, model in clf_models.items():
start = time.time()
model.fit(X_train, y_train)
train_time = time.time() - start
acc = model.score(X_test, y_test)
print(f"{name:<20} {acc:>10.4f} {train_time:>12.3f}s")When to Use Which
| Library | Best For | Key Advantage | Weakness |
|---|---|---|---|
| sklearn GBM | Small data, learning | Simple API | Slow, no GPU |
| XGBoost | General purpose, competitions | Well-tuned defaults, regularization | Slower than LightGBM |
| LightGBM | Large data (>100K rows) | Fastest training | Can overfit with small data |
| CatBoost | Categorical features | No encoding needed, robust | Slower than LightGBM |
Quick Selection Rule
- Have categorical features? Start with CatBoost
- Very large dataset? Start with LightGBM
- Competition / need best accuracy? Try all three, ensemble top two
- Learning / prototyping? Start with sklearn GBM
Further Reading
- Decision Trees — The weak learner used in boosting
- Random Forests — Bagging alternative to boosting
- Python ML Ecosystem — Installation and API comparison
- Evaluation Metrics — How to measure boosting performance