Skip to content
Unverified — AI-generated content. Help verify this page

Gradient Boosting

Gradient boosting is the most powerful algorithm family for tabular data. It builds an ensemble of weak learners (usually shallow trees) sequentially, where each new tree corrects the errors of the previous ones. This page covers the theory from AdaBoost to XGBoost/LightGBM/CatBoost with full derivations.


Boosting vs Bagging

AspectBagging (Random Forest)Boosting (GBM)
TrainingParallel — trees are independentSequential — each tree depends on previous
GoalReduce varianceReduce bias
WeightingEqual weight for all treesLater trees weighted by learning rate
Tree depthDeep trees (low bias)Shallow trees (high bias, low variance)
Overfitting riskLowHigher — need regularization

AdaBoost

The Algorithm

AdaBoost (Adaptive Boosting) was the first successful boosting algorithm. It re-weights misclassified samples so subsequent classifiers focus on hard examples.

Algorithm:

  1. Initialize sample weights: wi=1n for all i
  2. For t=1,,T:
    • Train weak classifier ht on weighted data
    • Compute weighted error: ϵt=i:ht(xi)yiwiiwi
    • Compute classifier weight: αt=12ln1ϵtϵt
    • Update sample weights: wiwiexp(αtyiht(xi))
    • Normalize weights
  3. Final prediction: H(x)=sign(t=1Tαtht(x))
Worked Example — AdaBoost Weight Updates

Mini Dataset (6 samples, labels in {-1, +1}):

Sampleyh1(x)Correct?
1+1+1Yes
2+1+1Yes
3-1-1Yes
4-1+1No
5+1-1No
6-1-1Yes

Step 1: Initialize weights: w_i = 1/6 for all i

Step 2: Compute weighted error (samples 4 and 5 are wrong) epsilon_1 = w4 + w5 = 1/6 + 1/6 = 2/6 = 0.333

Step 3: Compute classifier weight alpha_1 = 0.5 * ln((1 - 0.333) / 0.333) = 0.5 * ln(0.667 / 0.333) = 0.5 * ln(2.0) = 0.5 * 0.693 = 0.347

Step 4: Update sample weights Correct predictions (y_i * h1(x_i) = +1): multiply by exp(-0.347) = 0.707 Wrong predictions (y_i * h1(x_i) = -1): multiply by exp(0.347) = 1.414

w1 = (1/6)(0.707) = 0.118 w2 = (1/6)(0.707) = 0.118 w3 = (1/6)(0.707) = 0.118 w4 = (1/6)(1.414) = 0.236 (misclassified -> weight doubled) w5 = (1/6)(1.414) = 0.236 (misclassified -> weight doubled) w6 = (1/6)(0.707) = 0.118

Step 5: Normalize (sum = 0.944) w1=w2=w3=w6 = 0.118/0.944 = 0.125 w4=w5 = 0.236/0.944 = 0.250

Interpret: "Misclassified samples (4 and 5) now have double the weight (0.250 vs 0.125). The next weak learner will focus more on getting these hard examples right."

python
# adaboost_scratch.py — AdaBoost from scratch
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

class AdaBoostScratch:
    """AdaBoost classifier from scratch."""

    def __init__(self, n_estimators=50):
        self.n_estimators = n_estimators
        self.alphas = []
        self.stumps = []

    def fit(self, X, y):
        n_samples = len(y)
        # Convert labels to {-1, +1}
        y_signed = np.where(y == 1, 1, -1)

        # Initialize uniform weights
        weights = np.ones(n_samples) / n_samples

        for t in range(self.n_estimators):
            # Train weak learner (decision stump)
            stump = DecisionTreeClassifier(max_depth=1)
            stump.fit(X, y_signed, sample_weight=weights)
            pred = stump.predict(X)

            # Weighted error
            incorrect = (pred != y_signed)
            epsilon = np.sum(weights * incorrect) / np.sum(weights)

            # Avoid division by zero
            epsilon = np.clip(epsilon, 1e-10, 1 - 1e-10)

            # Classifier weight
            alpha = 0.5 * np.log((1 - epsilon) / epsilon)

            # Update sample weights
            weights *= np.exp(-alpha * y_signed * pred)
            weights /= weights.sum()  # normalize

            self.alphas.append(alpha)
            self.stumps.append(stump)

        return self

    def predict(self, X):
        # Weighted sum of predictions
        pred_sum = sum(
            alpha * stump.predict(X)
            for alpha, stump in zip(self.alphas, self.stumps)
        )
        return np.where(pred_sum >= 0, 1, 0)


# Test
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

ada = AdaBoostScratch(n_estimators=100)
ada.fit(X_train, y_train)
y_pred = ada.predict(X_test)
print(f"AdaBoost (scratch) accuracy: {accuracy_score(y_test, y_pred):.4f}")

# Compare with sklearn
from sklearn.ensemble import AdaBoostClassifier
ada_sk = AdaBoostClassifier(n_estimators=100, random_state=42)
ada_sk.fit(X_train, y_train)
print(f"AdaBoost (sklearn) accuracy: {ada_sk.score(X_test, y_test):.4f}")

Gradient Boosting Machines (GBM)

The Key Insight

AdaBoost re-weights samples. Gradient boosting generalizes this by fitting each new tree to the negative gradient of the loss function — the pseudo-residuals.

For squared error loss L(y,F)=12(yF)2:

LF=yF(x)

The negative gradient is just the residual. So each new tree fits the residuals of the previous ensemble.

Worked Example — Gradient Boosting Residuals (3 Iterations)

Dataset: predict house prices. Learning rate eta = 0.5.

Housey (true)F0 (mean)Residual 1h1 predF1Residual 2h2 predF2
A200300-100-100250-50-50275
B3003000030000300
C4003001001003505050325

Step 1: Initialize F0 = mean(y) = (200+300+400)/3 = 300

Step 2: Iteration 1 — compute residuals and fit tree h1 Residuals = y - F0 = [-100, 0, 100] Tree h1 perfectly fits residuals: h1 = [-100, 0, 100] F1 = F0 + eta * h1 = [300+0.5*(-100), 300+0, 300+0.5*100] = [250, 300, 350]

Step 3: Iteration 2 — compute new residuals Residuals = y - F1 = [200-250, 300-300, 400-350] = [-50, 0, 50] Tree h2 fits: [-50, 0, 50] F2 = F1 + eta * h2 = [250-25, 300, 350+25] = [225, 300, 375]

Step 4: After many iterations, F converges to y

Interpret: "Each iteration halves the remaining error (because eta=0.5). After 2 rounds: House A error went from 100 to 75 to ~56. The learning rate prevents overfitting by taking small steps. Without it (eta=1), we'd overfit in one step."

GBM Derivation

Given a differentiable loss function L(y,F(x)):

  1. Initialize: F0(x)=argminci=1nL(yi,c)

  2. For t=1,,T:

    • Compute pseudo-residuals: rit=L(yi,Ft1(xi))Ft1(xi)
    • Fit a tree ht to pseudo-residuals: ht(x)rit
    • Find optimal step size: ρt=argminρi=1nL(yi,Ft1(xi)+ρht(xi))
    • Update: Ft(x)=Ft1(x)+ηρtht(x)

where η is the learning rate (shrinkage).

python
# gbm_scratch.py — Gradient Boosting from scratch for regression
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

class GradientBoostingRegressorScratch:
    """Gradient boosting regressor from scratch (squared error loss)."""

    def __init__(self, n_estimators=100, learning_rate=0.1, max_depth=3):
        self.n_estimators = n_estimators
        self.lr = learning_rate
        self.max_depth = max_depth
        self.trees = []
        self.initial_prediction = None

    def fit(self, X, y):
        # Step 1: Initialize with mean
        self.initial_prediction = np.mean(y)
        F = np.full(len(y), self.initial_prediction)

        self.trees = []
        for t in range(self.n_estimators):
            # Step 2: Compute pseudo-residuals (negative gradient of MSE)
            residuals = y - F

            # Step 3: Fit tree to residuals
            tree = DecisionTreeRegressor(max_depth=self.max_depth)
            tree.fit(X, residuals)

            # Step 4: Update predictions
            F += self.lr * tree.predict(X)
            self.trees.append(tree)

            if (t + 1) % 20 == 0:
                mse = mean_squared_error(y, F)
                print(f"  Iteration {t+1}: train MSE = {mse:.4f}")

        return self

    def predict(self, X):
        F = np.full(X.shape[0], self.initial_prediction)
        for tree in self.trees:
            F += self.lr * tree.predict(X)
        return F


# Test on California Housing
housing = fetch_california_housing()
X, y = housing.data, housing.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Training from-scratch GBM:")
gbm = GradientBoostingRegressorScratch(n_estimators=100, learning_rate=0.1, max_depth=4)
gbm.fit(X_train, y_train)

y_pred = gbm.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"\nFrom-scratch RMSE: {rmse:.4f}")

# Compare with sklearn
from sklearn.ensemble import GradientBoostingRegressor
gbm_sk = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
                                    max_depth=4, random_state=42)
gbm_sk.fit(X_train, y_train)
y_pred_sk = gbm_sk.predict(X_test)
rmse_sk = np.sqrt(mean_squared_error(y_test, y_pred_sk))
print(f"sklearn RMSE:      {rmse_sk:.4f}")

XGBoost Internals

What XGBoost Adds to GBM

  1. Regularized objective: Adds L1 and L2 penalties on leaf weights
  2. Second-order approximation: Uses both gradient and Hessian
  3. Histogram-based splitting: Bins continuous features for speed
  4. Column subsampling: Like random forests, samples features
  5. Sparsity-aware: Handles missing values natively
  6. Parallel tree construction: Parallelizes feature-level splitting

Regularized Objective

XGBoost minimizes:

Obj=i=1nL(yi,y^i)+t=1TΩ(ft)

where the regularization term is:

Ω(f)=γT+12λj=1Twj2
  • T = number of leaves
  • wj = weight (prediction) in leaf j
  • γ = penalty per leaf (controls tree complexity)
  • λ = L2 regularization on leaf weights

Second-Order Taylor Expansion

At iteration t, XGBoost approximates the loss with a second-order Taylor expansion:

Obj(t)i=1n[gift(xi)+12hift(xi)2]+Ω(ft)+const

where:

  • gi=L(yi,y^i(t1))y^i(t1) (gradient)
  • hi=2L(yi,y^i(t1))(y^i(t1))2 (Hessian)

The optimal weight for leaf j is:

wj=iIjgiiIjhi+λ

The optimal split gain is:

Gain=12[(iILgi)2iILhi+λ+(iIRgi)2iIRhi+λ(iIgi)2iIhi+λ]γ
Worked Example — XGBoost Split Gain

6 samples with MSE loss (g_i = -(y_i - y_hat), h_i = 1 for all), lambda=1, gamma=0.5:

Sampleyy_hatg_i (gradient)h_i (hessian)
132.5-0.51
252.5-2.51
342.5-1.51
412.51.51
522.50.51
602.52.51

Split: samples {1,2,3} go left, {4,5,6} go right.

Step 1: Compute sums Left: sum(g_L) = -0.5 + (-2.5) + (-1.5) = -4.5, sum(h_L) = 3 Right: sum(g_R) = 1.5 + 0.5 + 2.5 = 4.5, sum(h_R) = 3 All: sum(g) = -4.5 + 4.5 = 0.0, sum(h) = 6

Step 2: Compute each term Left score = (-4.5)^2 / (3 + 1) = 20.25 / 4 = 5.0625 Right score = (4.5)^2 / (3 + 1) = 20.25 / 4 = 5.0625 Parent score = (0.0)^2 / (6 + 1) = 0 / 7 = 0.0

Step 3: Compute gain Gain = 0.5 * (5.0625 + 5.0625 - 0.0) - 0.5 = 0.5 * 10.125 - 0.5 = 5.0625 - 0.5 = 4.5625

Step 4: Optimal leaf weights w_L = -(-4.5)/(3+1) = 4.5/4 = 1.125 w_R = -(4.5)/(3+1) = -4.5/4 = -1.125

Interpret: "The gain of 4.56 exceeds gamma=0.5, so this split is worthwhile. Left leaf predicts +1.125 (high values), right leaf predicts -1.125 (low values)."

python
# xgboost_detailed.py — XGBoost with detailed configuration
import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np

housing = fetch_california_housing()
X, y = housing.data, housing.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# Detailed parameter configuration
params = {
    # Booster parameters
    'booster': 'gbtree',          # tree-based boosting
    'eta': 0.1,                   # learning rate
    'max_depth': 6,               # tree depth
    'min_child_weight': 1,        # min sum of hessian in child

    # Regularization
    'gamma': 0.1,                 # min loss reduction for split
    'lambda': 1.0,                # L2 regularization
    'alpha': 0.0,                 # L1 regularization

    # Sampling
    'subsample': 0.8,             # row sampling
    'colsample_bytree': 0.8,     # column sampling per tree
    'colsample_bylevel': 1.0,    # column sampling per level

    # Objective
    'objective': 'reg:squarederror',
    'eval_metric': 'rmse',

    # Performance
    'tree_method': 'hist',        # histogram-based (fast)
    'seed': 42,
}

evals = [(dtrain, 'train'), (dtest, 'test')]
model = xgb.train(
    params, dtrain,
    num_boost_round=500,
    evals=evals,
    early_stopping_rounds=20,
    verbose_eval=50
)

y_pred = model.predict(dtest)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"\nXGBoost RMSE: {rmse:.4f}")
print(f"Best iteration: {model.best_iteration}")

LightGBM Innovations

GOSS (Gradient-based One-Side Sampling)

Instead of using all data, GOSS keeps all instances with large gradients (which contribute most to information gain) and randomly samples instances with small gradients. This speeds up training while maintaining accuracy.

EFB (Exclusive Feature Bundling)

Many features are mutually exclusive (never non-zero simultaneously). EFB bundles these features together, reducing the effective number of features.

Leaf-Wise vs Level-Wise Growth

  • Level-wise (XGBoost default): Grows tree layer by layer
  • Leaf-wise (LightGBM): Grows the leaf with maximum delta loss, creating asymmetric trees that can be more accurate
python
# lightgbm_detailed.py — LightGBM with tuning
import lightgbm as lgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error
import numpy as np

housing = fetch_california_housing()
X, y = housing.data, housing.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# LightGBM native API
dtrain = lgb.Dataset(X_train, label=y_train)
dtest = lgb.Dataset(X_test, label=y_test, reference=dtrain)

params = {
    'objective': 'regression',
    'metric': 'rmse',
    'boosting_type': 'gbdt',     # or 'dart', 'goss'
    'num_leaves': 31,            # main complexity control
    'learning_rate': 0.1,
    'feature_fraction': 0.8,     # colsample_bytree
    'bagging_fraction': 0.8,     # subsample
    'bagging_freq': 5,           # bagging every 5 iterations
    'min_child_samples': 20,
    'lambda_l1': 0.1,
    'lambda_l2': 1.0,
    'verbose': -1,
    'seed': 42,
}

callbacks = [
    lgb.early_stopping(20),
    lgb.log_evaluation(50),
]

model = lgb.train(
    params, dtrain,
    num_boost_round=500,
    valid_sets=[dtrain, dtest],
    valid_names=['train', 'test'],
    callbacks=callbacks,
)

y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"\nLightGBM RMSE: {rmse:.4f}")
print(f"Best iteration: {model.best_iteration}")

# Feature importance
importance = model.feature_importance(importance_type='gain')
for name, imp in sorted(zip(housing.feature_names, importance), key=lambda x: -x[1])[:5]:
    print(f"  {name}: {imp:.1f}")

CatBoost: Ordered Boosting

Prediction Shift Problem

Standard boosting has a subtle bias: the model used to compute residuals was trained on the same data. CatBoost's ordered boosting solves this by maintaining a separate model for computing residuals for each training example, using only the examples that came before it in a random permutation.

Native Categorical Handling

CatBoost computes ordered target statistics for categorical features:

x^ki=j:π(j)<π(i),xjk=xikyj+aPj:π(j)<π(i),xjk=xik1+a

where π is a random permutation, a is a smoothing parameter, and P is the prior (target mean).

python
# catboost_detailed.py — CatBoost with categorical features
from catboost import CatBoostRegressor, Pool
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np
import pandas as pd

housing = fetch_california_housing(as_frame=True)
df = housing.frame

# Add categorical features for demonstration
df['income_bin'] = pd.cut(df['MedInc'], bins=5, labels=['very_low', 'low', 'mid', 'high', 'very_high'])
df['age_bin'] = pd.cut(df['HouseAge'], bins=4, labels=['new', 'recent', 'old', 'very_old'])

X = df.drop('MedHouseVal', axis=1)
y = df['MedHouseVal']

cat_features = ['income_bin', 'age_bin']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = CatBoostRegressor(
    iterations=500,
    depth=6,
    learning_rate=0.1,
    cat_features=cat_features,
    l2_leaf_reg=3.0,
    random_seed=42,
    verbose=100,
    early_stopping_rounds=20,
)

model.fit(X_train, y_train, eval_set=(X_test, y_test))

y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"\nCatBoost RMSE: {rmse:.4f}")

Head-to-Head Benchmark

python
# benchmark.py — Comprehensive comparison
import numpy as np
import time
from sklearn.datasets import fetch_california_housing, load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier
from sklearn.metrics import mean_squared_error, accuracy_score
import xgboost as xgb
import lightgbm as lgb
from catboost import CatBoostRegressor, CatBoostClassifier

# --- Regression benchmark ---
housing = fetch_california_housing()
X, y = housing.data, housing.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

reg_models = {
    'sklearn GBM': GradientBoostingRegressor(
        n_estimators=200, max_depth=5, learning_rate=0.1, random_state=42),
    'XGBoost': xgb.XGBRegressor(
        n_estimators=200, max_depth=5, learning_rate=0.1,
        random_state=42, eval_metric='rmse'),
    'LightGBM': lgb.LGBMRegressor(
        n_estimators=200, max_depth=5, learning_rate=0.1,
        random_state=42, verbose=-1),
    'CatBoost': CatBoostRegressor(
        iterations=200, depth=5, learning_rate=0.1,
        random_seed=42, verbose=0),
}

print("=== Regression: California Housing ===")
print(f"{'Model':<20} {'RMSE':>10} {'Train Time':>12}")
print("-" * 44)

for name, model in reg_models.items():
    start = time.time()
    model.fit(X_train, y_train)
    train_time = time.time() - start
    y_pred = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    print(f"{name:<20} {rmse:>10.4f} {train_time:>12.3f}s")

# --- Classification benchmark ---
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

clf_models = {
    'sklearn GBM': GradientBoostingClassifier(
        n_estimators=200, max_depth=5, learning_rate=0.1, random_state=42),
    'XGBoost': xgb.XGBClassifier(
        n_estimators=200, max_depth=5, learning_rate=0.1,
        random_state=42, eval_metric='logloss'),
    'LightGBM': lgb.LGBMClassifier(
        n_estimators=200, max_depth=5, learning_rate=0.1,
        random_state=42, verbose=-1),
    'CatBoost': CatBoostClassifier(
        iterations=200, depth=5, learning_rate=0.1,
        random_seed=42, verbose=0),
}

print("\n=== Classification: Breast Cancer ===")
print(f"{'Model':<20} {'Accuracy':>10} {'Train Time':>12}")
print("-" * 44)

for name, model in clf_models.items():
    start = time.time()
    model.fit(X_train, y_train)
    train_time = time.time() - start
    acc = model.score(X_test, y_test)
    print(f"{name:<20} {acc:>10.4f} {train_time:>12.3f}s")

When to Use Which

LibraryBest ForKey AdvantageWeakness
sklearn GBMSmall data, learningSimple APISlow, no GPU
XGBoostGeneral purpose, competitionsWell-tuned defaults, regularizationSlower than LightGBM
LightGBMLarge data (>100K rows)Fastest trainingCan overfit with small data
CatBoostCategorical featuresNo encoding needed, robustSlower than LightGBM

Quick Selection Rule

  1. Have categorical features? Start with CatBoost
  2. Very large dataset? Start with LightGBM
  3. Competition / need best accuracy? Try all three, ensemble top two
  4. Learning / prototyping? Start with sklearn GBM

Further Reading

"What I cannot create, I do not understand." — Richard Feynman