Skip to content
Unverified — AI-generated content. Help verify this page

Ensemble Methods

An ensemble combines multiple models to produce better predictions than any individual model. The idea is old — the "wisdom of crowds" principle — but the mathematics behind why it works is rigorous and powerful. Ensemble methods have dominated machine learning competitions for a decade and remain the go-to approach for production tabular data models.

Why Ensembles Work

The Diversity Theorem

For regression with M models, the ensemble's squared error can be decomposed as:

MSEensemble=MSEdiversity

where MSE is the average individual error and diversity measures how differently the models err. More diversity = lower ensemble error.

Mathematical Proof for Averaging

Consider M independent models, each with error ϵm where E[ϵm]=0 and Var(ϵm)=σ2.

The ensemble prediction is f^ens=1Mm=1Mf^m, so the ensemble error is:

ϵens=1Mm=1MϵmVar(ϵens)=1M2m=1MVar(ϵm)=σ2M
Worked Example — Ensemble Variance Reduction

5 independent models, each with error variance sigma^2 = 0.20:

Step 1: Single model variance Var(epsilon_1) = 0.20

Step 2: Ensemble of M=5 (average prediction) Var(epsilon_ens) = sigma^2 / M = 0.20 / 5 = 0.04

Step 3: Ensemble of M=20 Var(epsilon_ens) = 0.20 / 20 = 0.01

Step 4: With correlated models (rho=0.5, M=5) Var = rhosigma^2 + (1-rho)/M * sigma^2 = 0.50.20 + (0.5/5)*0.20 = 0.10 + 0.02 = 0.12

Interpret: "With 5 independent models, variance drops from 0.20 to 0.04 (5x reduction). But if models are correlated (rho=0.5), variance only drops to 0.12 (1.7x). As M goes to infinity, the variance floor is rho*sigma^2 = 0.10 — you can never get below this with correlated models. Diversity is critical."

The variance drops by a factor of M. But this assumes independence — which is never perfectly true.

With Correlated Models

If each pair of errors has correlation ρ:

Var(ϵens)=1M2(Mσ2+M(M1)ρσ2)=ρσ2+1ρMσ2

As M: Varρσ2.

Conclusion: The ensemble variance is bounded by the correlation between models. Diversity is crucial — uncorrelated errors cancel; correlated errors do not.


Bagging (Bootstrap Aggregating)

Algorithm

  1. Create M bootstrap samples from the training data (sample n points with replacement)
  2. Train one model on each bootstrap sample
  3. Aggregate: Average (regression) or majority vote (classification)

Why Bootstrap Introduces Diversity

Each bootstrap sample contains approximately 63.2% of unique training points (the rest are duplicates). The remaining 36.8% are "out-of-bag" (OOB) samples:

P(point not in bootstrap)=(11n)ne10.368
Worked Example — Bootstrap OOB Probability

Dataset with n=5 samples. What fraction is OOB for each tree?

Step 1: Probability a specific point is NOT picked in one draw P(miss) = 1 - 1/5 = 4/5 = 0.8

Step 2: Probability NOT picked in any of n=5 draws (with replacement) P(OOB) = (4/5)^5 = 0.8^5 = 0.32768

Step 3: For larger n n=10: P(OOB) = (9/10)^10 = 0.349 n=100: P(OOB) = (99/100)^100 = 0.366 n=1000: P(OOB) = (999/1000)^1000 = 0.368

Step 4: Unique samples in bootstrap Expected unique = n * (1 - P(OOB)) For n=1000: 1000 * (1 - 0.368) = 632 unique samples

Interpret: "About 36.8% of data is left out of each bootstrap sample (OOB). These OOB samples serve as free validation data. For n=1000, each tree trains on ~632 unique samples and is validated on ~368."

Different models see different subsets, creating diversity.

Bagging From Scratch

python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score

class BaggingFromScratch:
    """Bagging classifier from scratch."""

    def __init__(self, base_estimator=None, n_estimators=50, random_state=42):
        self.base_estimator = base_estimator or DecisionTreeClassifier()
        self.n_estimators = n_estimators
        self.random_state = random_state

    def fit(self, X, y):
        self.models_ = []
        self.oob_predictions_ = np.zeros((len(X), len(np.unique(y))))
        self.oob_counts_ = np.zeros(len(X))
        rng = np.random.RandomState(self.random_state)

        for i in range(self.n_estimators):
            # Bootstrap sample
            idx = rng.choice(len(X), size=len(X), replace=True)
            oob_idx = np.setdiff1d(np.arange(len(X)), np.unique(idx))

            X_boot, y_boot = X[idx], y[idx]

            # Train model
            from sklearn.base import clone
            model = clone(self.base_estimator)
            model.set_params(random_state=self.random_state + i)
            model.fit(X_boot, y_boot)
            self.models_.append(model)

            # OOB predictions
            if len(oob_idx) > 0:
                oob_proba = model.predict_proba(X[oob_idx])
                self.oob_predictions_[oob_idx] += oob_proba
                self.oob_counts_[oob_idx] += 1

        # OOB score
        valid = self.oob_counts_ > 0
        oob_pred = np.argmax(self.oob_predictions_[valid] /
                              self.oob_counts_[valid, np.newaxis], axis=1)
        self.oob_score_ = np.mean(oob_pred == y[valid])

        return self

    def predict(self, X):
        """Majority vote."""
        predictions = np.array([m.predict(X) for m in self.models_])
        from scipy.stats import mode
        return mode(predictions, axis=0, keepdims=False).mode

    def predict_proba(self, X):
        """Average probabilities."""
        probas = np.array([m.predict_proba(X) for m in self.models_])
        return probas.mean(axis=0)


# Compare single tree vs bagging
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

tree = DecisionTreeClassifier(random_state=42)
scores_tree = cross_val_score(tree, X, y, cv=5, scoring='accuracy')
print(f"Single Decision Tree: {scores_tree.mean():.4f} +/- {scores_tree.std():.4f}")

bagging = BaggingFromScratch(n_estimators=100, random_state=42)
bagging.fit(X, y)
print(f"Bagging (100 trees) OOB: {bagging.oob_score_:.4f}")

Random Forest = Bagging + Feature Subsampling

Random Forest adds another source of diversity: at each split, only a random subset of features is considered:

mtry=d (classification)ormtry=d/3 (regression)

This decorrelates the trees (reduces ρ), which further reduces ensemble variance.


Boosting

Core Idea

Instead of training models independently (bagging), boosting trains them sequentially. Each new model focuses on the mistakes of the previous ensemble.

AdaBoost Algorithm

Given data (x1,y1),,(xn,yn) where yi{1,+1}:

  1. Initialize weights: wi(1)=1n

  2. For t=1,,T: a. Train weak learner ht on weighted data b. Compute weighted error: ϵt=i:ht(xi)yiwi(t) c. Compute model weight: αt=12ln1ϵtϵt d. Update sample weights: wi(t+1)=wi(t)exp(αtyiht(xi)) e. Normalize weights: wi(t+1)=wi(t+1)jwj(t+1)

  3. Final prediction: H(x)=sign(t=1Tαtht(x))

Key insight: Misclassified samples get higher weights, so the next learner focuses on them.

Gradient Boosting

Generalizes boosting to any differentiable loss function. Instead of reweighting samples, fit each new model to the negative gradient (pseudo-residuals) of the loss:

For loss function L(y,F(x)) and current ensemble Fm(x):

  1. Compute pseudo-residuals: rim=L(yi,F(xi))F(xi)|F=Fm
  2. Fit a weak learner hm to the pseudo-residuals
  3. Update: Fm+1(x)=Fm(x)+ηhm(x) where η is the learning rate

For MSE loss: L=12(yF)2, the pseudo-residual is ri=yiFm(xi) — the actual residual. Hence the name "gradient boosting on residuals."

Gradient Boosting From Scratch

python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

class GradientBoostingFromScratch:
    """Gradient boosting for regression (MSE loss)."""

    def __init__(self, n_estimators=100, learning_rate=0.1, max_depth=3,
                 random_state=42):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.max_depth = max_depth
        self.random_state = random_state

    def fit(self, X, y):
        # Initialize with mean
        self.init_prediction_ = y.mean()
        self.trees_ = []

        F = np.full(len(y), self.init_prediction_)
        self.train_losses_ = []

        for i in range(self.n_estimators):
            # Pseudo-residuals (negative gradient of MSE)
            residuals = y - F

            # Fit tree to residuals
            tree = DecisionTreeRegressor(max_depth=self.max_depth,
                                         random_state=self.random_state + i)
            tree.fit(X, residuals)
            self.trees_.append(tree)

            # Update predictions
            update = self.learning_rate * tree.predict(X)
            F += update

            # Track loss
            mse = np.mean((y - F) ** 2)
            self.train_losses_.append(mse)

        return self

    def predict(self, X):
        F = np.full(X.shape[0], self.init_prediction_)
        for tree in self.trees_:
            F += self.learning_rate * tree.predict(X)
        return F


# Demo
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

diabetes = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    diabetes.data, diabetes.target, test_size=0.2, random_state=42
)

gb = GradientBoostingFromScratch(n_estimators=200, learning_rate=0.1, max_depth=3)
gb.fit(X_train, y_train)

y_pred = gb.predict(X_test)
mse = np.mean((y_test - y_pred) ** 2)
print(f"Gradient Boosting (scratch) MSE: {mse:.2f}")

import matplotlib.pyplot as plt
plt.figure(figsize=(10, 5))
plt.plot(gb.train_losses_, 'b-')
plt.xlabel('Number of Trees')
plt.ylabel('Training MSE')
plt.title('Gradient Boosting Training Loss')
plt.grid(True, alpha=0.3)
plt.savefig('gb_training_loss.png', dpi=150, bbox_inches='tight')
plt.show()

Bagging vs Boosting

AspectBaggingBoosting
TrainingParallel (independent models)Sequential (dependent models)
What it reducesVarianceBias (and variance)
Overfitting riskLow (more trees rarely hurts)Higher (more trees can overfit)
Best base learnerHigh-variance (deep trees)High-bias (shallow stumps)
SpeedParallelizableSequential
Robustness to noiseMore robustSensitive (upweights noisy samples)
Typical algorithmsRandom Forest, Extra TreesAdaBoost, XGBoost, LightGBM, CatBoost

Stacking (Stacked Generalization)

Architecture

Level 0 (Base Models): Train diverse models on the training data Level 1 (Meta-Learner): Train a model on the base models' out-of-fold predictions

The key insight: base models make different kinds of errors. The meta-learner learns when to trust which model.

Stacking From Scratch

python
from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone
import numpy as np

class StackingFromScratch:
    """Two-level stacking classifier."""

    def __init__(self, base_models, meta_model, cv=5):
        self.base_models = base_models
        self.meta_model = meta_model
        self.cv = cv

    def fit(self, X, y):
        n_samples = len(y)
        n_models = len(self.base_models)
        self.fitted_base_models_ = []

        # Generate out-of-fold predictions for meta-features
        meta_features = np.zeros((n_samples, n_models))
        kf = StratifiedKFold(n_splits=self.cv, shuffle=True, random_state=42)

        for model_idx, (name, model) in enumerate(self.base_models):
            fold_models = []
            for train_idx, val_idx in kf.split(X, y):
                m = clone(model)
                m.fit(X[train_idx], y[train_idx])
                meta_features[val_idx, model_idx] = m.predict_proba(X[val_idx])[:, 1]
                fold_models.append(m)
            self.fitted_base_models_.append(fold_models)

        # Refit base models on full data for test-time predictions
        self.full_base_models_ = []
        for name, model in self.base_models:
            m = clone(model)
            m.fit(X, y)
            self.full_base_models_.append(m)

        # Train meta-learner on out-of-fold predictions
        self.meta_model_ = clone(self.meta_model)
        self.meta_model_.fit(meta_features, y)

        return self

    def predict(self, X):
        meta_features = np.column_stack([
            m.predict_proba(X)[:, 1] for m in self.full_base_models_
        ])
        return self.meta_model_.predict(meta_features)

    def predict_proba(self, X):
        meta_features = np.column_stack([
            m.predict_proba(X)[:, 1] for m in self.full_base_models_
        ])
        return self.meta_model_.predict_proba(meta_features)


# ---- Demo ----
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

base_models = [
    ('lr', make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
    ('gb', GradientBoostingClassifier(n_estimators=100, random_state=42)),
    ('svm', make_pipeline(StandardScaler(), SVC(probability=True, random_state=42))),
]

meta = LogisticRegression(max_iter=1000)

stacker = StackingFromScratch(base_models, meta, cv=5)
stacker.fit(X, y)

# Evaluate with fresh CV
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import StackingClassifier

stack_sklearn = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
    passthrough=False
)

scores_stack = cross_val_score(stack_sklearn, X, y, cv=5, scoring='accuracy')
print(f"Stacking: {scores_stack.mean():.4f} +/- {scores_stack.std():.4f}")

# Compare with individual models
for name, model in base_models:
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    print(f"{name:>5}: {scores.mean():.4f} +/- {scores.std():.4f}")

Diversity Analysis

Diversity between models is critical for ensemble success. Measure it:

python
from sklearn.metrics import cohen_kappa_score
import itertools

def ensemble_diversity(X, y, models):
    """Analyze diversity between ensemble members."""
    predictions = {}
    for name, model in models:
        m = clone(model)
        m.fit(X, y)
        predictions[name] = m.predict(X)

    # Pairwise disagreement rate
    print("Pairwise Disagreement Rate:")
    for (n1, p1), (n2, p2) in itertools.combinations(predictions.items(), 2):
        disagreement = np.mean(p1 != p2)
        kappa = cohen_kappa_score(p1, p2)
        print(f"  {n1} vs {n2}: disagreement={disagreement:.3f}, kappa={kappa:.3f}")

    # Error correlation
    errors = {name: (pred != y).astype(float) for name, pred in predictions.items()}
    print("\nError Correlation Matrix:")
    names = list(errors.keys())
    for n1 in names:
        corrs = [np.corrcoef(errors[n1], errors[n2])[0, 1] for n2 in names]
        print(f"  {n1:>5}: {[f'{c:.3f}' for c in corrs]}")

ensemble_diversity(X, y, base_models)

Kaggle Competition Strategies

1. The Stacking Recipe

Level 0 (diverse base models):
  - LightGBM (3-5 different configs)
  - XGBoost (3-5 different configs)
  - CatBoost (2-3 configs)
  - Neural Network (1-2)
  - ExtraTrees / Random Forest (1-2)

Level 1 (meta-learner):
  - Logistic Regression (simple, less overfitting)
  - OR LightGBM with few trees and heavy regularization

Level 2 (optional):
  - Simple average of Level 1 predictions

2. Feature-Based Diversity

python
# Train different models on different feature subsets
features_v1 = X[:, :15]   # First half of features
features_v2 = X[:, 15:]   # Second half
features_v3 = X            # All features

# Each model sees different information -> different errors

3. Seed Averaging

The cheapest ensemble — same model, different random seeds:

python
from sklearn.ensemble import GradientBoostingClassifier

predictions = []
for seed in range(10):
    gb = GradientBoostingClassifier(n_estimators=100, random_state=seed)
    gb.fit(X, y)
    predictions.append(gb.predict_proba(X)[:, 1])

ensemble_pred = np.mean(predictions, axis=0)
# Reduces variance with zero effort

4. The Power of Simple Averaging

python
# Often, a simple average of good models beats complex stacking
def weighted_average_ensemble(predictions, weights=None):
    """Weighted average of model predictions."""
    if weights is None:
        weights = np.ones(len(predictions)) / len(predictions)
    return np.average(predictions, axis=0, weights=weights)

Comparison Summary

MethodReducesTrainingDiversity SourceRisk
BaggingVarianceParallelBootstrap samplingLow overfitting
Random ForestVarianceParallelBootstrap + feature subsamplingLow overfitting
AdaBoostBiasSequentialReweighting samplesCan overfit noisy data
Gradient BoostingBias + VarianceSequentialFitting residualsCan overfit
StackingBothTwo-stageDifferent algorithm familiesMeta-learner overfitting
VotingVarianceParallelDifferent algorithmsWeakest model may hurt
Seed averagingVarianceParallelRandom initializationMinimal benefit

Key Takeaways

ConceptRemember
Ensemble error = avg error - diversityDiversity is as important as individual accuracy
Bagging reduces variance, boosting reduces biasChoose based on your error diagnosis
Random Forest decorrelates trees via feature subsamplingmtry=d for classification
Gradient boosting fits residuals sequentiallyLearning rate controls the contribution of each tree
Stacking uses meta-learner on base predictionsOut-of-fold predictions prevent leakage
Correlation between models limits ensemble gainVarensρσ2 as M
Simple averaging often beats complex stackingStart simple, add complexity only if it helps
Kaggle winners stack 10+ diverse modelsBut production systems favor single models for simplicity

"What I cannot create, I do not understand." — Richard Feynman