Evaluation Metrics

Choosing the wrong metric is as dangerous as choosing the wrong model. A model that achieves 99% accuracy on an imbalanced dataset may be useless, and a regression model with low RMSE may still miss the trend. This page covers the core classification and regression metrics, with the math behind each and guidance on when to use it.

Classification Metrics

The Confusion Matrix

All binary classification metrics derive from four counts:

	Predicted Positive	Predicted Negative
Actually Positive	True Positive (TP)	False Negative (FN)
Actually Negative	False Positive (FP)	True Negative (TN)

python

# confusion_matrix.py — Building the confusion matrix
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 0, 0, 0, 1, 0])

cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

print(f"True Positives  (TP): {tp}")
print(f"False Positives (FP): {fp}")
print(f"False Negatives (FN): {fn}")
print(f"True Negatives  (TN): {tn}")

Accuracy

Accuracy = \frac{T P + T N}{T P + T N + F P + F N}

When to use: Balanced classes only.

When NOT to use: Imbalanced data. If 99% of examples are negative, a model that predicts all negative scores 99% accuracy yet detects 0% of the positives.

Precision

Precision = \frac{T P}{T P + F P}

"Of all the items I predicted positive, how many were actually positive?"

When to use: When false positives are costly (spam filter — do not put legitimate email in spam).

Recall (Sensitivity, True Positive Rate)

Recall = \frac{T P}{T P + F N}

"Of all the actually positive items, how many did I catch?"

When to use: When false negatives are costly (disease screening — do not miss a sick patient).

F1 Score

The harmonic mean of precision and recall:

F_{1} = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall} = \frac{2 T P}{2 T P + F P + F N}

Worked Example — Precision, Recall, and F1

Confusion matrix for a disease screening test (200 patients):

	Predicted Sick	Predicted Healthy
Actually Sick	TP = 40	FN = 10
Actually Healthy	FP = 20	TN = 130

Step 1: Accuracy Accuracy = (40 + 130) / (40 + 130 + 20 + 10) = 170/200 = 0.850

Step 2: Precision ("Of those I flagged as sick, how many truly are?") Precision = TP / (TP + FP) = 40 / (40 + 20) = 40/60 = 0.667

Step 3: Recall ("Of all sick patients, how many did I catch?") Recall = TP / (TP + FN) = 40 / (40 + 10) = 40/50 = 0.800

Step 4: F1 Score F1 = 2 * (0.667 * 0.800) / (0.667 + 0.800) = 2 * 0.534 / 1.467 = 1.067 / 1.467 = 0.727

Interpret: "The model catches 80% of sick patients (good recall) but 33% of its positive predictions are wrong (precision = 66.7%). The F1 of 0.727 balances both. For disease screening, we'd want higher recall even at the cost of precision."
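
The arithmetic in this worked example can be checked directly from the four counts. The following is a minimal sketch in plain Python; the variable names are illustrative and not part of any library.

```python
# Counts from the disease screening example (200 patients)
tp, fn, fp, tn = 40, 10, 20, 130

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)          # of those flagged sick, how many truly are
recall = tp / (tp + fn)             # of all sick patients, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1:        {f1:.3f}")
```

Computing F1 from the unrounded precision and recall avoids the small rounding drift that appears when the intermediate values are truncated to three decimals.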

Why harmonic mean? It penalizes extreme imbalances. If precision = 1.0 and recall = 0.01, then:

Arithmetic mean = 0.505 (misleadingly high)
Harmonic mean = 0.0198 (correctly low)
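
The comparison above is easy to reproduce. This sketch assumes two small helper functions (hypothetical names, written here only for illustration) and evaluates both means for precision = 1.0 and recall = 0.01.

```python
def arithmetic_mean(p, r):
    """Plain average of precision and recall."""
    return (p + r) / 2


def harmonic_mean(p, r):
    """Harmonic mean of precision and recall (this is F1); 0 if both are 0."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0


precision, recall = 1.0, 0.01
am = arithmetic_mean(precision, recall)
hm = harmonic_mean(precision, recall)

print(f"Arithmetic mean: {am:.4f}")
print(f"Harmonic mean:   {hm:.4f}")
```

The harmonic mean stays close to the smaller of the two values, so a model cannot earn a good F1 by excelling at only one of precision or recall.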

F-beta Score

Generalization that lets you weight precision vs recall:

F_{β} = (1 + β^{2}) \cdot \frac{Precision \cdot Recall}{β^{2} \cdot Precision + Recall}

$β = 1$ : Standard F1 (equal weight)
$β = 0.5$ : Weights precision more
$β = 2$ : Weights recall more
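
To see the effect of $β$ concretely, the sketch below applies the formula to the disease screening example (precision = 40/60, recall = 40/50). The function `f_beta` is a hypothetical helper written for illustration; it is not the scikit-learn API.

```python
def f_beta(precision, recall, beta):
    """F-beta from precision and recall; returns 0 when the denominator is 0."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom > 0 else 0.0


precision, recall = 40 / 60, 40 / 50

f_half = f_beta(precision, recall, beta=0.5)  # leans toward precision
f_one = f_beta(precision, recall, beta=1.0)   # standard F1
f_two = f_beta(precision, recall, beta=2.0)   # leans toward recall

print(f"F0.5 = {f_half:.3f}, F1 = {f_one:.3f}, F2 = {f_two:.3f}")
```

Because recall is higher than precision in this example, the score rises as $β$ grows and more weight shifts toward recall.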

python

# classification_metrics.py — All classification metrics
import numpy as np
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    fbeta_score, matthews_corrcoef, log_loss, roc_auc_score,
    average_precision_score, classification_report
)

# Simulated predictions
np.random.seed(42)
y_true = np.array([1]*100 + [0]*900)  # 10% positive
y_pred = np.random.choice([0, 1], 1000, p=[0.85, 0.15])
y_proba = np.random.beta(2, 5, 1000)  # probability estimates
y_proba[y_true == 1] += 0.3  # make positive class have higher probabilities
y_proba = np.clip(y_proba, 0, 1)

print("=== Classification Metrics ===\n")
print(f"Accuracy:           {accuracy_score(y_true, y_pred):.4f}")
print(f"Precision:          {precision_score(y_true, y_pred):.4f}")
print(f"Recall:             {recall_score(y_true, y_pred):.4f}")
print(f"F1 Score:           {f1_score(y_true, y_pred):.4f}")
print(f"F0.5 (prec-heavy):  {fbeta_score(y_true, y_pred, beta=0.5):.4f}")
print(f"F2 (recall-heavy):  {fbeta_score(y_true, y_pred, beta=2):.4f}")
print(f"MCC:                {matthews_corrcoef(y_true, y_pred):.4f}")
print(f"ROC-AUC:            {roc_auc_score(y_true, y_proba):.4f}")
print(f"PR-AUC:             {average_precision_score(y_true, y_proba):.4f}")
print(f"Log Loss:           {log_loss(y_true, y_proba):.4f}")

Matthews Correlation Coefficient (MCC)

MCC uses all four confusion matrix values and works well even with imbalanced data:

MCC = \frac{T P \cdot T N - F P \cdot F N}{\sqrt{(T P + F P) (T P + F N) (T N + F P) (T N + F N)}}

Worked Example — MCC

Using the disease screening confusion matrix (TP=40, TN=130, FP=20, FN=10):

Step 1: Compute numerator TP*TN - FP*FN = 40*130 - 20*10 = 5200 - 200 = 5000

Step 2: Compute denominator (TP+FP) = 60, (TP+FN) = 50, (TN+FP) = 150, (TN+FN) = 140 sqrt(60 * 50 * 150 * 140) = sqrt(63,000,000) = 7937.3

Step 3: Compute MCC MCC = 5000 / 7937.3 = 0.630

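Compare with a "predict all healthy" model (TP=0, FP=0, FN=50, TN=150): Numerator = 0*150 - 0*50 = 0. The denominator is also 0 because TP + FP = 0, so MCC is formally undefined; by convention (and in scikit-learn's matthews_corrcoef) it is reported as 0, which correctly identifies this model as useless. Its accuracy, however, is 150/200 = 0.75 (misleadingly high).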

Interpret: "MCC of 0.630 indicates a moderately good classifier. Unlike accuracy (0.85), MCC correctly penalizes a model that just predicts the majority class."

MCC = +1: Perfect prediction
MCC = 0: Random prediction
MCC = -1: Total disagreement
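
The worked example can be verified from the raw counts. This is a minimal sketch; `mcc_from_counts` is a hypothetical helper, and returning 0 when the denominator is 0 is a convention assumed here rather than something the formula itself dictates.

```python
import math


def mcc_from_counts(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: an undefined MCC (zero denominator) is reported as 0
    return numerator / denominator if denominator > 0 else 0.0


mcc_screening = mcc_from_counts(tp=40, tn=130, fp=20, fn=10)
mcc_all_healthy = mcc_from_counts(tp=0, tn=150, fp=0, fn=50)

print(f"Screening model MCC:       {mcc_screening:.3f}")
print(f"Predict-all-healthy MCC:   {mcc_all_healthy:.3f}")
```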

python

# mcc.py — Why MCC is superior for imbalanced data
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

# Scenario: 99% negative, model predicts all negative
y_true = np.array([0]*990 + [1]*10)
y_pred_all_neg = np.zeros(1000, dtype=int)

print("Model: Predict all negative (on 99% negative data)")
print(f"  Accuracy: {accuracy_score(y_true, y_pred_all_neg):.4f}")  # 0.99 — misleading!
print(f"  F1:       {f1_score(y_true, y_pred_all_neg, zero_division=0):.4f}")  # 0.00, correct (zero_division=0 avoids the undefined-precision warning)
print(f"  MCC:      {matthews_corrcoef(y_true, y_pred_all_neg):.4f}")  # 0.00 — correct

# Model that catches 8 out of 10 positives
y_pred_good = np.zeros(1000, dtype=int)
y_pred_good[990:998] = 1  # catches 8 positives
y_pred_good[50:60] = 1    # 10 false positives

print("\nModel: Catches 8/10 positives with 10 false positives")
print(f"  Accuracy: {accuracy_score(y_true, y_pred_good):.4f}")
print(f"  F1:       {f1_score(y_true, y_pred_good):.4f}")
print(f"  MCC:      {matthews_corrcoef(y_true, y_pred_good):.4f}")

ROC-AUC

The Receiver Operating Characteristic curve plots True Positive Rate vs False Positive Rate at all thresholds.

TPR = \frac{T P}{T P + F N}, FPR = \frac{F P}{F P + T N}

AUC (Area Under the ROC Curve) measures the probability that a randomly chosen positive example is ranked higher than a randomly chosen negative example.

AUC = 1.0: Perfect
AUC = 0.5: Random
AUC < 0.5: Worse than random (flip predictions)
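
The ranking interpretation of AUC can be checked numerically. The sketch below uses synthetic scores (an assumption for illustration only), counts the fraction of positive/negative pairs in which the positive example scores higher, and compares that fraction with `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = np.array([0] * 50 + [1] * 20)
# Synthetic scores: positives are shifted upward by 1 on average
scores = rng.normal(loc=y_true.astype(float), scale=1.0)

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Compare every positive score with every negative score; ties count as half
diff = pos[:, None] - neg[None, :]
pairwise_auc = (diff > 0).mean() + 0.5 * (diff == 0).mean()

sk_auc = roc_auc_score(y_true, scores)

print(f"Pairwise ranking probability: {pairwise_auc:.4f}")
print(f"roc_auc_score:                {sk_auc:.4f}")
```

The two numbers agree up to floating-point error, which is why AUC depends only on how the scores rank the examples and not on their absolute values.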

python

# roc_auc.py — ROC curve and AUC
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import roc_curve, auc

data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    'Logistic Regression': make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000)),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
}

plt.figure(figsize=(8, 6))

for name, model in models.items():
    model.fit(X_train, y_train)
    y_proba = model.predict_proba(X_test)[:, 1]
    fpr, tpr, thresholds = roc_curve(y_test, y_proba)
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, linewidth=2, label=f'{name} (AUC={roc_auc:.3f})')

plt.plot([0, 1], [0, 1], 'k--', label='Random (AUC=0.5)')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('roc_curves.png', dpi=150)
plt.show()

PR-AUC (Precision-Recall AUC)

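For imbalanced datasets, PR-AUC is more informative than ROC-AUC. ROC-AUC can be misleadingly high when negatives vastly outnumber positives: the false positive rate divides by the large number of negatives, so even many false positives barely move it, whereas precision drops immediately.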

python

# pr_auc.py — Precision-Recall curve
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve, auc, average_precision_score

# Imbalanced dataset
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_proba = model.predict_proba(X_test)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
pr_auc = auc(recall, precision)
ap = average_precision_score(y_test, y_proba)

print(f"PR-AUC: {pr_auc:.4f}")
print(f"Average Precision: {ap:.4f}")

baseline = y_test.mean()
print(f"Baseline (random): {baseline:.4f}")

plt.figure(figsize=(8, 6))
plt.plot(recall, precision, 'b-', linewidth=2, label=f'Model (AP={ap:.3f})')
plt.axhline(y=baseline, color='r', linestyle='--', label=f'Baseline ({baseline:.3f})')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('pr_curve.png', dpi=150)
plt.show()

Log Loss (Binary Cross-Entropy)

Log Loss = - \frac{1}{n} \sum_{i = 1}^{n} [y_{i} \log ({\hat{p}}_{i}) + (1 - y_{i}) \log (1 - {\hat{p}}_{i})]

Worked Example — Log Loss

4 predictions with true labels and predicted probabilities:

Sample	y_true	p_hat	Loss contribution
1	1	0.90	-1*log(0.90) = 0.105
2	0	0.20	-1*log(0.80) = 0.223
3	1	0.60	-1*log(0.60) = 0.511
4	0	0.95	-1*log(0.05) = 2.996

Step 1: Sum contributions = 0.105 + 0.223 + 0.511 + 2.996 = 3.835

Step 2: Average Log Loss = 3.835 / 4 = 0.959

Step 3: Notice sample 4 dominates the loss Sample 4 alone contributes 2.996 — it predicted 95% positive but the true label is negative (confident and wrong!).

Interpret: "Log loss = 0.959 is high, mainly because sample 4 was confidently wrong (p=0.95 for a negative). Log loss harshly penalizes confident wrong predictions: being 95% sure and wrong costs 2.996, while being 90% sure and right only contributes 0.105."

Log loss measures the quality of probability estimates, not just binary predictions. It penalizes confident wrong predictions heavily.
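
The four-sample example above can be reproduced with NumPy and cross-checked against `sklearn.metrics.log_loss`. This is a small verification sketch; the array names are illustrative.

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 0])
p_hat = np.array([0.90, 0.20, 0.60, 0.95])  # predicted P(y = 1)

# Per-sample contribution: -[y log(p) + (1 - y) log(1 - p)]
per_sample = -(y_true * np.log(p_hat) + (1 - y_true) * np.log(1 - p_hat))
manual_log_loss = per_sample.mean()

# A 1-D array of probabilities is interpreted as P(positive class)
sk_log_loss = log_loss(y_true, p_hat)

print("Per-sample contributions:", np.round(per_sample, 3))
print(f"Manual log loss:  {manual_log_loss:.3f}")
print(f"sklearn log loss: {sk_log_loss:.3f}")
```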

python

# log_loss_demo.py — Why log loss matters for probability calibration
import numpy as np
from sklearn.metrics import log_loss

# True label: positive (1)
y_true = [1]
# y_true contains a single class, so log_loss must be told the full label set
labels = [0, 1]

# Confident and correct: low loss
print(f"P(1)=0.99: log_loss = {log_loss(y_true, [[0.01, 0.99]], labels=labels):.4f}")

# Slightly confident: moderate loss
print(f"P(1)=0.70: log_loss = {log_loss(y_true, [[0.30, 0.70]], labels=labels):.4f}")

# Uncertain: higher loss
print(f"P(1)=0.51: log_loss = {log_loss(y_true, [[0.49, 0.51]], labels=labels):.4f}")

# Confident and WRONG: very high loss
print(f"P(1)=0.01: log_loss = {log_loss(y_true, [[0.99, 0.01]], labels=labels):.4f}")

Regression Metrics

Mean Squared Error (MSE)

MSE = \frac{1}{n} \sum_{i = 1}^{n} (y_{i} - {\hat{y}}_{i})^{2}

Properties:

Always non-negative
Penalizes large errors quadratically
Units are target-units-squared

Root Mean Squared Error (RMSE)

RMSE = \sqrt{MSE} = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} (y_{i} - {\hat{y}}_{i})^{2}}

Properties:

Same units as target variable
Sensitive to outliers (due to squaring)

Mean Absolute Error (MAE)

MAE = \frac{1}{n} \sum_{i = 1}^{n} | y_{i} - {\hat{y}}_{i} |

Properties:

Same units as target
More robust to outliers than RMSE
Linear penalty for all errors

R-Squared ( $R^{2}$ )

R^{2} = 1 - \frac{\sum_{i = 1}^{n} (y_{i} - {\hat{y}}_{i})^{2}}{\sum_{i = 1}^{n} (y_{i} - \bar{y})^{2}} = 1 - \frac{S S_{r e s}}{S S_{t o t}}

Worked Example — R-Squared, MSE, RMSE, MAE

Mini regression results:

y_true	y_pred	error	|error|	error^2
10	12	-2	2	4
20	18	2	2	4
30	33	-3	3	9
40	39	1	1	1
50	48	2	2	4

y_bar = (10+20+30+40+50)/5 = 30

Step 1: MSE = (4+4+9+1+4)/5 = 22/5 = 4.40

Step 2: RMSE = sqrt(4.40) = 2.10

Step 3: MAE = (2+2+3+1+2)/5 = 10/5 = 2.00

Step 4: R-squared
SS_res = 4+4+9+1+4 = 22
SS_tot = (10-30)^2 + (20-30)^2 + (30-30)^2 + (40-30)^2 + (50-30)^2 = 400 + 100 + 0 + 100 + 400 = 1000
R^2 = 1 - 22/1000 = 1 - 0.022 = 0.978

Interpret: "R^2 = 0.978 means the model explains 97.8% of the variance in y. RMSE = 2.10 means the typical prediction error is about 2.1 units; the average absolute error (MAE) is 2.0 units."
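
The same numbers fall out of a few lines of NumPy. This is a verification sketch of the mini example, with the error defined as y_true - y_pred as in the table.

```python
import numpy as np

y_true = np.array([10, 20, 30, 40, 50], dtype=float)
y_pred = np.array([12, 18, 33, 39, 48], dtype=float)

error = y_true - y_pred

mse = np.mean(error ** 2)
rmse = np.sqrt(mse)
mae = np.mean(np.abs(error))

ss_res = np.sum(error ** 2)                        # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total variation around the mean
r2 = 1 - ss_res / ss_tot

print(f"MSE={mse:.2f}, RMSE={rmse:.2f}, MAE={mae:.2f}, R^2={r2:.3f}")
```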

Properties:

$R^{2} = 1$ : Perfect prediction
$R^{2} = 0$ : Model is as good as predicting the mean
$R^{2} < 0$ : Model is worse than predicting the mean
Can be arbitrarily negative for bad models

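Derivation: $S S_{t o t}$ is the total sum of squares, the variation of $y$ around its mean. $S S_{r e s}$ is the residual sum of squares, the variation the model leaves unexplained. Their ratio is the unexplained fraction, so $R^{2}$ is the fraction of variance explained by the model.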

Adjusted R-Squared

On the training data, the $R^{2}$ of a least-squares fit never decreases when you add features, even useless ones. Adjusted $R^{2}$ penalizes complexity:

R_{a d j}^{2} = 1 - \frac{(1 - R^{2}) (n - 1)}{n - d - 1}

where $n$ is the number of samples and $d$ is the number of features.

Worked Example — Adjusted R-Squared

Model with n=50 samples, d=5 features, R^2 = 0.80

Step 1: Compute adjusted R^2 R^2_adj = 1 - (1 - 0.80)(50 - 1) / (50 - 5 - 1) = 1 - (0.20)(49) / 44 = 1 - 9.8/44 = 1 - 0.2227 = 0.777

Step 2: Compare: what if we add 10 useless features (d=15, R^2 stays 0.80)? R^2_adj = 1 - (0.20)(49) / (50 - 15 - 1) = 1 - 9.8/34 = 1 - 0.288 = 0.712

Step 3: What if adding those features improves R^2 to 0.82? R^2_adj = 1 - (0.18)(49) / 34 = 1 - 8.82/34 = 1 - 0.259 = 0.741

Interpret: "Original model: R^2=0.80, R^2_adj=0.777. Adding 10 useless features keeps R^2 at 0.80 but drops R^2_adj to 0.712 (penalizing complexity). Even if R^2 improves slightly to 0.82, R^2_adj is only 0.741 — still lower than the simpler model. Adjusted R^2 correctly favors the simpler model."
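
The three scenarios can be recomputed with a small helper. `adjusted_r2` is a hypothetical function written for this sketch; it implements the formula above and rejects inputs where n - d - 1 is not positive.

```python
def adjusted_r2(r2, n, d):
    """Adjusted R^2 for n samples and d features."""
    if n - d - 1 <= 0:
        raise ValueError("need n > d + 1 for adjusted R^2 to be defined")
    return 1 - (1 - r2) * (n - 1) / (n - d - 1)


original = adjusted_r2(0.80, n=50, d=5)        # baseline model
useless = adjusted_r2(0.80, n=50, d=15)        # 10 extra features, no gain in R^2
slight_gain = adjusted_r2(0.82, n=50, d=15)    # 10 extra features, small gain

print(f"Original (d=5,  R^2=0.80): {original:.3f}")
print(f"Useless  (d=15, R^2=0.80): {useless:.3f}")
print(f"Gain     (d=15, R^2=0.82): {slight_gain:.3f}")
```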

python

# regression_metrics.py — All regression metrics
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import (
    mean_squared_error, mean_absolute_error, r2_score,
    mean_absolute_percentage_error, median_absolute_error
)

housing = fetch_california_housing()
X, y = housing.data, housing.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = GradientBoostingRegressor(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
mape = mean_absolute_percentage_error(y_test, y_pred)
med_ae = median_absolute_error(y_test, y_pred)

# Adjusted R²
n = len(y_test)
d = X_test.shape[1]
r2_adj = 1 - (1 - r2) * (n - 1) / (n - d - 1)

print("=== Regression Metrics ===\n")
print(f"MSE:          {mse:.4f}")
print(f"RMSE:         {rmse:.4f}")
print(f"MAE:          {mae:.4f}")
print(f"Median AE:    {med_ae:.4f}")
print(f"MAPE:         {mape:.4f} ({mape*100:.2f}%)")
print(f"R²:           {r2:.4f}")
print(f"Adjusted R²:  {r2_adj:.4f}")

# Baseline comparison
baseline_pred = np.full_like(y_test, y_train.mean())
baseline_rmse = np.sqrt(mean_squared_error(y_test, baseline_pred))
print(f"\nBaseline RMSE (predict mean): {baseline_rmse:.4f}")
print(f"Model improves by: {(1 - rmse/baseline_rmse)*100:.1f}%")

RMSE vs MAE: When to Use Which

python

# rmse_vs_mae.py — Sensitivity to outliers
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

# True values
y_true = np.array([10, 20, 30, 40, 50])

# Predictions with one outlier error
y_pred_good = np.array([11, 19, 31, 39, 51])  # all close
y_pred_outlier = np.array([11, 19, 31, 39, 80])  # one big error on last

for name, y_pred in [("Good", y_pred_good), ("One outlier", y_pred_outlier)]:
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    print(f"{name:12s}: RMSE={rmse:.2f}, MAE={mae:.2f}, RMSE/MAE={rmse/mae:.2f}")

# RMSE/MAE >= 1 always; the ratio grows as the errors become more uneven.
# If all absolute errors are equal: RMSE/MAE = 1 (the "Good" case above)
# As outlier errors increase: RMSE/MAE increases

When to Use Which Metric

Classification

Metric	Use When	Avoid When
Accuracy	Balanced classes	Imbalanced data
Precision	FP is costly (spam filter)	FN is more important
Recall	FN is costly (disease detection)	FP is more important
F1	Need balance of precision/recall	One matters much more
F2	Recall is 2x more important	Precision matters more
ROC-AUC	Comparing models, balanced	Heavily imbalanced data
PR-AUC	Imbalanced data	Balanced data (use ROC-AUC)
MCC	Imbalanced data, single number	Need threshold-free metric
Log Loss	Need calibrated probabilities	Only need class labels

Regression

Metric	Use When	Properties
MSE	Standard loss function	Penalizes large errors
RMSE	Want interpretable units	Sensitive to outliers
MAE	Robust to outliers needed	Linear penalty
MAPE	Relative error matters	Fails when $y$ near zero
R-squared	Compare to baseline	Can be negative
Adjusted R-squared	Comparing models with different features	Penalizes complexity
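
The MAPE caveat in the table is worth seeing in numbers. The sketch below uses a hand-written `mape` helper (hypothetical, not the scikit-learn function) and made-up values in which one target is close to zero.

```python
import numpy as np


def mape(y_true, y_pred):
    """Mean absolute percentage error, as a fraction (0.10 means 10%)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred) / np.abs(y_true))


# All targets far from zero: errors of 10%, 5%, and 10%
mape_safe = mape([100, 200, 50], [110, 190, 55])

# Same first two points, but the third target is 0.01 and the error is 1.0
mape_near_zero = mape([100, 200, 0.01], [110, 190, 1.01])

print(f"MAPE, targets away from zero: {mape_safe:.4f}")
print(f"MAPE, one target near zero:   {mape_near_zero:.4f}")
```

An absolute error of 1.0 on a target of 0.01 is a 10,000% relative error, so a single near-zero target dominates the average even though the other predictions are reasonable.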

Multi-Class Metrics

Averaging Strategies

For $K$ classes, precision/recall/F1 are computed per-class, then averaged:

Strategy	Formula	When
Macro	$\frac{1}{K} \sum_{k = 1}^{K} {metric}_{k}$	All classes equally important
Weighted	$\sum_{k = 1}^{K} \frac{n_{k}}{n} {metric}_{k}$	Proportional to class size
Micro	Compute from total TP, FP, FN	Every sample counts equally; equals accuracy in single-label multi-class problems
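
The three strategies can be computed by hand and compared with scikit-learn. This sketch uses a small made-up label set (an assumption for illustration) and precision as the metric.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score

# Toy 3-class problem: class 0 has 6 samples, class 1 has 3, class 2 has 1
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])
y_pred = np.array([0, 0, 0, 0, 1, 1, 1, 1, 0, 2])

per_class, support = [], []
tp_total = fp_total = 0
for k in np.unique(y_true):
    tp = np.sum((y_pred == k) & (y_true == k))
    fp = np.sum((y_pred == k) & (y_true != k))
    per_class.append(tp / (tp + fp))       # precision for class k
    support.append(np.sum(y_true == k))    # number of true samples in class k
    tp_total += tp
    fp_total += fp

per_class = np.array(per_class)
support = np.array(support)

macro = per_class.mean()
weighted = np.sum(support / support.sum() * per_class)
micro = tp_total / (tp_total + fp_total)
accuracy = accuracy_score(y_true, y_pred)

print(f"macro={macro:.4f}, weighted={weighted:.4f}, micro={micro:.4f}")
print(f"accuracy={accuracy:.4f}")
```

On this toy data the macro average is the highest because the single-sample class 2 is predicted perfectly and counts as much as the larger classes, while micro precision matches accuracy exactly.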

python

# multiclass_metrics.py — Multi-class evaluation
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    classification_report, accuracy_score,
    precision_score, recall_score, f1_score
)
import numpy as np

iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Per-class report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))

for avg in ['macro', 'weighted', 'micro']:
    p = precision_score(y_test, y_pred, average=avg)
    r = recall_score(y_test, y_pred, average=avg)
    f = f1_score(y_test, y_pred, average=avg)
    print(f"{avg:>10}: P={p:.4f}, R={r:.4f}, F1={f:.4f}")

Metric Selection Framework

python

# metric_selection.py — Decision tree for metric selection
def recommend_metric(task, class_balance, cost_priority, need_probabilities):
    """Recommend the best metric based on problem characteristics."""

    if task == 'regression':
        print("Regression metrics:")
        print("  Primary: RMSE (interpretable units)")
        print("  Also report: MAE (robust to outliers), R² (vs baseline)")
        return

    # Classification
    if class_balance == 'balanced':
        if need_probabilities:
            metric = "Log Loss (for probability quality) + ROC-AUC"
        else:
            metric = "F1-Score (macro for multi-class)"
    else:  # imbalanced
        if cost_priority == 'false_negatives':
            metric = "Recall + F2-Score + PR-AUC"
        elif cost_priority == 'false_positives':
            metric = "Precision + F0.5-Score"
        else:
            metric = "MCC + PR-AUC (most robust for imbalanced)"

    print(f"Recommended: {metric}")

# Examples
print("=== Disease screening ===")
recommend_metric('classification', 'imbalanced', 'false_negatives', True)

print("\n=== Spam filter ===")
recommend_metric('classification', 'imbalanced', 'false_positives', True)

print("\n=== House price prediction ===")
recommend_metric('regression', 'n/a', 'n/a', False)

print("\n=== Balanced binary classification ===")
recommend_metric('classification', 'balanced', 'equal', True)

Cross-Validated Evaluation

python

# cv_evaluation.py — Robust evaluation with cross-validation
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate, StratifiedKFold
import numpy as np

data = load_breast_cancer()
X, y = data.data, data.target

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)

scoring = {
    'accuracy': 'accuracy',
    'precision': 'precision',
    'recall': 'recall',
    'f1': 'f1',
    'roc_auc': 'roc_auc',
    'mcc': 'matthews_corrcoef',
}

results = cross_validate(model, X, y, cv=cv, scoring=scoring)

print(f"{'Metric':<15} {'Mean':>10} {'Std':>8} {'Min':>10} {'Max':>10}")
print("-" * 55)
for metric in scoring:
    values = results[f'test_{metric}']
    print(f"{metric:<15} {values.mean():>10.4f} {values.std():>8.4f} "
          f"{values.min():>10.4f} {values.max():>10.4f}")

Quick Reference

Classification

Metric	Formula	Range	Higher is Better
Accuracy	$(T P + T N) / (T P + T N + F P + F N)$	$[0, 1]$	Yes
Precision	$T P / (T P + F P)$	$[0, 1]$	Yes
Recall	$T P / (T P + F N)$	$[0, 1]$	Yes
F1	$2 \cdot \frac{P \cdot R}{P + R}$	$[0, 1]$	Yes
ROC-AUC	Area under ROC	$[0, 1]$	Yes
PR-AUC	Area under PR curve	$[0, 1]$	Yes
MCC	$\frac{T P \cdot T N - F P \cdot F N}{\sqrt{\dots}}$	$[- 1, 1]$	Yes
Log Loss	$- \frac{1}{n} \sum [y \log \hat{p} + (1 - y) \log (1 - \hat{p})]$	$[0, \infty)$	No (lower)

Regression

Metric	Formula	Range	Lower is Better
MSE	$\frac{1}{n} \sum (y - \hat{y})^{2}$	$[0, \infty)$	Yes
RMSE	$\sqrt{M S E}$	$[0, \infty)$	Yes
MAE	$\frac{1}{n} \sum | y - \hat{y} |$	$[0, \infty)$	Yes
R-squared	$1 - S S_{r e s} / S S_{t o t}$	$(- \infty, 1]$	No (higher)
MAPE	$\frac{1}{n} \sum | y - \hat{y} | / | y |$	$[0, \infty)$	Yes
