
Evaluation Metrics

Choosing the wrong metric is as dangerous as choosing the wrong model. A model that achieves 99% accuracy on an imbalanced dataset may be useless. A regression model with low RMSE may still miss the trend. This page covers every metric you need, with full math and guidance on when to use each.


Classification Metrics

The Confusion Matrix

All binary classification metrics derive from four counts:

| | Predicted Positive | Predicted Negative |
|--------|--------|--------|
| Actually Positive | True Positive (TP) | False Negative (FN) |
| Actually Negative | False Positive (FP) | True Negative (TN) |

python
# confusion_matrix.py — Building the confusion matrix
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 0, 0, 0, 1, 0])

cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

print(f"True Positives  (TP): {tp}")
print(f"False Positives (FP): {fp}")
print(f"False Negatives (FN): {fn}")
print(f"True Negatives  (TN): {tn}")

Accuracy

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

When to use: Balanced classes only.

When NOT to use: Imbalanced data. If 99% of examples are negative, predicting all negative yields 99% accuracy but detects 0% of the positives.

Precision

$$\text{Precision} = \frac{TP}{TP + FP}$$

"Of all the items I predicted positive, how many were actually positive?"

When to use: When false positives are costly (spam filter — do not put legitimate email in spam).

Recall (Sensitivity, True Positive Rate)

$$\text{Recall} = \frac{TP}{TP + FN}$$

"Of all the actually positive items, how many did I catch?"

When to use: When false negatives are costly (disease screening — do not miss a sick patient).

F1 Score

The harmonic mean of precision and recall:

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2\,TP}{2\,TP + FP + FN}$$

Worked Example — Precision, Recall, and F1

Confusion matrix for a disease screening test (200 patients):

| | Predicted Sick | Predicted Healthy |
|--------|--------|--------|
| Actually Sick | TP = 40 | FN = 10 |
| Actually Healthy | FP = 20 | TN = 130 |

Step 1: Accuracy
Accuracy = (40 + 130) / (40 + 130 + 20 + 10) = 170/200 = 0.850

Step 2: Precision ("Of those I flagged as sick, how many truly are?")
Precision = TP / (TP + FP) = 40 / (40 + 20) = 40/60 = 0.667

Step 3: Recall ("Of all sick patients, how many did I catch?")
Recall = TP / (TP + FN) = 40 / (40 + 10) = 40/50 = 0.800

Step 4: F1 Score
F1 = 2 * (0.667 * 0.800) / (0.667 + 0.800) = 2 * 0.534 / 1.467 = 1.067 / 1.467 = 0.727

Interpret: "The model catches 80% of sick patients (good recall) but 33% of its positive predictions are wrong (precision = 66.7%). The F1 of 0.727 balances both. For disease screening, we'd want higher recall even at the cost of precision."
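
The arithmetic in this worked example can be checked in a few lines. This is a minimal sketch that uses only the four counts from the screening table; the variable names are illustrative.

```python
# prf_from_counts.py: recompute the worked example from the raw counts
tp, fn, fp, tn = 40, 10, 20, 130

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

# Two equivalent forms of F1: harmonic mean, and the count-based shortcut
f1 = 2 * precision * recall / (precision + recall)
f1_from_counts = 2 * tp / (2 * tp + fp + fn)

print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1:        {f1:.3f} (count form: {f1_from_counts:.3f})")
```

Both F1 forms should agree up to floating-point rounding, which is a quick way to confirm the identity stated in the F1 formula.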

Why harmonic mean? It penalizes extreme imbalances. If precision = 1.0 and recall = 0.01, then:

  • Arithmetic mean = 0.505 (misleadingly high)
  • Harmonic mean = 0.0198 (correctly low)
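
The comparison above can be reproduced directly. A minimal sketch, using the same precision and recall values as the bullets:

```python
# harmonic_vs_arithmetic.py: why F1 uses the harmonic mean
precision, recall = 1.0, 0.01

arithmetic = (precision + recall) / 2
harmonic = 2 * precision * recall / (precision + recall)

print(f"Arithmetic mean: {arithmetic:.4f}")
print(f"Harmonic mean:   {harmonic:.4f}")
```

The harmonic mean stays close to the smaller of the two values, so a model cannot earn a high F1 by maximizing only one of precision or recall.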

F-beta Score

Generalization that lets you weight precision vs recall:

$$F_\beta = (1 + \beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}$$

  • β=1: Standard F1 (equal weight)
  • β=0.5: Weights precision more
  • β=2: Weights recall more
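
To see the effect of β, the formula can be applied to the disease screening example (precision = 40/60, recall = 40/50). This is a sketch; `f_beta` is a hypothetical helper written for illustration, not a library function.

```python
# f_beta_from_scratch.py: how beta shifts the score between precision and recall
def f_beta(precision, recall, beta):
    """F-beta score computed from precision and recall."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

precision = 40 / 60  # from the screening example
recall = 40 / 50

scores = {beta: f_beta(precision, recall, beta) for beta in (0.5, 1, 2)}
for beta, score in scores.items():
    print(f"beta={beta}: F = {score:.3f}")
```

Because recall is higher than precision in this example, the score rises as β grows and more weight moves onto recall.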
python
# classification_metrics.py — All classification metrics
import numpy as np
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    fbeta_score, matthews_corrcoef, log_loss, roc_auc_score,
    average_precision_score, classification_report
)

# Simulated predictions
np.random.seed(42)
y_true = np.array([1]*100 + [0]*900)  # 10% positive
# Note: y_pred is drawn at random, independently of y_true and y_proba,
# so the label-based metrics below describe a chance-level classifier.
y_pred = np.random.choice([0, 1], 1000, p=[0.85, 0.15])
y_proba = np.random.beta(2, 5, 1000)  # probability estimates
y_proba[y_true == 1] += 0.3  # make positive class have higher probabilities
y_proba = np.clip(y_proba, 0, 1)

print("=== Classification Metrics ===\n")
print(f"Accuracy:           {accuracy_score(y_true, y_pred):.4f}")
print(f"Precision:          {precision_score(y_true, y_pred):.4f}")
print(f"Recall:             {recall_score(y_true, y_pred):.4f}")
print(f"F1 Score:           {f1_score(y_true, y_pred):.4f}")
print(f"F0.5 (prec-heavy):  {fbeta_score(y_true, y_pred, beta=0.5):.4f}")
print(f"F2 (recall-heavy):  {fbeta_score(y_true, y_pred, beta=2):.4f}")
print(f"MCC:                {matthews_corrcoef(y_true, y_pred):.4f}")
print(f"ROC-AUC:            {roc_auc_score(y_true, y_proba):.4f}")
print(f"PR-AUC:             {average_precision_score(y_true, y_proba):.4f}")
print(f"Log Loss:           {log_loss(y_true, y_proba):.4f}")

Matthews Correlation Coefficient (MCC)

MCC uses all four confusion matrix values and works well even with imbalanced data:

$$\text{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$

Worked Example — MCC

Using the disease screening confusion matrix (TP=40, TN=130, FP=20, FN=10):

Step 1: Compute numerator
TP * TN - FP * FN = 40 * 130 - 20 * 10 = 5200 - 200 = 5000

Step 2: Compute denominator
(TP+FP) = 60, (TP+FN) = 50, (TN+FP) = 150, (TN+FN) = 140
sqrt(60 * 50 * 150 * 140) = sqrt(63,000,000) = 7937.3

Step 3: Compute MCC
MCC = 5000 / 7937.3 = 0.630

Compare with a "predict all healthy" model (TP=0, FP=0, FN=50, TN=150):
Numerator = 0 * 150 - 0 * 50 = 0
The denominator is also 0 here because TP+FP = 0, so MCC is formally undefined; by the usual convention it is set to 0, which correctly identifies this model as useless.
But accuracy = 150/200 = 0.75 (misleadingly high)

Interpret: "MCC of 0.630 indicates a moderately good classifier. Unlike accuracy (0.85), MCC correctly penalizes a model that just predicts the majority class."
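
The same calculation can be written as a small function. This is a sketch: `mcc_from_counts` is a hypothetical helper, and returning 0 when the denominator is 0 is the convention assumed here.

```python
# mcc_from_counts.py: MCC computed directly from the four confusion-matrix counts
import math

def mcc_from_counts(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when the denominator is 0."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0

mcc_screening = mcc_from_counts(tp=40, tn=130, fp=20, fn=10)
mcc_all_healthy = mcc_from_counts(tp=0, tn=150, fp=0, fn=50)

print(f"Screening model:     MCC = {mcc_screening:.3f}")
print(f"Predict all healthy: MCC = {mcc_all_healthy:.3f}")
```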

  • MCC = +1: Perfect prediction
  • MCC = 0: Random prediction
  • MCC = -1: Total disagreement
python
# mcc.py — Why MCC is superior for imbalanced data
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

# Scenario: 99% negative, model predicts all negative
y_true = np.array([0]*990 + [1]*10)
y_pred_all_neg = np.zeros(1000, dtype=int)

print("Model: Predict all negative (on 99% negative data)")
print(f"  Accuracy: {accuracy_score(y_true, y_pred_all_neg):.4f}")  # 0.99: misleading!
# zero_division=0 avoids the undefined-metric warning when nothing is predicted positive
print(f"  F1:       {f1_score(y_true, y_pred_all_neg, zero_division=0):.4f}")  # 0.00: correct
print(f"  MCC:      {matthews_corrcoef(y_true, y_pred_all_neg):.4f}")  # 0.00: correct

# Model that catches 8 out of 10 positives
y_pred_good = np.zeros(1000, dtype=int)
y_pred_good[990:998] = 1  # catches 8 positives
y_pred_good[50:60] = 1    # 10 false positives

print("\nModel: Catches 8/10 positives with 10 false positives")
print(f"  Accuracy: {accuracy_score(y_true, y_pred_good):.4f}")
print(f"  F1:       {f1_score(y_true, y_pred_good):.4f}")
print(f"  MCC:      {matthews_corrcoef(y_true, y_pred_good):.4f}")

ROC-AUC

The Receiver Operating Characteristic curve plots True Positive Rate vs False Positive Rate at all thresholds.

$$\text{TPR} = \frac{TP}{TP + FN}, \quad \text{FPR} = \frac{FP}{FP + TN}$$

AUC (Area Under the ROC Curve) measures the probability that a randomly chosen positive example is ranked higher than a randomly chosen negative example.

  • AUC = 1.0: Perfect
  • AUC = 0.5: Random
  • AUC < 0.5: Worse than random (flip predictions)
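
The ranking interpretation of AUC can be checked by brute force: compare every positive score with every negative score and count how often the positive one is higher. The sketch below uses a small hypothetical set of labels and scores, counts ties as half a win, and `pairwise_auc` is an illustrative helper rather than a library function.

```python
# auc_as_ranking.py: AUC as the probability that a positive outranks a negative
from itertools import product

def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))

# Hypothetical labels and model scores, for illustration only
y_true = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.8, 0.3, 0.2, 0.1]

auc_value = pairwise_auc(y_true, scores)
print(f"Pairwise AUC: {auc_value:.4f}")
```

Here 10 of the 12 positive-negative pairs are ordered correctly; the only misses are the negative scored 0.8 outranking the positives scored 0.7 and 0.4. `roc_auc_score` from scikit-learn is expected to return the same value on these inputs.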
python
# roc_auc.py — ROC curve and AUC
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import roc_curve, auc

data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    'Logistic Regression': make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000)),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
}

plt.figure(figsize=(8, 6))

for name, model in models.items():
    model.fit(X_train, y_train)
    y_proba = model.predict_proba(X_test)[:, 1]
    fpr, tpr, thresholds = roc_curve(y_test, y_proba)
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, linewidth=2, label=f'{name} (AUC={roc_auc:.3f})')

plt.plot([0, 1], [0, 1], 'k--', label='Random (AUC=0.5)')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('roc_curves.png', dpi=150)
plt.show()

PR-AUC (Precision-Recall AUC)

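For imbalanced datasets, PR-AUC is more informative than ROC-AUC. ROC-AUC can be misleadingly high when negatives vastly outnumber positives: the false positive rate divides by the large number of negatives, so even many false positives barely move the ROC curve, while they lower precision sharply.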

python
# pr_auc.py — Precision-Recall curve
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve, auc, average_precision_score

# Imbalanced dataset
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_proba = model.predict_proba(X_test)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
pr_auc = auc(recall, precision)
ap = average_precision_score(y_test, y_proba)

print(f"PR-AUC: {pr_auc:.4f}")
print(f"Average Precision: {ap:.4f}")

baseline = y_test.mean()
print(f"Baseline (random): {baseline:.4f}")

plt.figure(figsize=(8, 6))
plt.plot(recall, precision, 'b-', linewidth=2, label=f'Model (AP={ap:.3f})')
plt.axhline(y=baseline, color='r', linestyle='--', label=f'Baseline ({baseline:.3f})')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('pr_curve.png', dpi=150)
plt.show()

Log Loss (Binary Cross-Entropy)

$$\text{Log Loss} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{p}_i) + (1 - y_i) \log(1 - \hat{p}_i) \right]$$

Worked Example — Log Loss

4 predictions with true labels and predicted probabilities:

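| Sample | y_true | p_hat | Loss contribution |
|--------|--------|-------|-------------------|
| 1 | 1 | 0.90 | -1*log(0.90) = 0.105 |
| 2 | 0 | 0.20 | -1*log(0.80) = 0.223 |
| 3 | 1 | 0.60 | -1*log(0.60) = 0.511 |
| 4 | 0 | 0.95 | -1*log(0.05) = 2.996 |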

Step 1: Sum contributions = 0.105 + 0.223 + 0.511 + 2.996 = 3.835

Step 2: Average Log Loss = 3.835 / 4 = 0.959

Step 3: Notice sample 4 dominates the loss
Sample 4 alone contributes 2.996: it predicted 95% positive but the true label is negative (confident and wrong!).

Interpret: "Log loss = 0.959 is high, mainly because sample 4 was confidently wrong (p=0.95 for a negative). Log loss harshly penalizes confident wrong predictions: being 95% sure and wrong costs 2.996, while being 90% sure and right only contributes 0.105."

Log loss measures the quality of probability estimates, not just binary predictions. It penalizes confident wrong predictions heavily.
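
The worked example can be reproduced with the standard library alone. A minimal sketch, using the four samples from the table and the natural logarithm:

```python
# log_loss_by_hand.py: per-sample contributions and their average
import math

y_true = [1, 0, 1, 0]
p_hat = [0.90, 0.20, 0.60, 0.95]  # predicted probability of the positive class

# Each sample contributes -log(p) if it is positive, -log(1 - p) if negative
contributions = [
    -(y * math.log(p) + (1 - y) * math.log(1 - p))
    for y, p in zip(y_true, p_hat)
]
log_loss_value = sum(contributions) / len(contributions)

for i, c in enumerate(contributions, start=1):
    print(f"Sample {i}: {c:.3f}")
print(f"Log loss: {log_loss_value:.3f}")
```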

python
# log_loss_demo.py — Why log loss matters for probability calibration
import numpy as np
from sklearn.metrics import log_loss

# True label: positive (1)
y_true = [1]
# y_true contains a single class, so log_loss needs the full label set
# passed explicitly; without it scikit-learn raises a ValueError.
labels = [0, 1]

# Confident and correct: low loss
print(f"P(1)=0.99: log_loss = {log_loss(y_true, [[0.01, 0.99]], labels=labels):.4f}")

# Slightly confident: moderate loss
print(f"P(1)=0.70: log_loss = {log_loss(y_true, [[0.30, 0.70]], labels=labels):.4f}")

# Uncertain: higher loss
print(f"P(1)=0.51: log_loss = {log_loss(y_true, [[0.49, 0.51]], labels=labels):.4f}")

# Confident and WRONG: very high loss
print(f"P(1)=0.01: log_loss = {log_loss(y_true, [[0.99, 0.01]], labels=labels):.4f}")

Regression Metrics

Mean Squared Error (MSE)

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Properties:

  • Always non-negative
  • Penalizes large errors quadratically
  • Units are target-units-squared

Root Mean Squared Error (RMSE)

$$\text{RMSE} = \sqrt{\text{MSE}} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$

Properties:

  • Same units as target variable
  • Sensitive to outliers (due to squaring)

Mean Absolute Error (MAE)

$$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$$

Properties:

  • Same units as target
  • More robust to outliers than RMSE
  • Linear penalty for all errors

R-Squared ($R^2$)

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = 1 - \frac{SS_{res}}{SS_{tot}}$$

Worked Example — R-Squared, MSE, RMSE, MAE

Mini regression results:

| y_true | y_pred | error | abs(error) | error^2 |
|--------|--------|-------|------------|---------|
| 10 | 12 | -2 | 2 | 4 |
| 20 | 18 | 2 | 2 | 4 |
| 30 | 33 | -3 | 3 | 9 |
| 40 | 39 | 1 | 1 | 1 |
| 50 | 48 | 2 | 2 | 4 |

y_bar = (10+20+30+40+50)/5 = 30

Step 1: MSE = (4+4+9+1+4)/5 = 22/5 = 4.40

Step 2: RMSE = sqrt(4.40) = 2.10

Step 3: MAE = (2+2+3+1+2)/5 = 10/5 = 2.00

Step 4: R-squared
SS_res = 4+4+9+1+4 = 22
SS_tot = (10-30)^2 + (20-30)^2 + (30-30)^2 + (40-30)^2 + (50-30)^2 = 400 + 100 + 0 + 100 + 400 = 1000
R^2 = 1 - 22/1000 = 1 - 0.022 = 0.978

Interpret: "R^2 = 0.978 means the model explains 97.8% of the variance in y. RMSE = 2.10 means a typical prediction error is about 2.1 units; the plain average of the absolute errors is the MAE, 2.00."
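
These four numbers can be verified without any library beyond `math`. A minimal sketch over the five rows of the table:

```python
# regression_by_hand.py: MSE, RMSE, MAE and R^2 for the mini example
import math

y_true = [10, 20, 30, 40, 50]
y_pred = [12, 18, 33, 39, 48]
n = len(y_true)

errors = [t - p for t, p in zip(y_true, y_pred)]
mse = sum(e ** 2 for e in errors) / n
rmse = math.sqrt(mse)
mae = sum(abs(e) for e in errors) / n

y_bar = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((t - y_bar) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(f"MSE={mse:.2f}, RMSE={rmse:.2f}, MAE={mae:.2f}, R^2={r2:.3f}")
```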

Properties:

  • $R^2 = 1$: Perfect prediction
  • $R^2 = 0$: Model is as good as predicting the mean
  • $R^2 < 0$: Model is worse than predicting the mean
  • Can be arbitrarily negative for bad models

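Derivation: $SS_{tot}$ is the total sum of squares around the mean, which is proportional to the variance of y. $SS_{res}$ is the residual sum of squares, the part the model leaves unexplained. $R^2$ is therefore the fraction of variance explained by the model.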
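
The $R^2 = 0$ and $R^2 < 0$ cases follow directly from the definition. The sketch below uses hypothetical predictions; `r_squared` is an illustrative helper, not a library function.

```python
# r2_edge_cases.py: R^2 is 0 for the mean predictor and negative for worse models
def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot, with SS_tot taken around the mean of y_true."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [10, 20, 30, 40, 50]

r2_mean = r_squared(y_true, [30] * 5)                 # always predict the mean
r2_reversed = r_squared(y_true, [50, 40, 30, 20, 10]) # trend predicted backwards

print(f"Predict the mean: R^2 = {r2_mean:.1f}")
print(f"Reversed trend:   R^2 = {r2_reversed:.1f}")
```

Predicting the trend backwards gives $SS_{res} = 4000$ against $SS_{tot} = 1000$, so $R^2 = 1 - 4 = -3$.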

Adjusted R-Squared

Standard $R^2$, measured on the training data of a least-squares fit, never decreases when you add features, even useless ones. Adjusted $R^2$ penalizes complexity:

$$R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - d - 1}$$

where n is the number of samples and d is the number of features.

Worked Example — Adjusted R-Squared

Model with n=50 samples, d=5 features, R^2 = 0.80

Step 1: Compute adjusted R^2
R^2_adj = 1 - (1 - 0.80)(50 - 1) / (50 - 5 - 1) = 1 - (0.20)(49) / 44 = 1 - 9.8/44 = 1 - 0.2227 = 0.777

Step 2: Compare: what if we add 10 useless features (d=15, R^2 stays 0.80)?
R^2_adj = 1 - (0.20)(49) / (50 - 15 - 1) = 1 - 9.8/34 = 1 - 0.288 = 0.712

Step 3: What if adding those features improves R^2 to 0.82?
R^2_adj = 1 - (0.18)(49) / 34 = 1 - 8.82/34 = 1 - 0.259 = 0.741

Interpret: "Original model: R^2=0.80, R^2_adj=0.777. Adding 10 useless features keeps R^2 at 0.80 but drops R^2_adj to 0.712 (penalizing complexity). Even if R^2 improves slightly to 0.82, R^2_adj is only 0.741 — still lower than the simpler model. Adjusted R^2 correctly favors the simpler model."
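
The three scenarios can be compared with a one-line function. A sketch; `adjusted_r2` is a hypothetical helper that implements the formula above.

```python
# adjusted_r2_demo.py: the three scenarios from the worked example
def adjusted_r2(r2, n, d):
    """Adjusted R^2 for n samples and d features."""
    return 1 - (1 - r2) * (n - 1) / (n - d - 1)

original = adjusted_r2(0.80, n=50, d=5)
useless_features = adjusted_r2(0.80, n=50, d=15)
slight_gain = adjusted_r2(0.82, n=50, d=15)

print(f"d=5,  R^2=0.80: adjusted = {original:.3f}")
print(f"d=15, R^2=0.80: adjusted = {useless_features:.3f}")
print(f"d=15, R^2=0.82: adjusted = {slight_gain:.3f}")
```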

python
# regression_metrics.py — All regression metrics
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import (
    mean_squared_error, mean_absolute_error, r2_score,
    mean_absolute_percentage_error, median_absolute_error
)

housing = fetch_california_housing()
X, y = housing.data, housing.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = GradientBoostingRegressor(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
mape = mean_absolute_percentage_error(y_test, y_pred)
med_ae = median_absolute_error(y_test, y_pred)

# Adjusted R²
n = len(y_test)
d = X_test.shape[1]
r2_adj = 1 - (1 - r2) * (n - 1) / (n - d - 1)

print("=== Regression Metrics ===\n")
print(f"MSE:          {mse:.4f}")
print(f"RMSE:         {rmse:.4f}")
print(f"MAE:          {mae:.4f}")
print(f"Median AE:    {med_ae:.4f}")
print(f"MAPE:         {mape:.4f} ({mape*100:.2f}%)")
print(f"R²:           {r2:.4f}")
print(f"Adjusted R²:  {r2_adj:.4f}")

# Baseline comparison
baseline_pred = np.full_like(y_test, y_train.mean())
baseline_rmse = np.sqrt(mean_squared_error(y_test, baseline_pred))
print(f"\nBaseline RMSE (predict mean): {baseline_rmse:.4f}")
print(f"Model improves by: {(1 - rmse/baseline_rmse)*100:.1f}%")

RMSE vs MAE: When to Use Which

python
# rmse_vs_mae.py — Sensitivity to outliers
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

# True values
y_true = np.array([10, 20, 30, 40, 50])

# Predictions with one outlier error
y_pred_good = np.array([11, 19, 31, 39, 51])  # all close
y_pred_outlier = np.array([11, 19, 31, 39, 80])  # one big error on last

for name, y_pred in [("Good", y_pred_good), ("One outlier", y_pred_outlier)]:
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    print(f"{name:12s}: RMSE={rmse:.2f}, MAE={mae:.2f}, RMSE/MAE={rmse/mae:.2f}")

# RMSE/MAE >= 1 always. The ratio grows as the errors become more uneven.
# If all absolute errors are equal: RMSE/MAE = 1 (the "Good" case above)
# As outlier errors increase: RMSE/MAE increases

When to Use Which Metric

Classification

| Metric | Use When | Avoid When |
|--------|----------|------------|
| Accuracy | Balanced classes | Imbalanced data |
| Precision | FP is costly (spam filter) | FN is more important |
| Recall | FN is costly (disease detection) | FP is more important |
| F1 | Need balance of precision/recall | One matters much more |
| F2 | Recall is 2x more important | Precision matters more |
| ROC-AUC | Comparing models, balanced | Heavily imbalanced data |
| PR-AUC | Imbalanced data | Balanced data (use ROC-AUC) |
| MCC | Imbalanced data, single number | Need threshold-free metric |
| Log Loss | Need calibrated probabilities | Only need class labels |

Regression

| Metric | Use When | Properties |
|--------|----------|------------|
| MSE | Standard loss function | Penalizes large errors |
| RMSE | Want interpretable units | Sensitive to outliers |
| MAE | Robust to outliers needed | Linear penalty |
| MAPE | Relative error matters | Fails when y near zero |
| R-squared | Compare to baseline | Can be negative |
| Adjusted R-squared | Comparing models with different features | Penalizes complexity |
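
The note that MAPE fails when y is near zero is easy to demonstrate. In the sketch below (hypothetical data; `mape` and `mae` are illustrative helpers), both prediction sets are off by exactly 1 unit on every sample, yet one near-zero target inflates MAPE by several orders of magnitude.

```python
# mape_near_zero.py: the same absolute errors, very different MAPE
def mape(y_true, y_pred):
    """Mean absolute percentage error, as a fraction (not multiplied by 100)."""
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_large = [100.0, 200.0, 300.0]
y_small = [0.01, 200.0, 300.0]  # one target close to zero

pred_large = [t + 1.0 for t in y_large]
pred_small = [t + 1.0 for t in y_small]

mape_large, mape_small = mape(y_large, pred_large), mape(y_small, pred_small)
mae_large, mae_small = mae(y_large, pred_large), mae(y_small, pred_small)

print(f"Targets far from zero: MAE={mae_large:.2f}, MAPE={mape_large:.4f}")
print(f"One target near zero:  MAE={mae_small:.2f}, MAPE={mape_small:.4f}")
```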

Multi-Class Metrics

Averaging Strategies

For K classes, precision/recall/F1 are computed per-class, then averaged:

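| Strategy | Formula | When |
|----------|---------|------|
| Macro | $\frac{1}{K}\sum_{k=1}^{K} \text{metric}_k$ | All classes equally important |
| Weighted | $\sum_{k=1}^{K} \frac{n_k}{n}\, \text{metric}_k$ | Proportional to class size |
| Micro | Compute from total TP, FP, FN | Equivalent to accuracy (in single-label multi-class problems) |
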
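
The three strategies can be computed by hand from a multi-class confusion matrix. The sketch below uses a hypothetical 3-class matrix (rows are true classes, columns are predicted classes) and averages per-class precision; the counts are made up for illustration.

```python
# averaging_by_hand.py: macro, weighted and micro precision from a confusion matrix
cm = [
    [50, 5, 5],   # true class 0 (support 60)
    [4, 20, 6],   # true class 1 (support 30)
    [1, 2, 7],    # true class 2 (support 10)
]
K = len(cm)
n = sum(sum(row) for row in cm)

support = [sum(row) for row in cm]                               # n_k per true class
predicted = [sum(cm[i][k] for i in range(K)) for k in range(K)]  # column sums
tp = [cm[k][k] for k in range(K)]

precision_k = [tp[k] / predicted[k] for k in range(K)]

macro = sum(precision_k) / K
weighted = sum(support[k] / n * precision_k[k] for k in range(K))

# Micro: pool TP and FP over all classes before dividing
fp_total = sum(predicted[k] - tp[k] for k in range(K))
micro = sum(tp) / (sum(tp) + fp_total)
accuracy = sum(tp) / n

print(f"Macro:    {macro:.4f}")
print(f"Weighted: {weighted:.4f}")
print(f"Micro:    {micro:.4f} (accuracy = {accuracy:.4f})")
```

Macro precision is pulled down by the small class 2, weighted precision is dominated by the large class 0, and micro precision matches accuracy because every misclassification counts as exactly one false positive.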
python
# multiclass_metrics.py — Multi-class evaluation
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    classification_report, accuracy_score,
    precision_score, recall_score, f1_score
)
import numpy as np

iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Per-class report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))

for avg in ['macro', 'weighted', 'micro']:
    p = precision_score(y_test, y_pred, average=avg)
    r = recall_score(y_test, y_pred, average=avg)
    f = f1_score(y_test, y_pred, average=avg)
    print(f"{avg:>10}: P={p:.4f}, R={r:.4f}, F1={f:.4f}")

Metric Selection Framework

python
# metric_selection.py — Decision tree for metric selection
def recommend_metric(task, class_balance, cost_priority, need_probabilities):
    """Recommend the best metric based on problem characteristics."""

    if task == 'regression':
        print("Regression metrics:")
        print("  Primary: RMSE (interpretable units)")
        print("  Also report: MAE (robust to outliers), R² (vs baseline)")
        return

    # Classification
    if class_balance == 'balanced':
        if need_probabilities:
            metric = "Log Loss (for probability quality) + ROC-AUC"
        else:
            metric = "F1-Score (macro for multi-class)"
    else:  # imbalanced
        if cost_priority == 'false_negatives':
            metric = "Recall + F2-Score + PR-AUC"
        elif cost_priority == 'false_positives':
            metric = "Precision + F0.5-Score"
        else:
            metric = "MCC + PR-AUC (most robust for imbalanced)"

    print(f"Recommended: {metric}")

# Examples
print("=== Disease screening ===")
recommend_metric('classification', 'imbalanced', 'false_negatives', True)

print("\n=== Spam filter ===")
recommend_metric('classification', 'imbalanced', 'false_positives', True)

print("\n=== House price prediction ===")
recommend_metric('regression', 'n/a', 'n/a', False)

print("\n=== Balanced binary classification ===")
recommend_metric('classification', 'balanced', 'equal', True)

Cross-Validated Evaluation

python
# cv_evaluation.py — Robust evaluation with cross-validation
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate, StratifiedKFold
import numpy as np

data = load_breast_cancer()
X, y = data.data, data.target

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)

scoring = {
    'accuracy': 'accuracy',
    'precision': 'precision',
    'recall': 'recall',
    'f1': 'f1',
    'roc_auc': 'roc_auc',
    'mcc': 'matthews_corrcoef',
}

results = cross_validate(model, X, y, cv=cv, scoring=scoring)

print(f"{'Metric':<15} {'Mean':>10} {'Std':>8} {'Min':>10} {'Max':>10}")
print("-" * 55)
for metric in scoring:
    values = results[f'test_{metric}']
    print(f"{metric:<15} {values.mean():>10.4f} {values.std():>8.4f} "
          f"{values.min():>10.4f} {values.max():>10.4f}")

Quick Reference

Classification

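| Metric | Formula | Range | Higher is Better |
|--------|---------|-------|------------------|
| Accuracy | $(TP+TN)/(TP+TN+FP+FN)$ | $[0, 1]$ | Yes |
| Precision | $TP/(TP+FP)$ | $[0, 1]$ | Yes |
| Recall | $TP/(TP+FN)$ | $[0, 1]$ | Yes |
| F1 | $\frac{2PR}{P+R}$ | $[0, 1]$ | Yes |
| ROC-AUC | Area under ROC | $[0, 1]$ | Yes |
| PR-AUC | Area under PR curve | $[0, 1]$ | Yes |
| MCC | $\frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$ | $[-1, 1]$ | Yes |
| Log Loss | $-\frac{1}{n}\sum \left[ y \log \hat{p} + (1-y)\log(1-\hat{p}) \right]$ | $[0, \infty)$ | No (lower) |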

Regression

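| Metric | Formula | Range | Lower is Better |
|--------|---------|-------|-----------------|
| MSE | $\frac{1}{n}\sum (y-\hat{y})^2$ | $[0, \infty)$ | Yes |
| RMSE | $\sqrt{\text{MSE}}$ | $[0, \infty)$ | Yes |
| MAE | $\frac{1}{n}\sum \lvert y-\hat{y} \rvert$ | $[0, \infty)$ | Yes |
| R-squared | $1 - SS_{res}/SS_{tot}$ | $(-\infty, 1]$ | No (higher) |
| MAPE | $\frac{1}{n}\sum \lvert y-\hat{y} \rvert / \lvert y \rvert$ | $[0, \infty)$ | Yes |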

Further Reading

"What I cannot create, I do not understand." — Richard Feynman