🤖Statistical Prediction

Model Evaluation Metrics


Why This Matters

Choosing the right evaluation metric is one of the most consequential decisions you'll make in any machine learning project—and it's exactly what you're being tested on. The metric you select shapes how your model learns, what errors it prioritizes, and whether your predictions actually solve the problem at hand. A model optimized for accuracy might fail catastrophically on imbalanced data, while one tuned for precision could miss critical positive cases. Understanding these trade-offs is fundamental to model selection, hyperparameter tuning, and communicating results.

This guide organizes metrics by their underlying purpose: measuring prediction error magnitude, explaining variance, evaluating classification decisions, and comparing models. Don't just memorize formulas—know when each metric is appropriate, what its limitations are, and how different metrics can tell contradictory stories about the same model. Exam questions often present scenarios where you must justify your choice of metric or interpret conflicting results.


Error Magnitude Metrics (Regression)

These metrics quantify how far your predictions deviate from actual values. The key distinction is how they weight errors—squared terms amplify large errors, while absolute terms treat all errors equally.

Mean Squared Error (MSE)

  • Formula: $MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ — averages the squared differences between predicted and actual values
  • Outlier sensitivity is the defining characteristic; squaring errors means one large miss can dominate the metric
  • Optimization-friendly because the squared term is differentiable everywhere, making it the default loss function for many algorithms

Root Mean Squared Error (RMSE)

  • Formula: $RMSE = \sqrt{MSE}$ — transforms MSE back to the original units of the target variable
  • Interpretability advantage over MSE; if predicting home prices in dollars, RMSE is also in dollars
  • Same outlier sensitivity as MSE—taking the square root doesn't change which errors dominate, just the scale

Mean Absolute Error (MAE)

  • Formula: $MAE = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$ — averages the absolute differences without squaring
  • Robust to outliers because large errors aren't amplified; useful when extreme values are noise rather than signal
  • Median-like behavior—optimizing for MAE pushes predictions toward the conditional median rather than mean

Mean Absolute Percentage Error (MAPE)

  • Formula: $MAPE = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$ — expresses error as a percentage of actual values
  • Scale-independent interpretation makes it useful for comparing models across different datasets or units
  • Undefined when $y_i = 0$ and asymmetric—overestimates are penalized differently than underestimates of the same magnitude

Compare: MSE/RMSE vs. MAE—both measure prediction error, but MSE/RMSE penalize large errors quadratically while MAE treats all errors linearly. If an FRQ asks which metric to use when outliers represent genuine phenomena (not noise), choose RMSE; if outliers are measurement errors, choose MAE.
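
A minimal sketch with NumPy, using invented home-price numbers (all values here are hypothetical), makes the comparison concrete: the single large miss inflates MSE and RMSE far more than MAE.

```python
import numpy as np

# Hypothetical actual and predicted home prices in dollars (illustrative only)
y_true = np.array([200_000, 310_000, 250_000, 480_000, 1_500_000])
y_pred = np.array([210_000, 300_000, 240_000, 500_000,   900_000])

errors = y_true - y_pred

mse  = np.mean(errors ** 2)                    # squaring lets the one $600k miss dominate
rmse = np.sqrt(mse)                            # back in dollars, same outlier sensitivity
mae  = np.mean(np.abs(errors))                 # every dollar of error counts equally
mape = np.mean(np.abs(errors / y_true)) * 100  # percentage error; undefined if any y_true == 0

print(f"MSE:  {mse:,.0f}")
print(f"RMSE: {rmse:,.0f}")   # roughly $269k, dominated by the outlier
print(f"MAE:  {mae:,.0f}")    # roughly $130k, much smaller
print(f"MAPE: {mape:.1f}%")
```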


Variance Explanation Metrics (Regression)

These metrics tell you how much of the target variable's variability your model captures. They answer a different question than error metrics: not "how wrong are predictions?" but "how much pattern did we find?"

R-squared (Coefficient of Determination)

  • Formula: $R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}$ — proportion of variance explained by the model
  • Range typically 0 to 1 for in-sample data, where 1 means perfect prediction and 0 means no better than predicting the mean
  • Always increases with more predictors, even useless ones—this is the fundamental limitation that motivates adjusted versions

Adjusted R-squared

  • Formula: $R^2_{adj} = 1 - \frac{(1-R^2)(n-1)}{n-p-1}$ — penalizes model complexity by accounting for the number of predictors $p$
  • Can decrease when adding predictors that don't improve fit enough to justify their inclusion
  • Essential for model comparison when candidate models have different numbers of features; raw $R^2$ will always favor the more complex model

Compare: $R^2$ vs. Adjusted $R^2$—both measure variance explained, but only Adjusted $R^2$ accounts for model complexity. When comparing a 3-predictor model to a 10-predictor model, always use Adjusted $R^2$ or you'll systematically favor overfitting.
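
A short sketch of both formulas, computed by hand on hypothetical fitted values, shows how the adjustment penalizes the same raw fit more heavily as the assumed predictor count grows.

```python
import numpy as np

def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)            # SS_res: unexplained variation
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)   # SS_tot: total variation around the mean
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n, p):
    # n = number of observations, p = number of predictors
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Hypothetical actual and fitted values for 6 observations
y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0, 13.0])
y_pred = np.array([3.5, 4.5, 7.5, 8.5, 11.5, 12.5])

r2 = r_squared(y_true, y_pred)
print(f"R^2 = {r2:.3f}")
# Same raw fit, but the penalty grows with the assumed number of predictors
print(f"adjusted (n=6, p=1): {adjusted_r_squared(r2, n=6, p=1):.3f}")
print(f"adjusted (n=6, p=3): {adjusted_r_squared(r2, n=6, p=3):.3f}")
```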


Classification Decision Metrics

These metrics evaluate how well your model assigns observations to categories. The core tension is between catching all positive cases (recall) and avoiding false alarms (precision).

Accuracy

  • Formula: $Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$ — proportion of all predictions that are correct
  • Misleading on imbalanced data—a model predicting "no fraud" for every transaction achieves 99.9% accuracy if only 0.1% are fraudulent
  • Appropriate baseline when classes are balanced and all errors are equally costly

Precision

  • Formula: $Precision = \frac{TP}{TP + FP}$ — of all positive predictions, what fraction were actually positive
  • Minimizes false positives; critical when false alarms are costly (spam filtering, where legitimate emails in spam folder frustrate users)
  • Can be artificially inflated by making very few positive predictions—high precision doesn't mean you're finding all positives

Recall (Sensitivity)

  • Formula: $Recall = \frac{TP}{TP + FN}$ — of all actual positives, what fraction did we correctly identify
  • Minimizes false negatives; critical when missing positives is dangerous (cancer screening, fraud detection, safety systems)
  • Trade-off with precision—lowering the classification threshold increases recall but typically decreases precision

F1 Score

  • Formula: $F1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}$ — harmonic mean of precision and recall
  • Balances both error types when you can't afford to sacrifice either; harmonic mean punishes extreme imbalances
  • Single summary metric useful when you need one number to compare models but care about both false positives and false negatives

Compare: Precision vs. Recall—both focus on positive predictions but from opposite perspectives. Precision asks "when we say positive, are we right?" while Recall asks "did we find all the positives?" If an FRQ describes a medical screening scenario, emphasize recall; for a spam filter, emphasize precision.
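
The sketch below computes all four metrics directly from confusion-matrix counts; the fraud numbers are invented to illustrate how accuracy can look excellent while recall stays poor.

```python
def classification_metrics(tp, tn, fp, fn):
    # All four decision metrics derived from the confusion-matrix counts
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1

# Hypothetical imbalanced fraud data: 10 frauds out of 10,000 transactions.
# The model catches 6 frauds, misses 4, and raises 20 false alarms.
acc, prec, rec, f1 = classification_metrics(tp=6, tn=9970, fp=20, fn=4)
print(f"accuracy={acc:.4f}, precision={prec:.2f}, recall={rec:.2f}, F1={f1:.2f}")
# Accuracy is ~0.998 even though the model misses 40% of frauds (recall = 0.60)
```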


Probabilistic and Threshold-Independent Metrics

These metrics evaluate the quality of predicted probabilities or performance across all possible decision thresholds. They're essential when you need to tune the precision-recall trade-off or when probability calibration matters.

Area Under the ROC Curve (AUC-ROC)

  • Measures discrimination ability across all classification thresholds by plotting true positive rate vs. false positive rate
  • Range 0 to 1, where 0.5 indicates random guessing and 1.0 indicates perfect separation of classes
  • Threshold-independent evaluation—useful when the optimal decision threshold isn't known or varies by application

Log Loss (Cross-Entropy)

  • Formula: $LogLoss = -\frac{1}{n}\sum_{i=1}^{n}[y_i \log(\hat{p}_i) + (1-y_i)\log(1-\hat{p}_i)]$ — penalizes confident wrong predictions heavily
  • Evaluates probability quality, not just classification decisions; a model predicting 0.51 for a true positive is penalized less than one predicting 0.99 for a false positive
  • Lower is better; commonly used as the loss function for logistic regression and neural network classifiers

Confusion Matrix

  • Four-cell table showing TP, TN, FP, and FN counts—the foundation from which accuracy, precision, recall, and F1 are calculated
  • Reveals error patterns that single metrics hide; two models with identical accuracy might have very different confusion matrices
  • Multi-class extension creates an $n \times n$ matrix where off-diagonal elements show which classes are confused with each other
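
Assuming scikit-learn is available, a quick sketch with made-up three-class labels shows how the off-diagonal cells expose which classes get mixed up.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical three-class labels (0, 1, 2) for ten observations
y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2, 0]
y_pred = [0, 1, 1, 1, 0, 2, 2, 1, 2, 0]

# Rows are actual classes, columns are predicted classes;
# off-diagonal counts reveal which pairs of classes the model confuses.
print(confusion_matrix(y_true, y_pred))
```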

Compare: AUC-ROC vs. Log Loss—both evaluate probabilistic classifiers, but AUC-ROC measures ranking ability (can the model separate classes?) while Log Loss measures calibration (are the predicted probabilities accurate?). A model can have high AUC but poor Log Loss if it ranks correctly but assigns overconfident probabilities.
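
A small sketch (hand-picked probabilities, assuming scikit-learn) makes the distinction concrete: both models rank the classes perfectly, but the overconfident one pays a heavy log-loss penalty for a single confident miss.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, log_loss

y_true = np.array([0, 0, 0, 1, 1, 1])

# Model A: modest, reasonably calibrated probabilities
p_a = np.array([0.10, 0.20, 0.30, 0.70, 0.80, 0.90])
# Model B: same class ranking, but overconfident; it assigns 0.95 to an actual negative
p_b = np.array([0.01, 0.02, 0.95, 0.96, 0.98, 0.99])

for name, p in [("A", p_a), ("B", p_b)]:
    print(name,
          "AUC:", roc_auc_score(y_true, p),
          "log loss:", round(log_loss(y_true, p), 3))
# Both models reach AUC = 1.0, but B's single confident miss roughly doubles its log loss.
```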


Model Selection and Validation Metrics

These metrics help you choose between competing models and estimate real-world performance. The core challenge is avoiding overfitting—models that memorize training data but fail on new observations.

Cross-Validation

  • K-fold procedure partitions data into $k$ subsets, trains on $k-1$ folds, and validates on the held-out fold, rotating through all combinations
  • Reduces variance in performance estimates compared to a single train-test split; standard practice is $k = 5$ or $k = 10$
  • Enables overfitting assessment—a model with high training accuracy but poor cross-validation accuracy is overfitting
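
A minimal sketch of 5-fold cross-validation, assuming scikit-learn and its bundled diabetes dataset, scoring each held-out fold with $R^2$:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Built-in regression dataset used purely for illustration
X, y = load_diabetes(return_X_y=True)
model = LinearRegression()

# Each fold is held out exactly once; each score is R^2 on that held-out fold
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print("fold R^2:", scores.round(3), "mean:", round(scores.mean(), 3))
```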

Akaike Information Criterion (AIC)

  • Formula: $AIC = 2k - 2\ln(\hat{L})$ — balances model fit (likelihood $\hat{L}$) against complexity (number of parameters $k$)
  • Lower AIC indicates better model when comparing candidates; penalizes adding parameters that don't sufficiently improve fit
  • Relative metric only—AIC values are meaningless in isolation; only differences between models matter
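
A brief sketch of an AIC comparison, assuming statsmodels is installed and using simulated data in which only the first two predictors carry signal:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 5))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=n)  # only the first 2 predictors matter

# Fit candidate models with 2 and 5 predictors; lower AIC is preferred,
# and only the difference between the two values is meaningful.
for p in (2, 5):
    fit = sm.OLS(y, sm.add_constant(X[:, :p])).fit()
    print(f"{p} predictors: AIC = {fit.aic:.1f}")
```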

Compare: Cross-Validation vs. AIC—both guide model selection, but cross-validation directly estimates out-of-sample performance while AIC provides a theoretical approximation based on information theory. Cross-validation is computationally expensive but makes fewer assumptions; AIC is fast but relies on likelihood-based, large-sample approximations.


Quick Reference Table

Concept | Best Examples
Error magnitude (outlier-sensitive) | MSE, RMSE
Error magnitude (robust to outliers) | MAE
Error magnitude (scale-independent) | MAPE
Variance explained | $R^2$, Adjusted $R^2$
Classification with balanced classes | Accuracy
Minimizing false positives | Precision
Minimizing false negatives | Recall
Balancing precision and recall | F1 Score
Threshold-independent evaluation | AUC-ROC
Probability calibration | Log Loss
Model complexity trade-off | AIC, Adjusted $R^2$
Generalization estimation | Cross-Validation

Self-Check Questions

  1. You're building a model to detect fraudulent credit card transactions, where fraud represents 0.2% of all transactions. Why would accuracy be a poor choice of metric, and what would you use instead?

  2. Compare and contrast MSE and MAE: under what data conditions would they rank two models differently, and what does that difference reveal about the models?

  3. A colleague reports that their regression model achieves $R^2 = 0.95$ on training data. What additional metric would you request before concluding this is a good model, and why?

  4. Explain why a model could have high AUC-ROC but poor precision at a specific threshold. What does this tell you about the model's probability predictions?

  5. You're comparing three regression models with 3, 7, and 12 predictors respectively. The 12-predictor model has the highest $R^2$. Describe the analysis you would conduct to determine which model to deploy.