Choosing the right evaluation metric is one of the most consequential decisions you'll make in any machine learning project—and it's exactly what you're being tested on. The metric you select shapes how your model learns, what errors it prioritizes, and whether your predictions actually solve the problem at hand. A model optimized for accuracy might fail catastrophically on imbalanced data, while one tuned for precision could miss critical positive cases. Understanding these trade-offs is fundamental to model selection, hyperparameter tuning, and communicating results.
This guide organizes metrics by their underlying purpose: measuring prediction error magnitude, explaining variance, evaluating classification decisions, and comparing models. Don't just memorize formulas—know when each metric is appropriate, what its limitations are, and how different metrics can tell contradictory stories about the same model. Exam questions often present scenarios where you must justify your choice of metric or interpret conflicting results.
Error Magnitude Metrics (Regression)
These metrics quantify how far your predictions deviate from actual values. The key distinction is how they weight errors—squared terms amplify large errors, while absolute terms treat all errors equally.
Mean Squared Error (MSE)
Formula: $\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$; averages the squared differences between predicted and actual values
Outlier sensitivity is the defining characteristic; squaring errors means one large miss can dominate the metric
Optimization-friendly because the squared term is differentiable everywhere, making it the default loss function for many algorithms
Root Mean Squared Error (RMSE)
Formula: $\text{RMSE} = \sqrt{\text{MSE}}$; transforms MSE back to the original units of the target variable
Interpretability advantage over MSE; if predicting home prices in dollars, RMSE is also in dollars
Same outlier sensitivity as MSE—taking the square root doesn't change which errors dominate, just the scale
Mean Absolute Error (MAE)
Formula: $\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$; averages the absolute differences without squaring
Robust to outliers because large errors aren't amplified; useful when extreme values are noise rather than signal
Median-like behavior—optimizing for MAE pushes predictions toward the conditional median rather than mean
Mean Absolute Percentage Error (MAPE)
Formula: $\text{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$; expresses error as a percentage of actual values
Scale-independent interpretation makes it useful for comparing models across different datasets or units
Undefined when $y_i = 0$ and asymmetric: with non-negative forecasts, an under-prediction can contribute at most 100% error while an over-prediction is unbounded, so MAPE tends to favor models that under-predict
Compare: MSE/RMSE vs. MAE—both measure prediction error, but MSE/RMSE penalize large errors quadratically while MAE treats all errors linearly. If an FRQ asks which metric to use when outliers represent genuine phenomena (not noise), choose RMSE; if outliers are measurement errors, choose MAE.
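To make that trade-off concrete, here is a minimal NumPy sketch (all numbers are invented for illustration): model A is off by about 6 units everywhere, while model B is nearly exact except for one 30-unit miss, and the two families of metrics rank them in opposite order.

```python
import numpy as np

y_true = np.array([100.0, 102.0, 98.0, 101.0, 99.0, 100.0])
pred_a = np.array([106.0, 108.0, 92.0, 107.0, 93.0, 106.0])   # consistent ~6-unit errors
pred_b = np.array([100.0, 102.0, 98.0, 101.0, 99.0, 130.0])   # exact except one 30-unit miss

def mse(y, yhat):
    return np.mean((y - yhat) ** 2)

def rmse(y, yhat):
    return np.sqrt(mse(y, yhat))

def mae(y, yhat):
    return np.mean(np.abs(y - yhat))

def mape(y, yhat):
    # Undefined when any actual is 0; safe here because every actual is nonzero.
    return 100.0 * np.mean(np.abs((y - yhat) / y))

for name, pred in [("A", pred_a), ("B", pred_b)]:
    print(name, mse(y_true, pred), rmse(y_true, pred),
          mae(y_true, pred), mape(y_true, pred))
# Model A: MSE 36, RMSE 6, MAE 6.  Model B: MSE 150, RMSE ~12.2, MAE 5.
# MSE/RMSE prefer A because B's single large miss is squared and dominates,
# while MAE, which weights every error linearly, prefers B.
```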
Variance Explanation Metrics (Regression)
These metrics tell you how much of the target variable's variability your model captures. They answer a different question than error metrics: not "how wrong are predictions?" but "how much pattern did we find?"
R-squared (Coefficient of Determination)
Formula: $R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$; proportion of variance explained by the model
Range typically 0 to 1 for in-sample data, where 1 means perfect prediction and 0 means no better than predicting the mean
Never decreases when predictors are added, even useless ones; this is the fundamental limitation that motivates the adjusted version
Adjusted R-squared
Formula: $R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}$; penalizes model complexity by accounting for the number of predictors $p$
Can decrease when adding predictors that don't improve fit enough to justify their inclusion
Essential for model comparison when candidate models have different numbers of features; raw R2 will always favor the more complex model
Compare: R² vs. Adjusted R². Both measure variance explained, but only Adjusted R² accounts for model complexity. When comparing a 3-predictor model to a 10-predictor model, always use Adjusted R² or you'll systematically favor overfitting.
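A small sketch of that comparison on synthetic data (the true relationship uses a single predictor; everything else here is made up): adding five pure-noise columns can only push R² up, but the adjusted version charges for them.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 60
x_real = rng.normal(size=n)
y = 2.0 * x_real + rng.normal(size=n)         # y truly depends on one predictor

def r2_and_adjusted(X, y):
    """OLS fit via least squares; X must already contain an intercept column."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    ss_res = np.sum((y - X @ beta) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    p = X.shape[1] - 1                         # predictors, excluding the intercept
    adj = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    return r2, adj

intercept = np.ones((n, 1))
X_small = np.hstack([intercept, x_real[:, None]])          # 1 real predictor
X_big   = np.hstack([X_small, rng.normal(size=(n, 5))])    # plus 5 pure-noise predictors

print("small model:", r2_and_adjusted(X_small, y))
print("big model:  ", r2_and_adjusted(X_big, y))
# R-squared for the big model is never lower than for the small one, but its
# adjusted R-squared typically drops: the noise columns don't earn their penalty.
```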
Classification Decision Metrics
These metrics evaluate how well your model assigns observations to categories. The core tension is between catching all positive cases (recall) and avoiding false alarms (precision).
Accuracy
Formula: $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$; proportion of all predictions that are correct
Misleading on imbalanced data—a model predicting "no fraud" for every transaction achieves 99.9% accuracy if only 0.1% are fraudulent
Appropriate baseline when classes are balanced and all errors are equally costly
Precision
Formula: $\text{Precision} = \frac{TP}{TP + FP}$; of all positive predictions, what fraction were actually positive
Minimizes false positives; critical when false alarms are costly (spam filtering, where legitimate emails in spam folder frustrate users)
Can be artificially inflated by making very few positive predictions—high precision doesn't mean you're finding all positives
Recall (Sensitivity)
Formula: $\text{Recall} = \frac{TP}{TP + FN}$; of all actual positives, what fraction did we correctly identify
Minimizes false negatives; critical when missing positives is dangerous (cancer screening, fraud detection, safety systems)
Trade-off with precision—lowering the classification threshold increases recall but typically decreases precision
F1 Score
Formula: $F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$; harmonic mean of precision and recall
Balances both error types when you can't afford to sacrifice either; harmonic mean punishes extreme imbalances
Single summary metric useful when you need one number to compare models but care about both false positives and false negatives
Compare: Precision vs. Recall—both focus on positive predictions but from opposite perspectives. Precision asks "when we say positive, are we right?" while Recall asks "did we find all the positives?" If an FRQ describes a medical screening scenario, emphasize recall; for a spam filter, emphasize precision.
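The fraud scenario from the accuracy bullet makes a good worked example. A minimal sketch (counts are invented: 100,000 transactions, 200 of them fraudulent) computing all four metrics straight from the confusion counts:

```python
def classification_metrics(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # guard the 0/0 case
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1

# "Always legitimate" model: never flags a transaction.
print(classification_metrics(tp=0, fp=0, fn=200, tn=99_800))
# -> accuracy 0.998, but precision, recall, and F1 are all 0

# Model that flags 400 transactions and catches 160 of the 200 frauds.
print(classification_metrics(tp=160, fp=240, fn=40, tn=99_560))
# -> accuracy ~0.997, precision 0.40, recall 0.80, F1 ~0.53
```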
Probabilistic and Threshold-Independent Metrics
These metrics evaluate the quality of predicted probabilities or performance across all possible decision thresholds. They're essential when you need to tune the precision-recall trade-off or when probability calibration matters.
Area Under the ROC Curve (AUC-ROC)
Measures discrimination ability across all classification thresholds by plotting true positive rate vs. false positive rate
Range 0 to 1, where 0.5 indicates random guessing and 1.0 indicates perfect separation of classes
Threshold-independent evaluation—useful when the optimal decision threshold isn't known or varies by application
Log Loss (Cross-Entropy)
Formula: $\text{Log Loss} = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \ln(\hat{p}_i) + (1 - y_i)\ln(1 - \hat{p}_i)\right]$; penalizes predicted probabilities according to how far they fall from the true labels
Evaluates probability quality, not just classification decisions; a model predicting 0.51 for a true positive is penalized less than one predicting 0.99 for a false positive
Lower is better; commonly used as the loss function for logistic regression and neural network classifiers
Confusion Matrix
Four-cell table showing TP, TN, FP, and FN counts—the foundation from which accuracy, precision, recall, and F1 are calculated
Reveals error patterns that single metrics hide; two models with identical accuracy might have very different confusion matrices
Multi-class extension creates an n×n matrix where off-diagonal elements show which classes are confused with each other
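A short sketch of the multi-class case, with invented labels for three classes: the matrix is built by tallying (actual, predicted) pairs, and the off-diagonal cells show exactly which classes get confused.

```python
import numpy as np

# Hypothetical true and predicted labels for a 3-class problem (values invented).
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 0])
y_pred = np.array([0, 1, 1, 1, 2, 2, 2, 0, 2, 0])

n_classes = 3
cm = np.zeros((n_classes, n_classes), dtype=int)
for actual, predicted in zip(y_true, y_pred):
    cm[actual, predicted] += 1     # rows = actual class, columns = predicted class

print(cm)
# Diagonal cells count correct predictions; off the diagonal, cm[i, j] counts how
# often class i was mistaken for class j. The binary case collapses to TP/FP/FN/TN.
```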
Compare: AUC-ROC vs. Log Loss—both evaluate probabilistic classifiers, but AUC-ROC measures ranking ability (can the model separate classes?) while Log Loss measures calibration (are the predicted probabilities accurate?). A model can have high AUC but poor Log Loss if it ranks correctly but assigns overconfident probabilities.
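The sketch below (scores are invented) builds two models that order the examples identically, so their AUC is the same, but one pushes every probability toward 0 or 1; its two confident mistakes make its log loss far worse. AUC is computed directly from its ranking definition rather than by tracing an ROC curve.

```python
import numpy as np

def auc_by_ranking(y_true, scores):
    """AUC-ROC as the probability a random positive is scored above a random negative."""
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (pos.size * neg.size)

def log_loss(y_true, probs, eps=1e-15):
    p = np.clip(probs, eps, 1 - eps)          # guard against log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Same ordering of scores in both models, so identical ranking ability.
moderate      = np.array([0.10, 0.20, 0.30, 0.60, 0.40, 0.70, 0.80, 0.90])
overconfident = np.array([0.01, 0.02, 0.05, 0.95, 0.10, 0.97, 0.98, 0.99])

print(auc_by_ranking(y, moderate), auc_by_ranking(y, overconfident))   # both ~0.94
print(log_loss(y, moderate), log_loss(y, overconfident))               # ~0.40 vs ~0.68
# Equal AUC, but the overconfident model is punished hard for the negative it
# scored 0.95 and the positive it scored 0.10: good ranking, poor calibration.
```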
Model Selection and Validation Metrics
These metrics help you choose between competing models and estimate real-world performance. The core challenge is avoiding overfitting—models that memorize training data but fail on new observations.
Cross-Validation
K-fold procedure partitions data into k subsets, trains on k−1 folds, and validates on the held-out fold, rotating through all combinations
Reduces variance in performance estimates compared to a single train-test split; standard practice is k=5 or k=10
Prevents overfitting assessment—a model with high training accuracy but poor cross-validation accuracy is overfitting
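A bare-bones sketch of the k-fold procedure itself, using synthetic data, a simple least-squares line as the "model," and RMSE as the score; all of those choices are placeholders for whatever model and metric you're actually evaluating.

```python
import numpy as np

def kfold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs: each fold is held out exactly once."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(idx, k)
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, folds[i]

# Synthetic data: y is a noisy linear function of x.
rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(size=100)

scores = []
for train_idx, val_idx in kfold_indices(len(y), k=5):
    # Train on the k-1 folds: fit a straight line by least squares.
    slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)
    # Validate on the held-out fold: out-of-fold RMSE.
    pred = slope * x[val_idx] + intercept
    scores.append(np.sqrt(np.mean((y[val_idx] - pred) ** 2)))

print(np.mean(scores), np.std(scores))   # average held-out RMSE and its spread
```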
Akaike Information Criterion (AIC)
Formula: $\text{AIC} = 2k - 2\ln(\hat{L})$; balances model fit (likelihood $\hat{L}$) against complexity (number of parameters $k$)
Lower AIC indicates better model when comparing candidates; penalizes adding parameters that don't sufficiently improve fit
Relative metric only—AIC values are meaningless in isolation; only differences between models matter
Compare: Cross-Validation vs. AIC—both guide model selection, but cross-validation directly estimates out-of-sample performance while AIC provides a theoretical approximation based on information theory. Cross-validation is computationally expensive but makes fewer assumptions; AIC is fast but assumes the true model is among candidates.
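A sketch of an AIC comparison under the common Gaussian-error assumption for ordinary least squares, where the maximized log-likelihood can be written in terms of the residual sum of squares; the synthetic data and the choice to count the error variance as a parameter are assumptions of this example.

```python
import numpy as np

def aic_gaussian_ols(X, y):
    """AIC for an OLS fit with Gaussian errors; X must include an intercept column."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    log_lik = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)   # maximized log-likelihood
    k = X.shape[1] + 1                        # coefficients plus the error variance
    return 2 * k - 2 * log_lik

rng = np.random.default_rng(2)
n = 80
x = rng.normal(size=n)
y = 1.5 * x + rng.normal(size=n)              # true model uses a single predictor

ones = np.ones((n, 1))
X1 = np.hstack([ones, x[:, None]])                         # intercept + 1 real predictor
X2 = np.hstack([X1, rng.normal(size=(n, 4))])              # plus 4 noise predictors

print(aic_gaussian_ols(X1, y), aic_gaussian_ols(X2, y))
# The noise predictors shave a little off the RSS but usually not enough to beat
# the 2k penalty, so the smaller model typically wins (lower AIC is better).
```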
Quick Reference Table
Concept: Best Examples
Error magnitude (outlier-sensitive): MSE, RMSE
Error magnitude (robust): MAE, MAPE
Variance explained: R², Adjusted R²
Classification with balanced classes: Accuracy
Minimizing false positives: Precision
Minimizing false negatives: Recall
Balancing precision and recall: F1 Score
Threshold-independent evaluation: AUC-ROC
Probability calibration: Log Loss
Model complexity trade-off: AIC, Adjusted R²
Generalization estimation: Cross-Validation
Self-Check Questions
You're building a model to detect fraudulent credit card transactions, where fraud represents 0.2% of all transactions. Why would accuracy be a poor choice of metric, and what would you use instead?
Compare and contrast MSE and MAE: under what data conditions would they rank two models differently, and what does that difference reveal about the models?
A colleague reports that their regression model achieves R² = 0.95 on training data. What additional metric would you request before concluding this is a good model, and why?
Explain why a model could have high AUC-ROC but poor precision at a specific threshold. What does this tell you about the model's probability predictions?
You're comparing three regression models with 3, 7, and 12 predictors respectively. The 12-predictor model has the highest R². Describe the analysis you would conduct to determine which model to deploy.