Choosing the right evaluation metric is one of the most consequential decisions you'll make in any machine learning project—and it's exactly what you're being tested on. The metric you select shapes how your model learns, what errors it prioritizes, and whether your predictions actually solve the problem at hand. A model optimized for accuracy might fail catastrophically on imbalanced data, while one tuned for precision could miss critical positive cases. Understanding these trade-offs is fundamental to model selection, hyperparameter tuning, and communicating results.
This guide organizes metrics by their underlying purpose: measuring prediction error magnitude, explaining variance, evaluating classification decisions, and comparing models. Don't just memorize formulas—know when each metric is appropriate, what its limitations are, and how different metrics can tell contradictory stories about the same model. Exam questions often present scenarios where you must justify your choice of metric or interpret conflicting results.
Error Magnitude Metrics (Regression)
These metrics quantify how far your predictions deviate from actual values. The key distinction is how they weight errors—squared terms amplify large errors, while absolute terms treat all errors equally.
Mean Squared Error (MSE)
Formula: $\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$; averages the squared differences between predicted and actual values
Outlier sensitivity is the defining characteristic; squaring errors means one large miss can dominate the metric
Optimization-friendly because the squared term is differentiable everywhere, making it the default loss function for many algorithms
Root Mean Squared Error (RMSE)
Formula: $\text{RMSE} = \sqrt{\text{MSE}}$; transforms MSE back to the original units of the target variable
Interpretability advantage over MSE; if predicting home prices in dollars, RMSE is also in dollars
Same outlier sensitivity as MSE—taking the square root doesn't change which errors dominate, just the scale
Mean Absolute Error (MAE)
Formula: $\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$; averages the absolute differences without squaring
Robust to outliers because large errors aren't amplified; useful when extreme values are noise rather than signal
Median-like behavior—optimizing for MAE pushes predictions toward the conditional median rather than mean
Mean Absolute Percentage Error (MAPE)
Formula: $\text{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$; expresses error as a percentage of actual values
Scale-independent interpretation makes it useful for comparing models across different datasets or units
Undefined when $y_i = 0$ and asymmetric: with non-negative forecasts, an under-prediction can contribute at most 100% error while an over-prediction is unbounded, so MAPE tends to favor models that under-predict
Compare: MSE/RMSE vs. MAE—both measure prediction error, but MSE/RMSE penalize large errors quadratically while MAE treats all errors linearly. If an FRQ asks which metric to use when outliers represent genuine phenomena (not noise), choose RMSE; if outliers are measurement errors, choose MAE.
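To make that trade-off concrete, here is a minimal NumPy sketch (all numbers are invented for illustration): model A is off by about 6 units everywhere, while model B is nearly exact except for one 30-unit miss, and the two families of metrics rank them in opposite order.

```python
import numpy as np

y_true = np.array([100.0, 102.0, 98.0, 101.0, 99.0, 100.0])
pred_a = np.array([106.0, 108.0, 92.0, 107.0, 93.0, 106.0])   # consistent ~6-unit errors
pred_b = np.array([100.0, 102.0, 98.0, 101.0, 99.0, 130.0])   # exact except one 30-unit miss

def mse(y, yhat):
    return np.mean((y - yhat) ** 2)

def rmse(y, yhat):
    return np.sqrt(mse(y, yhat))

def mae(y, yhat):
    return np.mean(np.abs(y - yhat))

def mape(y, yhat):
    # Undefined when any actual is 0; safe here because every actual is nonzero.
    return 100.0 * np.mean(np.abs((y - yhat) / y))

for name, pred in [("A", pred_a), ("B", pred_b)]:
    print(name, mse(y_true, pred), rmse(y_true, pred),
          mae(y_true, pred), mape(y_true, pred))
# Model A: MSE 36, RMSE 6, MAE 6.  Model B: MSE 150, RMSE ~12.2, MAE 5.
# MSE/RMSE prefer A because B's single large miss is squared and dominates,
# while MAE, which weights every error linearly, prefers B.
```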
Variance Explanation Metrics (Regression)
These metrics tell you how much of the target variable's variability your model captures. They answer a different question than error metrics: not "how wrong are predictions?" but "how much pattern did we find?"
R-squared (Coefficient of Determination)
Formula: $R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$; proportion of variance explained by the model
Range typically 0 to 1 for in-sample data, where 1 means perfect prediction and 0 means no better than predicting the mean
Never decreases when predictors are added, even useless ones; this is the fundamental limitation that motivates the adjusted version
Adjusted R-squared
Formula: $R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}$; penalizes model complexity by accounting for the number of predictors $p$
Can decrease when adding predictors that don't improve fit enough to justify their inclusion
Essential for model comparison when candidate models have different numbers of features; raw R2 will always favor the more complex model
Compare: R² vs. Adjusted R². Both measure variance explained, but only Adjusted R² accounts for model complexity. When comparing a 3-predictor model to a 10-predictor model, always use Adjusted R² or you'll systematically favor overfitting.
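A small sketch of that comparison on synthetic data (the true relationship uses a single predictor; everything else here is made up): adding five pure-noise columns can only push R² up, but the adjusted version charges for them.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 60
x_real = rng.normal(size=n)
y = 2.0 * x_real + rng.normal(size=n)         # y truly depends on one predictor

def r2_and_adjusted(X, y):
    """OLS fit via least squares; X must already contain an intercept column."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    ss_res = np.sum((y - X @ beta) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    p = X.shape[1] - 1                         # predictors, excluding the intercept
    adj = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    return r2, adj

intercept = np.ones((n, 1))
X_small = np.hstack([intercept, x_real[:, None]])          # 1 real predictor
X_big   = np.hstack([X_small, rng.normal(size=(n, 5))])    # plus 5 pure-noise predictors

print("small model:", r2_and_adjusted(X_small, y))
print("big model:  ", r2_and_adjusted(X_big, y))
# R-squared for the big model is never lower than for the small one, but its
# adjusted R-squared typically drops: the noise columns don't earn their penalty.
```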
Classification Decision Metrics
These metrics evaluate how well your model assigns observations to categories. The core tension is between catching all positive cases (recall) and avoiding false alarms (precision).
Accuracy
Formula: $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$; proportion of all predictions that are correct
Misleading on imbalanced data—a model predicting "no fraud" for every transaction achieves 99.9% accuracy if only 0.1% are fraudulent
Appropriate baseline when classes are balanced and all errors are equally costly
Precision
Formula: $\text{Precision} = \frac{TP}{TP + FP}$; of all positive predictions, what fraction were actually positive
Minimizes false positives; critical when false alarms are costly (spam filtering, where legitimate emails in spam folder frustrate users)
Can be artificially inflated by making very few positive predictions—high precision doesn't mean you're finding all positives
Recall (Sensitivity)
Formula: $\text{Recall} = \frac{TP}{TP + FN}$; of all actual positives, what fraction did we correctly identify
Minimizes false negatives; critical when missing positives is dangerous (cancer screening, fraud detection, safety systems)
Trade-off with precision—lowering the classification threshold increases recall but typically decreases precision
F1 Score
Formula: $F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$; harmonic mean of precision and recall
Balances both error types when you can't afford to sacrifice either; harmonic mean punishes extreme imbalances
Single summary metric useful when you need one number to compare models but care about both false positives and false negatives
Compare: Precision vs. Recall—both focus on positive predictions but from opposite perspectives. Precision asks "when we say positive, are we right?" while Recall asks "did we find all the positives?" If an FRQ describes a medical screening scenario, emphasize recall; for a spam filter, emphasize precision.
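The fraud scenario from the accuracy bullet makes a good worked example. A minimal sketch (counts are invented: 100,000 transactions, 200 of them fraudulent) computing all four metrics straight from the confusion counts:

```python
def classification_metrics(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # guard the 0/0 case
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1

# "Always legitimate" model: never flags a transaction.
print(classification_metrics(tp=0, fp=0, fn=200, tn=99_800))
# -> accuracy 0.998, but precision, recall, and F1 are all 0

# Model that flags 400 transactions and catches 160 of the 200 frauds.
print(classification_metrics(tp=160, fp=240, fn=40, tn=99_560))
# -> accuracy ~0.997, precision 0.40, recall 0.80, F1 ~0.53
```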
Probabilistic and Threshold-Independent Metrics
These metrics evaluate the quality of predicted probabilities or performance across all possible decision thresholds. They're essential when you need to tune the precision-recall trade-off or when probability calibration matters.
Area Under the ROC Curve (AUC-ROC)
Measures discrimination ability across all classification thresholds by plotting true positive rate vs. false positive rate
Range 0 to 1, where 0.5 indicates random guessing and 1.0 indicates perfect separation of classes
Threshold-independent evaluation—useful when the optimal decision threshold isn't known or varies by application
Log Loss (Cross-Entropy)
Formula: $\text{Log Loss} = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \ln(\hat{p}_i) + (1 - y_i)\ln(1 - \hat{p}_i)\right]$; penalizes predicted probabilities according to how far they fall from the true labels
Evaluates probability quality, not just classification decisions; a model predicting 0.51 for a true positive is penalized less than one predicting 0.99 for a false positive
Lower is better; commonly used as the loss function for logistic regression and neural network classifiers
Confusion Matrix
Four-cell table showing TP, TN, FP, and FN counts—the foundation from which accuracy, precision, recall, and F1 are calculated
Reveals error patterns that single metrics hide; two models with identical accuracy might have very different confusion matrices
Multi-class extension creates an n×n matrix where off-diagonal elements show which classes are confused with each other
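A short sketch of the multi-class case, with invented labels for three classes: the matrix is built by tallying (actual, predicted) pairs, and the off-diagonal cells show exactly which classes get confused.

```python
import numpy as np

# Hypothetical true and predicted labels for a 3-class problem (values invented).
y_true = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 0])
y_pred = np.array([0, 1, 1, 1, 2, 2, 2, 0, 2, 0])

n_classes = 3
cm = np.zeros((n_classes, n_classes), dtype=int)
for actual, predicted in zip(y_true, y_pred):
    cm[actual, predicted] += 1     # rows = actual class, columns = predicted class

print(cm)
# Diagonal cells count correct predictions; off the diagonal, cm[i, j] counts how
# often class i was mistaken for class j. The binary case collapses to TP/FP/FN/TN.
```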
Compare: AUC-ROC vs. Log Loss—both evaluate probabilistic classifiers, but AUC-ROC measures ranking ability (can the model separate classes?) while Log Loss measures calibration (are the predicted probabilities accurate?). A model can have high AUC but poor Log Loss if it ranks correctly but assigns overconfident probabilities.
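The sketch below (scores are invented) builds two models that order the examples identically, so their AUC is the same, but one pushes every probability toward 0 or 1; its two confident mistakes make its log loss far worse. AUC is computed directly from its ranking definition rather than by tracing an ROC curve.

```python
import numpy as np

def auc_by_ranking(y_true, scores):
    """AUC-ROC as the probability a random positive is scored above a random negative."""
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (pos.size * neg.size)

def log_loss(y_true, probs, eps=1e-15):
    p = np.clip(probs, eps, 1 - eps)          # guard against log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Same ordering of scores in both models, so identical ranking ability.
moderate      = np.array([0.10, 0.20, 0.30, 0.60, 0.40, 0.70, 0.80, 0.90])
overconfident = np.array([0.01, 0.02, 0.05, 0.95, 0.10, 0.97, 0.98, 0.99])

print(auc_by_ranking(y, moderate), auc_by_ranking(y, overconfident))   # both ~0.94
print(log_loss(y, moderate), log_loss(y, overconfident))               # ~0.40 vs ~0.68
# Equal AUC, but the overconfident model is punished hard for the negative it
# scored 0.95 and the positive it scored 0.10: good ranking, poor calibration.
```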
Model Selection and Validation Metrics
These metrics help you choose between competing models and estimate real-world performance. The core challenge is avoiding overfitting—models that memorize training data but fail on new observations.
Cross-Validation
K-fold procedure partitions data into k subsets, trains on k−1 folds, and validates on the held-out fold, rotating through all combinations
Reduces variance in performance estimates compared to a single train-test split; standard practice is k=5 or k=10
Prevents overfitting assessment—a model with high training accuracy but poor cross-validation accuracy is overfitting
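A bare-bones sketch of the k-fold procedure itself, using synthetic data, a simple least-squares line as the "model," and RMSE as the score; all of those choices are placeholders for whatever model and metric you're actually evaluating.

```python
import numpy as np

def kfold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs: each fold is held out exactly once."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(idx, k)
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, folds[i]

# Synthetic data: y is a noisy linear function of x.
rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(size=100)

scores = []
for train_idx, val_idx in kfold_indices(len(y), k=5):
    # Train on the k-1 folds: fit a straight line by least squares.
    slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)
    # Validate on the held-out fold: out-of-fold RMSE.
    pred = slope * x[val_idx] + intercept
    scores.append(np.sqrt(np.mean((y[val_idx] - pred) ** 2)))

print(np.mean(scores), np.std(scores))   # average held-out RMSE and its spread
```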
Akaike Information Criterion (AIC)
Formula: $\text{AIC} = 2k - 2\ln(\hat{L})$; balances model fit (likelihood $\hat{L}$) against complexity (number of parameters $k$)
Lower AIC indicates better model when comparing candidates; penalizes adding parameters that don't sufficiently improve fit
Relative metric only—AIC values are meaningless in isolation; only differences between models matter
Compare: Cross-Validation vs. AIC—both guide model selection, but cross-validation directly estimates out-of-sample performance while AIC provides a theoretical approximation based on information theory. Cross-validation is computationally expensive but makes fewer assumptions; AIC is fast but assumes the true model is among candidates.
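A sketch of an AIC comparison under the common Gaussian-error assumption for ordinary least squares, where the maximized log-likelihood can be written in terms of the residual sum of squares; the synthetic data and the choice to count the error variance as a parameter are assumptions of this example.

```python
import numpy as np

def aic_gaussian_ols(X, y):
    """AIC for an OLS fit with Gaussian errors; X must include an intercept column."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    log_lik = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)   # maximized log-likelihood
    k = X.shape[1] + 1                        # coefficients plus the error variance
    return 2 * k - 2 * log_lik

rng = np.random.default_rng(2)
n = 80
x = rng.normal(size=n)
y = 1.5 * x + rng.normal(size=n)              # true model uses a single predictor

ones = np.ones((n, 1))
X1 = np.hstack([ones, x[:, None]])                         # intercept + 1 real predictor
X2 = np.hstack([X1, rng.normal(size=(n, 4))])              # plus 4 noise predictors

print(aic_gaussian_ols(X1, y), aic_gaussian_ols(X2, y))
# The noise predictors shave a little off the RSS but usually not enough to beat
# the 2k penalty, so the smaller model typically wins (lower AIC is better).
```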
Quick Reference Table
Concept: Best Examples
Error magnitude (outlier-sensitive): MSE, RMSE
Error magnitude (robust): MAE, MAPE
Variance explained: R², Adjusted R²
Classification with balanced classes: Accuracy
Minimizing false positives: Precision
Minimizing false negatives: Recall
Balancing precision and recall: F1 Score
Threshold-independent evaluation: AUC-ROC
Probability calibration: Log Loss
Model complexity trade-off: AIC, Adjusted R²
Generalization estimation: Cross-Validation
Self-Check Questions
You're building a model to detect fraudulent credit card transactions, where fraud represents 0.2% of all transactions. Why would accuracy be a poor choice of metric, and what would you use instead?
Compare and contrast MSE and MAE: under what data conditions would they rank two models differently, and what does that difference reveal about the models?
A colleague reports that their regression model achieves R² = 0.95 on training data. What additional metric would you request before concluding this is a good model, and why?
Explain why a model could have high AUC-ROC but poor precision at a specific threshold. What does this tell you about the model's probability predictions?
You're comparing three regression models with 3, 7, and 12 predictors respectively. The 12-predictor model has the highest R². Describe the analysis you would conduct to determine which model to deploy.