Model evaluation is the difference between a model that looks good on paper and one that actually works in production. You're being tested on your ability to select the right metric for the right problem—understanding why accuracy fails on imbalanced datasets, when to prioritize precision over recall, and how cross-validation prevents the overfitting trap that catches junior engineers. These concepts appear constantly in system design interviews, ML certification exams, and real-world debugging scenarios.
The techniques below demonstrate core principles of generalization, error decomposition, and threshold-dependent decision-making. Don't just memorize formulas—know what each metric reveals about your model's behavior and when to reach for it. A model with 99% accuracy might be useless; a model with 0.85 AUC might be exactly what you need. Understanding the difference is what separates competent ML engineers from the rest.
How you partition your data determines whether your evaluation reflects real-world performance or just memorization of training examples.
Compare: Holdout vs. Cross-Validation—both assess generalization, but holdout trades thoroughness for speed. Use holdout for rapid prototyping; use k-fold (typically k=5 or k=10) when you need reliable estimates for model selection or hyperparameter tuning.
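To make the tradeoff concrete, here is a minimal sketch using scikit-learn; the breast cancer dataset and logistic regression are illustrative choices, not part of any particular workflow. `train_test_split` gives one fast holdout estimate, while `cross_val_score` with k=5 averages over five different splits.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# Holdout: one quick estimate, sensitive to how the split happens to fall
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
holdout_acc = model.fit(X_train, y_train).score(X_test, y_test)

# 5-fold cross-validation: slower, but averages over five different splits
cv_scores = cross_val_score(model, X, y, cv=5)

print(f"Holdout accuracy:   {holdout_acc:.3f}")
print(f"5-fold CV accuracy: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
```

The spread of the fold scores (the standard deviation above) is the information holdout can't give you: it tells you how much your estimate depends on which examples landed in the test set.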
Classification metrics answer different questions: How often are you right? How costly are your mistakes? Which errors matter more for your use case?
Compare: F1 Score vs. AUC—F1 evaluates performance at a specific threshold and weights precision/recall equally. AUC evaluates across all thresholds and treats classes symmetrically. Use F1 when you've chosen your threshold and care about positive class performance; use AUC when comparing models or when threshold selection comes later.
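A short sketch of that distinction in practice, assuming a binary classifier trained on synthetic imbalanced data (the dataset and model here are purely illustrative): F1 needs hard labels produced by some threshold, while ROC AUC consumes the raw scores directly.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Imbalanced binary problem (~10% positives) to show why threshold choice matters
X, y = make_classification(
    n_samples=5000, n_features=20, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]  # continuous scores in [0, 1]

# AUC is threshold-free: it evaluates how well the scores rank positives above negatives
print(f"ROC AUC: {roc_auc_score(y_test, scores):.3f}")

# F1 changes depending on where you cut the scores into hard labels
for threshold in (0.3, 0.5, 0.7):
    preds = (scores >= threshold).astype(int)
    print(f"F1 @ threshold {threshold}: {f1_score(y_test, preds):.3f}")
```

Notice that a single model produces one AUC but several F1 values, one per threshold; that is exactly why AUC suits model comparison and F1 suits a deployed operating point.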
Regression metrics quantify prediction error magnitude—but they weight errors differently, so your choice shapes what "good" means.
Compare: MSE vs. R²—MSE gives you absolute error magnitude (useful for setting expectations), while R² gives you relative explanatory power (useful for comparing models). A model can have low R² but acceptable MSE if the underlying signal is inherently noisy.
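The sketch below uses synthetic data (a linear signal with deliberately heavy noise, an assumption made for illustration) to show how the two numbers answer different questions: MSE is in squared target units, while R² compares the model against a predict-the-mean baseline.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Noisy linear signal: y = 3x plus large Gaussian noise (illustrative assumption)
X = rng.uniform(0, 10, size=(1000, 1))
y = 3 * X.ravel() + rng.normal(scale=8.0, size=1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
preds = LinearRegression().fit(X_train, y_train).predict(X_test)

mse = mean_squared_error(y_test, preds)
print(f"MSE:  {mse:.2f}  (squared target units)")
print(f"RMSE: {mse ** 0.5:.2f}  (same units as y)")
print(f"R²:   {r2_score(y_test, preds):.3f}  (fraction of variance explained)")
```

Because the noise is large relative to the signal, R² lands well below 1 even though the model has recovered essentially all the learnable structure; whether the RMSE is "acceptable" depends on the units and tolerance of your application, not on the R² value.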
These techniques reveal whether your model is fundamentally flawed in its assumptions or just needs more data.
Compare: Learning Curves vs. Cross-Validation—both diagnose generalization issues, but learning curves show how performance changes with data quantity while cross-validation gives you a robust estimate at your current data size. Use learning curves to decide if data collection is worth the investment.
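One way to generate such a curve is scikit-learn's `learning_curve` helper; the sketch below (the digits dataset and a plain logistic regression are illustrative choices) reports mean training and validation accuracy at several training-set sizes.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_digits(return_X_y=True)

# Train/validation accuracy at 5 increasing training-set sizes, 5-fold CV at each size
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=5000),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
)

for n, tr, va in zip(train_sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n:4d}  train={tr:.3f}  val={va:.3f}  gap={tr - va:.3f}")

# Reading the output: a large gap that shrinks as n grows suggests variance
# (more data may help); both scores plateauing at a low value suggests bias
# (more data likely won't help; revisit features or model capacity instead).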
| Concept | Best Examples |
|---|---|
| Data splitting for generalization | Cross-Validation, Holdout Method |
| Classification threshold analysis | ROC Curve, AUC, Precision-Recall tradeoff |
| Imbalanced class handling | F1 Score, Precision, Recall, Stratified CV |
| Regression error measurement | MSE, RMSE, R², Adjusted R² |
| Model complexity diagnosis | Bias-Variance Tradeoff, Learning Curves |
| Overfitting detection | Cross-Validation, Learning Curves, Holdout validation gap |
| Error pattern analysis | Confusion Matrix |
1. You're building a fraud detection system where catching fraud matters more than avoiding false alarms. Which metrics should you prioritize, and why might accuracy be misleading here?
2. Your model achieves 95% training accuracy but only 70% validation accuracy. What does this gap indicate, and which two techniques would help you diagnose the root cause?
3. Compare and contrast R² and RMSE: In what scenario might a model have a high R² but still be unsuitable for production deployment?
4. You're comparing three classification models and haven't yet decided on a decision threshold. Which metric allows fair comparison, and what value would indicate performance no better than random guessing?
5. Your learning curve shows training and validation error both plateauing at a high value as data increases. Is this a bias or variance problem, and what's your next step to improve performance?