Cross-validation is the backbone of honest model evaluation—it's how you answer the fundamental question: will this model actually work on data it hasn't seen? You're being tested on understanding bias-variance tradeoffs, overfitting prevention, and proper experimental design. Every technique here represents a different solution to the same problem: how do we squeeze the most reliable performance estimate from limited data without cheating?
Don't just memorize which method splits data how many times. Know why you'd choose one technique over another: What happens to your variance estimate with more folds? When does temporal ordering matter? Why can't you just randomly shuffle clinical trial data? These conceptual questions—about data leakage, computational cost, and generalization guarantees—are what separate surface-level recall from genuine understanding.
These foundational techniques establish the core principle: systematically rotate which data trains the model and which data tests it to get stable performance estimates.
Compare: K-Fold vs. LOOCV—both systematically rotate validation sets, but K-Fold trades slightly higher bias for lower variance and a dramatically lower computational cost. If an FRQ asks about choosing a validation strategy for a moderate-sized dataset, K-Fold is almost always the defensible answer.
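A minimal sketch of the two strategies with scikit-learn; the synthetic dataset, logistic regression model, and fold count below are illustrative assumptions, not a prescribed setup.

```python
# Sketch: K-Fold vs. LOOCV on a small synthetic dataset.
# Data, model, and fold count are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)

# K-Fold: k model fits, each fold held out exactly once.
kfold = KFold(n_splits=10, shuffle=True, random_state=0)
kfold_scores = cross_val_score(model, X, y, cv=kfold)

# LOOCV: n model fits, one observation held out at a time.
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())

print(f"10-Fold accuracy: {kfold_scores.mean():.3f} ({len(kfold_scores)} fits)")
print(f"LOOCV accuracy:   {loo_scores.mean():.3f} ({len(loo_scores)} fits)")
```

The fit counts printed at the end make the computational tradeoff concrete: LOOCV requires one fit per observation, which is why it becomes impractical as the dataset grows.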
When your data violates the assumption of independent, identically distributed observations, standard methods leak information and produce overoptimistic estimates. These techniques preserve the structure that makes your data special.
Compare: Stratified K-Fold vs. Group K-Fold—Stratified preserves outcome distributions across folds; Group preserves observation independence by keeping related samples together. Choose based on whether your concern is class imbalance or correlated observations.
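A minimal sketch contrasting the two splitters; the imbalanced synthetic data and the made-up patient-style group labels are assumptions for illustration.

```python
# Sketch: StratifiedKFold vs. GroupKFold.
# The imbalanced data and "patient" group labels are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, GroupKFold

X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)
groups = np.repeat(np.arange(50), 4)  # e.g., 50 "patients" with 4 samples each

# Stratified: each fold keeps roughly the same class ratio as the full data.
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    print("positive rate in test fold:", round(y[test_idx].mean(), 3))

# Group: all samples from one group land in the same fold, so no group
# appears in both the training and the test set.
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups):
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])
```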
Single cross-validation runs can be noisy. These methods trade computation for more stable estimates by repeating or resampling.
Compare: Repeated K-Fold vs. Bootstrap—Repeated K-Fold averages over different partitions of the same data; Bootstrap creates genuinely different training sets through resampling. Bootstrap is better for uncertainty quantification, but can be optimistically biased for error estimation.
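A minimal sketch of both ideas; the data, the model, and the out-of-bag flavor of bootstrap scoring shown here are illustrative assumptions (other bootstrap error estimators exist).

```python
# Sketch: RepeatedKFold vs. a simple bootstrap loop scored on out-of-bag rows.
# Dataset, model, and repeat counts are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

X, y = make_classification(n_samples=300, random_state=0)
model = LogisticRegression(max_iter=1000)

# Repeated K-Fold: average over several different 5-fold partitions.
rkf = RepeatedKFold(n_splits=5, n_repeats=10, random_state=0)
rkf_scores = cross_val_score(model, X, y, cv=rkf)
print(f"Repeated 5-fold: {rkf_scores.mean():.3f} +/- {rkf_scores.std():.3f}")

# Bootstrap: resample rows with replacement, score on the rows never drawn.
rng = np.random.default_rng(0)
boot_scores = []
for _ in range(100):
    idx = rng.integers(0, len(X), size=len(X))   # sample with replacement
    oob = np.setdiff1d(np.arange(len(X)), idx)   # out-of-bag rows
    model.fit(X[idx], y[idx])
    boot_scores.append(model.score(X[oob], y[oob]))
print(f"Bootstrap (OOB): {np.mean(boot_scores):.3f} +/- {np.std(boot_scores):.3f}")
```

The spread of the 100 bootstrap scores is what gives the uncertainty quantification mentioned above; each resampled training set is genuinely different, not just a repartition of the same rows.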
When you're simultaneously tuning hyperparameters and evaluating model performance, you need extra structure to avoid selection bias contaminating your results.
Compare: Standard K-Fold vs. Nested CV—Standard K-Fold is fine if you're evaluating a fixed model, but the moment you tune hyperparameters using validation performance, you need the outer loop of Nested CV to get honest generalization estimates. This distinction is prime FRQ material.
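A minimal sketch of nested CV with scikit-learn; the SVC model, the C grid, and the synthetic data are placeholder assumptions.

```python
# Sketch: nested cross-validation. Model, parameter grid, and data are
# illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

# Inner loop: tune C with its own K-Fold split.
inner_cv = KFold(n_splits=5, shuffle=True, random_state=1)
search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=inner_cv)

# Outer loop: evaluate the whole tuning procedure on folds it never searched.
outer_cv = KFold(n_splits=5, shuffle=True, random_state=2)
nested_scores = cross_val_score(search, X, y, cv=outer_cv)

# The naive alternative, reporting search.best_score_ directly, reuses the
# same folds for both selection and evaluation and inflates the estimate.
print(f"Nested CV accuracy: {nested_scores.mean():.3f}")
```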
| Concept | Best Examples |
|---|---|
| Bias-variance tradeoff in fold count | K-Fold, LOOCV, Hold-Out |
| Preserving class distributions | Stratified K-Fold |
| Temporal data integrity | Time Series CV |
| Grouped/clustered observations | Group K-Fold |
| Reducing estimate variance | Repeated K-Fold, Random Subsampling |
| Uncertainty quantification | Bootstrap Sampling |
| Unbiased hyperparameter tuning | Nested Cross-Validation |
| Computational efficiency | Hold-Out, K-Fold with small k |
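One technique from the table, Time Series CV, is worth a quick sketch before the review questions: scikit-learn's TimeSeriesSplit only ever trains on observations that precede the test window. The toy ten-point series below is an assumption for illustration.

```python
# Sketch: TimeSeriesSplit never trains on observations that come after the
# test window. The ten-point chronological series is an illustrative assumption.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)  # observations in chronological order

for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    print("train:", train_idx, "-> test:", test_idx)
# Every training index precedes every test index, so no future information
# leaks backward; random shuffling would destroy exactly this guarantee.
```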
You have a dataset with 50 observations and severe class imbalance (5% positive). Which two cross-validation techniques would you combine, and why does each address a different problem?
A colleague uses 10-Fold CV to both select the best regularization parameter and report final model accuracy. What's wrong with this approach, and which technique fixes it?
Compare LOOCV and 10-Fold CV in terms of bias, variance, and computational cost. Under what circumstances would you choose each?
You're building a model to predict patient outcomes using data where each patient has multiple visits recorded. Standard K-Fold gives you 95% accuracy, but the model fails in deployment. What went wrong, and which validation technique should you have used?
Explain why random shuffling before Time Series CV would invalidate your results, even if your performance metrics look excellent.