🥖Linear Modeling Theory Unit 8 Review

8.3 Cross-validation Techniques

Written by the Fiveable Content Team • Last updated August 2025

Cross-validation principles and applications

Cross-validation estimates how well a model will perform on new, unseen data. Instead of relying on a single train-test split (which can give misleading results depending on how the data happens to divide), cross-validation systematically rotates which portion of the data serves as the test set. This produces more stable and trustworthy performance estimates, which is exactly what you need when choosing between competing models or tuning regularization parameters.

Principles and goals

The core idea is straightforward: partition your data into subsets, train on some subsets, validate on the held-out subset, and repeat. By cycling through multiple splits, you get several performance estimates that you can average together.

Cross-validation serves several goals in the model selection process:

  • Estimating generalization error: Rather than reporting how well a model fits its training data, cross-validation approximates how it will perform on data it hasn't seen.
  • Detecting overfitting: If a model performs well on training folds but poorly on validation folds, that gap signals overfitting.
  • Reducing split dependence: A single random split can be unrepresentative, especially with small datasets. Averaging over multiple splits smooths out that randomness.

Applications in model selection

Within the context of linear modeling, cross-validation plays a direct role in several tasks:

  • Model comparison: You can compare models with different predictor sets (e.g., a 3-variable model vs. a 7-variable model) by seeing which achieves lower cross-validated mean squared error (MSE).
  • Tuning regularization: Methods like ridge regression and LASSO have a tuning parameter λ that controls the strength of the penalty. Cross-validation helps you find the λ that minimizes prediction error.
  • Variable screening: Cross-validation can evaluate whether adding or removing a predictor actually improves out-of-sample prediction, rather than just improving in-sample fit.
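
As a sketch of the λ-tuning workflow described above (using a closed-form ridge fit and numpy only; the helper names `ridge_fit`, `cv_error`, and `best_lambda` are illustrative, not from any particular library):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate: (X'X + lam*I)^(-1) X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def cv_error(X, y, lam, k=5, seed=0):
    """k-fold cross-validated MSE of ridge regression at penalty lam."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    errs = []
    for i in range(k):
        val = folds[i]
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        beta = ridge_fit(X[tr], y[tr], lam)
        errs.append(np.mean((y[val] - X[val] @ beta) ** 2))
    return np.mean(errs)

def best_lambda(X, y, lambdas):
    """Return the penalty with the smallest cross-validated MSE."""
    scores = [cv_error(X, y, lam) for lam in lambdas]
    return lambdas[int(np.argmin(scores))]
```

In practice you would evaluate `cv_error` over a grid of λ values (often log-spaced) and pick the minimizer, or apply the one-standard-error rule discussed later.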

Implementing k-fold vs. leave-one-out

K-fold cross-validation

K-fold is the most commonly used form of cross-validation. Here's how it works:

  1. Randomly shuffle the dataset and split it into k equally sized groups (called folds).
  2. For the first iteration, hold out fold 1 as the validation set and train the model on folds 2 through k.
  3. Compute a performance metric (e.g., MSE) on the held-out fold.
  4. Repeat for each fold, so every fold serves as the validation set exactly once.
  5. Average the k performance metrics to get the cross-validated estimate.
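
The five steps above can be sketched for an OLS model with numpy (a minimal illustration; the function name `kfold_cv_mse` is ours):

```python
import numpy as np

def kfold_cv_mse(X, y, k=5, seed=0):
    """Average validation MSE over k folds for an OLS fit."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))        # step 1: shuffle
    folds = np.array_split(idx, k)       # step 1: split into k folds
    fold_mse = []
    for i in range(k):                   # step 4: rotate the held-out fold
        val = folds[i]                   # step 2: hold out fold i
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        beta, *_ = np.linalg.lstsq(X[tr], y[tr], rcond=None)
        fold_mse.append(np.mean((y[val] - X[val] @ beta) ** 2))  # step 3
    return np.mean(fold_mse)             # step 5: average
```

With simulated data whose noise variance is 1, the returned estimate should land near 1, since CV approximates the expected squared prediction error.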

Common choices are k = 5 or k = 10. The choice involves a bias-variance tradeoff in the estimate itself:

  • Smaller k (e.g., 5): Each training set is smaller (80% of the data), so the performance estimate has slightly more bias (it underestimates how well the model would do with all the data). But the estimates across folds are less correlated, so variance is lower.
  • Larger k (e.g., 10 or more): Each training set is closer to the full dataset size, reducing bias. But the training sets overlap heavily, making the fold-level estimates more correlated and potentially increasing variance.

For most linear modeling applications, 5-fold or 10-fold cross-validation works well.

Leave-one-out cross-validation (LOOCV)

LOOCV is the extreme case where k = n (the number of observations). Each iteration holds out a single observation, trains on the remaining n − 1, and predicts the held-out point.

  • Advantage: Nearly unbiased estimate of prediction error, since each training set is almost the full dataset.
  • Disadvantage: Computationally expensive for large n, since you fit the model n times. Also, because the n training sets differ by only one observation, the resulting estimates are highly correlated, which can inflate the variance of the overall CV estimate.

For ordinary least squares regression, there's a useful shortcut. The LOOCV estimate can be computed from a single model fit using the hat matrix H:

$$\text{CV}_{(n)} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{e_i}{1 - h_{ii}} \right)^2$$

where e_i is the i-th residual and h_ii is the i-th diagonal element of the hat matrix. This avoids fitting n separate models, making LOOCV practical for linear models even with moderately large datasets.
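
A quick numpy check of this identity (illustrative code, not from any particular library): for OLS, the single-fit shortcut matches the brute-force refit-n-times computation exactly.

```python
import numpy as np

def loocv_shortcut(X, y):
    """LOOCV MSE from a single OLS fit via the hat-matrix identity."""
    H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix H = X (X'X)^(-1) X'
    e = y - H @ y                           # residuals from the full fit
    h = np.diag(H)                          # leverages h_ii
    return np.mean((e / (1 - h)) ** 2)

def loocv_explicit(X, y):
    """Brute-force LOOCV: refit n times, predict each held-out point."""
    n = len(y)
    errs = np.empty(n)
    for i in range(n):
        mask = np.ones(n, dtype=bool)
        mask[i] = False
        beta, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
        errs[i] = (y[i] - X[i] @ beta) ** 2
    return errs.mean()
```

The shortcut replaces n model fits with one fit plus a diagonal extraction, which is why LOOCV is cheap for linear models specifically.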

LOOCV is most useful when your dataset is small and you can't afford to set aside a large validation fold.


Interpreting cross-validation results

Comparing models and selecting optimal complexity

When you run cross-validation on several candidate models, you'll get an average performance metric and its variability across folds for each model. Here's how to use those results:

  • Compare average CV error: Lower average MSE (or higher R²) across folds indicates better expected performance on new data.
  • Check fold-to-fold variability: Large standard deviation across folds suggests the model's performance is unstable. This can indicate overfitting or sensitivity to which observations end up in the training set.
  • Apply the one-standard-error rule: A common and practical guideline is to choose the simplest model whose CV error is within one standard error of the minimum CV error. This favors parsimony when two models perform similarly, which aligns with the goals of variable screening.

For example, if you're comparing polynomial regression models of degrees 1 through 6, you might find that degree 3 has the lowest CV error but degree 2 is within one standard error of that minimum. The one-standard-error rule would favor the degree-2 model for its simplicity.
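The one-standard-error rule is easy to express in code. Here is a minimal sketch (the helper name `one_se_choice` is ours), applied to the polynomial-degree example above:

```python
import numpy as np

def one_se_choice(complexities, cv_means, cv_ses):
    """Simplest model whose CV error is within one SE of the minimum.
    `complexities` must be ordered from simplest to most complex."""
    cv_means = np.asarray(cv_means, dtype=float)
    best = int(np.argmin(cv_means))
    threshold = cv_means[best] + cv_ses[best]
    for c, m in zip(complexities, cv_means):
        if m <= threshold:          # first (simplest) model under the bar wins
            return c
```

Scanning from simplest to most complex guarantees the parsimony-favoring behavior: the first model to clear the threshold is returned, even if a more complex model has a slightly lower mean error.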

Nested cross-validation

A subtle but important problem arises when you use cross-validation both to tune hyperparameters and to report final model performance. If you use the same CV loop for both, the reported performance is optimistically biased because the hyperparameters were chosen to look good on those particular folds.

Nested cross-validation solves this with two layers:

  1. Outer loop: Splits data into training and test folds (just like standard k-fold). This loop produces the final, unbiased performance estimate.
  2. Inner loop: Within each outer training set, runs another round of cross-validation to select the best hyperparameters or model specification.
  3. The model chosen by the inner loop is then evaluated on the outer test fold.

This structure keeps the outer test data completely untouched during model selection, preventing data leakage. It's more computationally demanding (if you use 5-fold for both loops, you fit 5 × 5 = 25 models), but it gives you an honest estimate of how well your entire model-selection pipeline will perform on truly new data.
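
The two-layer structure can be sketched as follows, here tuning a ridge penalty in the inner loop (numpy only; all function names are illustrative):

```python
import numpy as np

def ridge_beta(X, y, lam):
    """Closed-form ridge estimate: (X'X + lam*I)^(-1) X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def split(idx, k):
    """Yield (train_idx, val_idx) pairs over k folds of the given indices."""
    folds = np.array_split(idx, k)
    for i in range(k):
        yield np.concatenate([folds[j] for j in range(k) if j != i]), folds[i]

def nested_cv_mse(X, y, lambdas, k_outer=5, k_inner=5, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    outer_errs = []
    for tr, te in split(idx, k_outer):                    # outer loop
        inner_scores = []
        for lam in lambdas:                               # inner loop: tune lam
            errs = [np.mean((y[v] - X[v] @ ridge_beta(X[t], y[t], lam)) ** 2)
                    for t, v in split(tr, k_inner)]
            inner_scores.append(np.mean(errs))
        best_lam = lambdas[int(np.argmin(inner_scores))]  # chosen on inner folds only
        beta = ridge_beta(X[tr], y[tr], best_lam)
        outer_errs.append(np.mean((y[te] - X[te] @ beta) ** 2))
    return np.mean(outer_errs)                            # honest pipeline estimate
```

Note that each outer test fold is touched exactly once, and only after λ has been fixed using the inner folds, which is what prevents the optimistic bias described above.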

Advantages of cross-validation

Robustness compared to a single train-test split

A single random split can be misleading. If the test set happens to contain mostly "easy" observations, performance looks artificially good; if it contains outliers or unusual cases, performance looks artificially bad. Cross-validation averages over multiple splits, so no single unlucky partition dominates the result.

Cross-validation also uses data more efficiently. In a single 80/20 split, 20% of your data never contributes to training. In k-fold CV, every observation appears in the training set k − 1 times and in the validation set exactly once.

Benefits for small datasets

When data is limited, cross-validation becomes especially valuable:

  • You can't afford to lock away a large chunk of data as a permanent test set. Cross-validation lets every observation contribute to both training and evaluation.
  • Performance estimates from small test sets are noisy. Averaging across folds reduces that noise.
  • LOOCV, despite its variance concerns, maximizes training set size for each iteration, which matters when every observation counts.

Even with cross-validation, very small datasets will produce estimates with meaningful uncertainty. Reporting the standard error of your CV estimate (not just the mean) gives a more honest picture of how confident you should be in the result.
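
Reporting both numbers is a one-line computation from the fold-level errors (illustrative helper, ours):

```python
import numpy as np

def cv_mean_and_se(fold_errors):
    """Summarize fold-level errors as (mean, standard error of the mean)."""
    e = np.asarray(fold_errors, dtype=float)
    k = len(e)
    return e.mean(), e.std(ddof=1) / np.sqrt(k)
```

A report like "CV MSE = 1.00 ± 0.07" conveys both the point estimate and its uncertainty, which is exactly what the one-standard-error rule needs as input.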