Fiveable

🎲Data Science Statistics Unit 17 Review


17.3 Cross-validation and Model Selection


Written by the Fiveable Content Team • Last updated August 2025

Cross-validation and model selection are crucial for building reliable machine learning models. These techniques help assess model performance, prevent overfitting, and choose the best model for a given task.

By using methods like k-fold cross-validation and hyperparameter tuning, we can optimize our models and ensure they generalize well to new data. This connects to the broader themes of statistical learning theory and regularization.

Cross-validation Techniques

K-fold and Leave-one-out Cross-validation

  • K-fold cross-validation divides data into K subsets (folds) for model evaluation
    • Typically uses 5 or 10 folds
    • Trains model on K-1 folds and tests on remaining fold
    • Repeats process K times, with each fold serving as test set once
    • Provides robust estimate of model performance across different data partitions
  • Leave-one-out cross-validation represents extreme case of K-fold where K equals number of data points
    • Trains model on all but one data point, tests on excluded point
    • Repeats for each data point in dataset
    • Computationally intensive for large datasets but provides nearly unbiased estimate of model performance
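The two schemes above can be sketched with scikit-learn's `cross_val_score`, using the iris dataset and logistic regression as stand-in choices (any estimator and dataset would do):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)  # 150 samples, 3 classes
model = LogisticRegression(max_iter=1000)

# 5-fold CV: train on 4 folds, test on the held-out fold, repeat 5 times
kfold_scores = cross_val_score(
    model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print(f"5-fold mean accuracy: {kfold_scores.mean():.3f}")

# Leave-one-out: K equals the number of data points, so 150 model fits here
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print(f"LOO mean accuracy: {loo_scores.mean():.3f}")
```

Note the cost difference: 5 fits for K-fold versus one fit per observation for leave-one-out, which is why LOO becomes impractical on large datasets.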

Holdout Method and Data Partitioning

  • Holdout method splits data into separate training, validation, and test sets
  • Training set comprises the largest portion (60-80%) and is used to fit model parameters
    • Exposes model to diverse examples for learning patterns and relationships
  • Validation set (10-20%) evaluates model performance during development
    • Helps tune hyperparameters and select best-performing model
  • Test set (10-20%) assesses final model performance on unseen data
    • Provides unbiased estimate of model's generalization ability
  • Stratified sampling ensures representative distribution of target variable across sets
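A 60/20/20 three-way split with stratification can be built from two calls to scikit-learn's `train_test_split` (the exact percentages are the illustrative ones from above):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # 150 samples

# First split off the 20% test set, stratified on the target
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Then carve a validation set out of the remainder: 0.25 * 80% = 20% overall
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 90 30 30
```

`stratify=y` preserves the class proportions in every partition, which matters most when classes are imbalanced.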

Model Fitting and Complexity

Overfitting and Underfitting

  • Overfitting occurs when model learns noise in training data too closely
    • Results in high training accuracy but poor generalization to new data
    • Characterized by complex model with many parameters
    • Can be mitigated through regularization techniques (L1, L2 regularization)
  • Underfitting happens when model fails to capture underlying patterns in data
    • Produces poor performance on both training and test data
    • Often results from overly simple model or insufficient training
    • Addressed by increasing model complexity or using more sophisticated algorithms
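The L2-regularization remedy mentioned above can be seen directly by comparing fitted coefficients; this sketch assumes a synthetic noisy sine dataset and a deliberately over-flexible degree-15 polynomial model:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 30)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 30)

# Degree-15 polynomial: flexible enough to chase the noise (overfitting)
overfit = make_pipeline(PolynomialFeatures(15), LinearRegression()).fit(X, y)
# Same features with L2 regularization (Ridge) shrinks the coefficients
ridge = make_pipeline(PolynomialFeatures(15), Ridge(alpha=1.0)).fit(X, y)

coef_unreg = overfit.named_steps["linearregression"].coef_
coef_ridge = ridge.named_steps["ridge"].coef_
print("max |coef| unregularized:", np.abs(coef_unreg).max())
print("max |coef| ridge:", np.abs(coef_ridge).max())
```

The wildly large unregularized coefficients are the signature of a model fitting noise; the ridge penalty keeps them small at the cost of a little bias.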

Bias-Variance Tradeoff and Model Complexity

  • Bias-variance tradeoff balances model's ability to fit training data vs. generalize to new data
    • High bias models tend to underfit, missing important patterns (linear regression)
    • High variance models tend to overfit, capturing noise (decision trees)
  • Model complexity directly influences bias-variance tradeoff
    • Simple models (low complexity) often have high bias but low variance
    • Complex models (high complexity) typically have low bias but high variance
  • Optimal model complexity minimizes total error (bias + variance)
    • Achieved through techniques like cross-validation and regularization
  • Learning curves help visualize relationship between model complexity and performance
    • Plot training and validation error against model complexity or training set size
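scikit-learn's `learning_curve` computes exactly the training-versus-validation scores described above; this sketch prints them at increasing training-set sizes rather than plotting:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_iris(return_X_y=True)

# Scores at 5 training-set sizes, each evaluated with 5-fold CV
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=5,
    shuffle=True, random_state=0)

for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n:3d}  train={tr:.3f}  validation={va:.3f}")
```

A large persistent gap between the two curves suggests high variance (overfitting); two curves that plateau together at a poor score suggest high bias (underfitting).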

Hyperparameter Tuning

Hyperparameter Optimization Techniques

  • Hyperparameter tuning adjusts model configuration to optimize performance
    • Includes parameters not learned during training (learning rate, regularization strength)
  • Grid search systematically evaluates all combinations of predefined hyperparameter values
    • Creates grid of possible combinations and tests each one
    • Computationally expensive for large hyperparameter spaces
  • Random search samples hyperparameter values from specified distributions
    • Often more efficient than grid search, especially for high-dimensional spaces
    • Can discover good configurations with fewer iterations
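Both search strategies are available in scikit-learn; this sketch tunes an SVM on iris, with the grid values and sampling distributions chosen purely for illustration:

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Grid search: evaluates every combination (3 x 3 = 9 configurations)
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}, cv=5)
grid.fit(X, y)

# Random search: samples 10 configurations from continuous distributions
rand = RandomizedSearchCV(
    SVC(), {"C": loguniform(1e-2, 1e2), "gamma": loguniform(1e-3, 1e1)},
    n_iter=10, cv=5, random_state=0)
rand.fit(X, y)

print("grid best:  ", grid.best_params_, grid.best_score_)
print("random best:", rand.best_params_, rand.best_score_)
```

With continuous distributions, random search can land on values a coarse grid would never test, which is one reason it often wins in high-dimensional spaces.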

Advanced Hyperparameter Tuning Methods

  • Bayesian optimization uses probabilistic model to guide search for optimal hyperparameters
    • Builds surrogate model of objective function to predict promising regions
    • Balances exploration of unknown areas with exploitation of known good regions
  • Genetic algorithms evolve population of hyperparameter configurations
    • Applies principles of natural selection to improve configurations over generations
    • Effective for large, complex hyperparameter spaces
  • Automated machine learning (AutoML) platforms automate entire process of hyperparameter tuning
    • Combines multiple optimization techniques to efficiently search hyperparameter space
    • Reduces need for manual intervention in model selection and tuning

Model Selection Criteria

Information Criteria

  • Akaike Information Criterion (AIC) estimates relative quality of statistical models
    • Balances model fit against complexity to prevent overfitting
    • Calculated as $AIC = 2k - 2\ln(\hat{L})$, where $k$ is the number of parameters and $\hat{L}$ is the maximized value of the likelihood function
    • Lower AIC values indicate better models
  • Bayesian Information Criterion (BIC) similar to AIC but penalizes complexity more heavily
    • Calculated as $BIC = \ln(n)\,k - 2\ln(\hat{L})$, where $n$ is the number of observations
    • Tends to favor simpler models compared to AIC
    • Particularly useful for large sample sizes
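The two formulas translate directly into code; the log-likelihoods below are hypothetical numbers chosen to show how BIC's heavier complexity penalty can flip the ranking:

```python
import numpy as np

def aic(k, log_likelihood):
    """AIC = 2k - 2 ln(L_hat), where k is the number of parameters."""
    return 2 * k - 2 * log_likelihood

def bic(k, n, log_likelihood):
    """BIC = ln(n) k - 2 ln(L_hat), where n is the number of observations."""
    return np.log(n) * k - 2 * log_likelihood

# Hypothetical comparison: two models fit to n = 100 observations
n = 100
ll_simple, k_simple = -120.0, 3     # simpler model, slightly worse fit
ll_complex, k_complex = -118.5, 6   # more parameters, slightly better fit

print("AIC simple :", aic(k_simple, ll_simple))      # 246.0
print("AIC complex:", aic(k_complex, ll_complex))    # 249.0
print("BIC simple :", bic(k_simple, n, ll_simple))
print("BIC complex:", bic(k_complex, n, ll_complex))
```

Here both criteria prefer the simpler model (lower is better), but the BIC gap is wider because $\ln(100) \approx 4.6 > 2$, illustrating BIC's stronger pull toward simplicity at large $n$.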

Performance Metrics for Model Evaluation

  • Classification metrics assess performance of categorical prediction models
    • Accuracy measures overall correctness of predictions
    • Precision quantifies proportion of true positive predictions
    • Recall (sensitivity) measures proportion of actual positives correctly identified
    • F1 score provides harmonic mean of precision and recall
  • Regression metrics evaluate continuous prediction models
    • Mean Squared Error (MSE) calculates average squared difference between predictions and actual values
    • Root Mean Squared Error (RMSE) provides interpretable metric in same units as target variable
    • R-squared ($R^2$) measures proportion of variance in dependent variable explained by model
  • Area Under the Receiver Operating Characteristic curve (AUC-ROC) assesses binary classification model performance
    • Plots true positive rate against false positive rate at various threshold settings
    • AUC of 0.5 indicates random guessing, 1.0 indicates perfect classification
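All of these metrics are one-liners in scikit-learn; the toy labels, scores, and regression targets below are made up purely to exercise each function:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, mean_squared_error,
                             precision_score, r2_score, recall_score,
                             roc_auc_score)

# Classification metrics on toy binary predictions
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
print("accuracy :", accuracy_score(y_true, y_pred))   # 0.75
print("precision:", precision_score(y_true, y_pred))  # 0.75
print("recall   :", recall_score(y_true, y_pred))     # 0.75
print("F1       :", f1_score(y_true, y_pred))         # 0.75

# AUC-ROC needs ranked scores (e.g. probabilities), not hard labels
y_scores = [0.9, 0.2, 0.8, 0.4, 0.3, 0.7, 0.6, 0.1]
print("AUC-ROC  :", roc_auc_score(y_true, y_scores))  # 0.9375

# Regression metrics on toy continuous predictions
y_true_r = np.array([3.0, 5.0, 2.5, 7.0])
y_pred_r = np.array([2.5, 5.0, 3.0, 8.0])
mse = mean_squared_error(y_true_r, y_pred_r)
print("MSE :", mse)             # 0.375
print("RMSE:", np.sqrt(mse))    # same units as the target
print("R^2 :", r2_score(y_true_r, y_pred_r))
```

Accuracy, precision, and recall happen to coincide here because this toy example has exactly one false positive and one false negative on balanced classes; on imbalanced data they typically diverge, which is why reporting several metrics matters.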