
🥖Linear Modeling Theory Unit 18 Review


18.3 Model Building Strategies


Written by the Fiveable Content Team • Last updated August 2025

Building Linear Models

Systematic Approach to Model Building

Building a linear model isn't something you do in one shot. It's an iterative process where each step informs the next. Here's the general workflow:

  1. Define the problem. What are you trying to predict or explain? A clear research question drives every decision that follows.
  2. Collect and prepare data. Clean your dataset, handle missing values, and format variables appropriately.
  3. Conduct exploratory data analysis (EDA). Understand relationships between variables, spot outliers or influential observations, and let the data guide your variable selection and model specification.
  4. Select variables and specify the model. Choose which predictors to include and what functional form the model should take.
  5. Estimate model parameters. Fit the model using least squares (or a regularized variant).
  6. Assess model fit. Check diagnostics, examine residuals, and evaluate performance metrics.
  7. Validate the model. Test on held-out data or use cross-validation to confirm the model generalizes.
  8. Use the model for prediction or inference.
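The fit-and-check loop in steps 5-7 can be sketched with a toy simple linear regression. This is a minimal illustration in pure Python with invented, noise-free data (y = 2x + 1), not a full modeling workflow:

```python
# Steps 5-7 in miniature: fit by least squares, then inspect residuals.
# Data are made up for illustration; y = 2x + 1 exactly.
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Step 5: closed-form least-squares estimates for a single predictor
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
        / sum((xi - x_bar) ** 2 for xi in x)
intercept = y_bar - slope * x_bar

# Step 6: residuals should show no systematic pattern; here they are all zero
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

print(slope, intercept)                 # 2.0 1.0
print(max(abs(r) for r in residuals))   # 0.0
```

With real data the residuals would not vanish, and examining them is what sends you back to earlier steps in the loop.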

Throughout this process, you're balancing the bias-variance trade-off. More complex models can capture subtle patterns but risk overfitting. Simpler models may miss real signal but tend to generalize better. The principle of parsimony (Occam's razor) says: favor the simpler model when two models perform comparably.

At each iteration, assess model assumptions, check for multicollinearity, examine residuals, and consider whether variable transformations or interaction terms could improve fit and interpretability.

Exploratory Data Analysis and Model Refinement

EDA is where you build intuition about your data before committing to a model form. The key techniques:

  • Univariate exploration: Use histograms, density plots, or box plots to examine the distribution of each variable. Look for skewness, outliers, or unusual features that might call for a transformation (e.g., a log transform for right-skewed data).
  • Bivariate relationships: Scatterplots and correlation matrices reveal pairwise relationships between predictors and the response. You're looking for linear patterns, but also nonlinear ones that might suggest polynomial terms or interactions.
  • Multicollinearity checks: When predictor variables are highly correlated with each other, coefficient estimates become unstable and hard to interpret. Calculate variance inflation factors (VIF) for each predictor. A common rule of thumb is that VIF values above 5 or 10 signal problematic multicollinearity.
  • Residual analysis: After fitting an initial model, plot residuals against fitted values and against each predictor. These plots help you assess linearity, homoscedasticity (constant variance), and independence. If assumptions are violated, consider remedial measures like weighted least squares or robust regression.
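The VIF check above is easy to compute by hand in the two-predictor case, where the R^2 from regressing one predictor on the other is just their squared correlation. A minimal sketch with invented, nearly collinear data:

```python
import math

# VIF sketch for two predictors: VIF = 1 / (1 - R^2), where R^2 here is the
# squared Pearson correlation between them. Data are invented for illustration.
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [1.0, 2.0, 3.0, 5.0]   # nearly a copy of x1 -> strong collinearity

def pearson_r(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    return cov / math.sqrt(sum((ai - ma) ** 2 for ai in a)
                           * sum((bi - mb) ** 2 for bi in b))

r2 = pearson_r(x1, x2) ** 2
vif = 1.0 / (1.0 - r2)   # same value for both predictors in the 2-variable case
print(round(vif, 1))     # well above the rule-of-thumb cutoff of 10
```

With more than two predictors, each VIF comes from regressing that predictor on all the others, but the 1 / (1 - R^2) formula is unchanged.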

Choosing Predictors and Complexity


Variable Selection Methods

Choosing which predictors to include is one of the most consequential decisions in model building. Several formal methods exist:

  • Best subset selection evaluates all possible combinations of predictors and picks the best model according to a criterion like adjusted R^2 or Mallows' C_p. It's thorough but computationally expensive when you have many predictors.
  • Forward stepwise selection starts with no predictors and adds the most significant one at each step until a stopping criterion is met.
  • Backward elimination starts with all predictors and removes the least significant one at each step.
  • Mixed (stepwise) selection combines both directions, adding and removing predictors at each step.
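Forward stepwise selection can be sketched in a few dozen lines. This is a hedged toy implementation in pure Python with invented data (y depends only on x1), using "negligible RSS improvement" as the stopping criterion; real implementations typically stop on a p-value or information-criterion rule instead:

```python
# Toy forward stepwise selection: at each step, add the candidate predictor
# that most reduces the residual sum of squares; stop when the gain is tiny.

def solve(A, b):
    """Solve A beta = b by Gaussian elimination with partial pivoting."""
    m = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, m):
            f = M[r][col] / M[col][col]
            for c in range(col, m + 1):
                M[r][c] -= f * M[col][c]
    beta = [0.0] * m
    for r in range(m - 1, -1, -1):
        beta[r] = (M[r][m] - sum(M[r][c] * beta[c]
                                 for c in range(r + 1, m))) / M[r][r]
    return beta

def rss(cols, y):
    """RSS of OLS with an intercept plus the given predictor columns."""
    X = [[1.0] + [col[i] for col in cols] for i in range(len(y))]
    p = len(X[0])
    XtX = [[sum(row[a] * row[b] for row in X) for b in range(p)] for a in range(p)]
    Xty = [sum(row[a] * yi for row, yi in zip(X, y)) for a in range(p)]
    beta = solve(XtX, Xty)
    return sum((yi - sum(bj * xj for bj, xj in zip(beta, row))) ** 2
               for row, yi in zip(X, y))

# Invented data: only x1 actually drives y
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
x3 = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]
y  = [3.0 * v + 1.0 for v in x1]

all_cols = {"x1": x1, "x2": x2, "x3": x3}
remaining = set(all_cols)
selected, current = [], rss([], y)      # start from the intercept-only model
while remaining:
    name, new_rss = min(((nm, rss([all_cols[v] for v in selected + [nm]], y))
                         for nm in remaining), key=lambda t: t[1])
    if current - new_rss < 1e-8:        # stopping rule: negligible improvement
        break
    selected.append(name)
    current = new_rss
    remaining.remove(name)

print(selected)   # x1 alone explains y exactly, so selection stops there
```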

None of these methods should operate in a vacuum. Domain knowledge matters. If theory or prior research says a variable should matter, include it as a candidate even if a purely data-driven method might skip it.

The hierarchy principle is also important: if you include an interaction term (say X1 × X2), keep both main effects (X1 and X2) in the model, even if one isn't individually significant.

For high-dimensional settings where the number of predictors is large relative to sample size, regularization methods become essential:

  • Ridge regression adds an L2 penalty that shrinks coefficients toward zero but doesn't set any exactly to zero.
  • Lasso adds an L1 penalty that can shrink some coefficients all the way to zero, effectively performing variable selection.
  • Elastic net combines both penalties, which is particularly useful when predictors are highly correlated.
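The shrinkage behavior of ridge is easiest to see in the simplest case: a single centered predictor, where the ridge estimate has the closed form beta = Sxy / (Sxx + lambda). A minimal sketch with invented data:

```python
# Ridge shrinkage in the one-predictor case with centered data:
#   beta_ridge = Sxy / (Sxx + lam)
x = [-2.0, -1.0, 0.0, 1.0, 2.0]     # already centered
y = [-4.1, -1.9, 0.0, 2.1, 3.9]     # roughly y = 2x (invented)

sxy = sum(xi * yi for xi, yi in zip(x, y))
sxx = sum(xi * xi for xi in x)

def ridge_beta(lam):
    return sxy / (sxx + lam)

for lam in (0.0, 1.0, 10.0, 100.0):
    print(lam, round(ridge_beta(lam), 3))
# The coefficient shrinks toward (but never exactly to) zero as lam grows,
# which is why ridge keeps all predictors while lasso can drop some.
```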

Bias-Variance Trade-off and Model Complexity

The bias-variance trade-off is the central tension in model selection. A model that's too simple (high bias) will systematically miss real patterns. A model that's too complex (high variance) will fit noise in the training data and perform poorly on new observations.

To find the right balance:

  • Cross-validation (discussed below) gives you an honest estimate of how well a model generalizes by testing on data the model hasn't seen.
  • Information criteria like AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) penalize model complexity directly. BIC penalizes more heavily than AIC, so it tends to favor simpler models.
  • Stability assessment using bootstrap resampling or permutation tests can reveal whether your selected predictors are robust. If a variable enters the model in some bootstrap samples but not others, that's a sign it may not be a reliable predictor.
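For a Gaussian linear model, both criteria reduce to simple formulas in the residual sum of squares: AIC = n ln(RSS/n) + 2k and BIC = n ln(RSS/n) + k ln(n), where k counts estimated parameters. A sketch with invented RSS values showing why BIC leans simpler:

```python
import math

# AIC/BIC comparison for two hypothetical fits to n = 100 observations.
# RSS values and parameter counts are invented for illustration.
n = 100
rss_simple, k_simple = 520.0, 3    # e.g. intercept + 1 predictor + variance
rss_complex, k_complex = 500.0, 8  # 5 extra predictors, slightly lower RSS

def aic(rss, k):
    return n * math.log(rss / n) + 2 * k

def bic(rss, k):
    return n * math.log(rss / n) + k * math.log(n)

# Per extra parameter, BIC charges ln(100) ~ 4.6 versus AIC's 2, so the
# complex model is penalized harder under BIC.
print(aic(rss_complex, k_complex) - aic(rss_simple, k_simple))
print(bic(rss_complex, k_complex) - bic(rss_simple, k_simple))
```

Here both criteria favor the simpler model (positive differences), but the BIC gap is much larger; whenever ln(n) > 2, i.e. n > 7 or so, BIC penalizes each parameter more than AIC does.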

Validating Linear Models


Cross-Validation Techniques

Cross-validation is the standard tool for estimating how well your model will perform on unseen data. The core idea: split your data, train on part of it, and test on the rest.

Common methods:

  • k-fold cross-validation: Split the data into k equally sized folds. Train on k - 1 folds, test on the remaining fold, and rotate through all k folds. 5-fold and 10-fold are the most common choices.
  • Leave-one-out cross-validation (LOOCV): A special case where k = n (each observation gets its own fold). It's nearly unbiased but computationally expensive and can have high variance.
  • Repeated k-fold: Run k-fold multiple times with different random splits and average the results for a more stable estimate.
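The k-fold procedure can be sketched directly. This toy example uses noise-free invented data (y = 2x + 1) and simple deterministic folds; real code would shuffle the indices before splitting:

```python
# 5-fold cross-validation for simple linear regression on noiseless toy data.
x = [float(i) for i in range(10)]
y = [2.0 * xi + 1.0 for xi in x]
k = 5

def fit(xs, ys):
    """Closed-form least-squares fit for one predictor: returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((a - mx) * (c - my) for a, c in zip(xs, ys)) \
        / sum((a - mx) ** 2 for a in xs)
    return my - b * mx, b

fold_mse = []
for fold in range(k):
    test_idx = set(range(fold, len(x), k))      # every k-th point, rotated
    xtr = [x[i] for i in range(len(x)) if i not in test_idx]
    ytr = [y[i] for i in range(len(x)) if i not in test_idx]
    a, b = fit(xtr, ytr)                        # train on k - 1 folds
    fold_mse.append(sum((y[i] - (a + b * x[i])) ** 2 for i in test_idx)
                    / len(test_idx))            # test on the held-out fold

cv_mse = sum(fold_mse) / k
print(cv_mse)   # ~0 here, since the data are exactly linear
```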

Choosing a performance metric depends on your problem:

  • MSE (mean squared error) or RMSE (root MSE) for general prediction accuracy
  • MAE (mean absolute error) if you want a metric less sensitive to large errors
  • R^2 for proportion of variance explained
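These metrics are all one-liners; a sketch with a small invented prediction vector:

```python
import math

# The metrics above, implemented directly on toy values for illustration.
y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 4.0]

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_pred)]

mse = sum(e ** 2 for e in errors) / n
rmse = math.sqrt(mse)
mae = sum(abs(e) for e in errors) / n
y_bar = sum(y_true) / n
r2 = 1 - sum(e ** 2 for e in errors) / sum((t - y_bar) ** 2 for t in y_true)

print(mse, rmse, mae, r2)
```

Note how the single error of size 1 inflates RMSE above MAE; with several large errors the gap widens, which is the "less sensitive to large errors" property of MAE.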

Nested cross-validation is necessary when you're simultaneously selecting a model (or tuning hyperparameters) and estimating performance. The outer loop estimates performance; the inner loop handles model selection. This prevents data leakage, where information from the test set inadvertently influences model choices.

Residual Diagnostics and Model Assumptions

After fitting your model, residual diagnostics tell you whether the model's assumptions actually hold.

  • Residuals vs. fitted values: Look for patterns. A random scatter suggests the linearity and constant variance assumptions are met. A funnel shape indicates heteroscedasticity. A curve suggests the relationship isn't purely linear.
  • Residuals vs. predictor variables: Similar logic, but checked against each predictor individually. This can reveal which predictor is causing a problem.
  • Normality of residuals: Use a Q-Q plot (quantile-quantile plot) to visually assess whether residuals follow a normal distribution. Formal tests like Shapiro-Wilk or Kolmogorov-Smirnov can supplement this, though they can be overly sensitive with large samples.
  • Autocorrelation: The Durbin-Watson test checks whether residuals are correlated with each other (common in time series data). You can also examine ACF and PACF plots. Autocorrelated errors lead to biased standard errors and unreliable inference.
  • Influential observations: Use leverage values to identify points with unusual predictor values, Cook's distance to measure how much the entire model changes when a point is removed, and DFFITS for observation-level influence. A single influential point can dramatically alter your results, so always investigate these.
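The Durbin-Watson statistic from the list above is simple enough to compute by hand: DW = Σ(e_t − e_{t−1})² / Σe_t². Values near 2 suggest no autocorrelation, values near 0 positive autocorrelation, and values near 4 negative autocorrelation. A sketch with two contrived residual sequences:

```python
# Durbin-Watson statistic on invented residual sequences.
def durbin_watson(e):
    num = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
    return num / sum(v ** 2 for v in e)

pos_corr = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]   # residuals track their neighbors
neg_corr = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]   # residuals alternate sign

print(durbin_watson(pos_corr))   # well below 2: positive autocorrelation
print(durbin_watson(neg_corr))   # well above 2: negative autocorrelation
```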

Linear Regression Strategies

Subset Selection and Regularization

This section pulls together the specific algorithms for choosing predictors:

  1. Best subset selection is the most comprehensive approach. For p predictors, it evaluates 2^p models. That's fine when p is small (say, under 20), but becomes infeasible for larger predictor sets.
  2. Forward stepwise selection is computationally cheaper. It fits 1 + p(p+1)/2 models instead of 2^p. The trade-off is that it doesn't guarantee finding the globally best model since it can't revisit earlier decisions.
  3. Backward elimination works similarly but in reverse. It requires that n > p (more observations than predictors), which forward selection does not.
  4. Ridge regression penalizes the sum of squared coefficients (L2 penalty), shrinking them toward zero. It handles multicollinearity well but keeps all predictors in the model.
  5. Lasso penalizes the sum of absolute coefficients (L1 penalty). Its key advantage is that it can zero out coefficients entirely, performing automatic variable selection.
  6. Elastic net uses a weighted combination of L1 and L2 penalties. It tends to outperform lasso when groups of predictors are correlated, because lasso might arbitrarily select one from a correlated group while elastic net keeps the group together.
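The model counts in items 1 and 2 are worth seeing side by side; a quick arithmetic check:

```python
# Best subset fits 2^p models; forward stepwise fits only 1 + p(p+1)/2.
for p in (10, 20, 30):
    best_subset = 2 ** p
    forward = 1 + p * (p + 1) // 2
    print(p, best_subset, forward)
# At p = 20: 1,048,576 models versus 211, which is why best subset
# becomes infeasible while stepwise stays cheap.
```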

Dimension Reduction Techniques

When predictors are highly correlated, another strategy is to reduce the predictor space before fitting the model.

Principal Component Regression (PCR):

  • Compute principal components of the predictor matrix. These are uncorrelated linear combinations of the original predictors, ordered by how much variance they capture.
  • Fit a linear model using the first several principal components as predictors instead of the original variables.
  • The limitation: principal components maximize variance in X, not covariance with Y. A component that explains a lot of predictor variance might be irrelevant to the response.
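The two PCR steps can be sketched end to end in the two-predictor case, where the leading principal component of the 2x2 covariance matrix has a closed form. The data here are invented, with x2 a duplicate of x1 to mimic extreme collinearity:

```python
import math

# Minimal PCR sketch: center X, take the leading principal component of the
# 2x2 covariance matrix, then regress y on the component scores.
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [1.0, 2.0, 3.0, 4.0]          # duplicate of x1: extreme collinearity
y  = [2.0 * v + 3.0 for v in x1]   # invented response

n = len(y)
m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
c1 = [v - m1 for v in x1]
c2 = [v - m2 for v in x2]

# 2x2 scatter-matrix entries (the 1/(n-1) factor cancels in the eigenvector)
a = sum(v * v for v in c1)
b = sum(u * v for u, v in zip(c1, c2))
c = sum(v * v for v in c2)

# Leading eigenvalue/eigenvector of [[a, b], [b, c]] in closed form (b != 0)
lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
norm = math.hypot(b, lam - a)
w = (b / norm, (lam - a) / norm)

# Component scores, then a one-variable regression of y on the scores
t = [w[0] * u + w[1] * v for u, v in zip(c1, c2)]
slope = sum(ti * (yi - my) for ti, yi in zip(t, y)) / sum(ti * ti for ti in t)
fitted = [my + slope * ti for ti in t]

rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
print(round(rss, 10))   # ~0: one component captures everything here
```

Because the two predictors are perfectly correlated, a single component carries all the information; OLS on both original predictors would face a singular design matrix, while PCR sidesteps it.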

Partial Least Squares Regression (PLSR):

  • Finds latent variables that maximize the covariance between the predictors and the response, making it more targeted than PCR.
  • Often performs better than PCR when the goal is prediction, because the dimension reduction is guided by the response variable.

For both methods, use cross-validation to determine how many components or latent variables to retain. Too few and you underfit; too many and you lose the benefit of dimension reduction.