Importance of Variable Selection
Variable selection is about deciding which independent variables belong in your regression model. Get this wrong, and your estimates can be biased, imprecise, or misleading. Get it right, and you end up with a parsimonious model: one that captures the essential relationships without unnecessary clutter.
The core tension is straightforward. Leave out a variable that matters, and you get biased coefficients. Throw in variables that don't matter, and you lose precision. The strategies covered here help you navigate that tradeoff.
Consequences of Omitted Variables
Omitted variable bias (OVB) occurs when you leave out a variable that is correlated with both the dependent variable and at least one included independent variable. When this happens, the coefficients on your included variables absorb the effect of the missing variable, making them biased and inconsistent.
For example, suppose you're estimating the relationship between income and job satisfaction but you leave out education level. If education is correlated with both income and job satisfaction, the coefficient on income will partly reflect the effect of education. Your estimate of income's effect will be wrong.
Two conditions must both hold for OVB to occur:
- The omitted variable is correlated with the dependent variable (it actually affects y)
- The omitted variable is correlated with one or more included independent variables (it's not orthogonal to your x's)
If either condition fails, omitting the variable doesn't cause bias. This is worth remembering because not every excluded variable creates a problem.
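The direction and size of the bias can be checked with a short simulation. The sketch below (plain NumPy; all variable names and coefficients are invented for illustration) fits the income/job-satisfaction example with and without education:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 20_000

# Education affects both income and satisfaction: the two OVB conditions hold.
educ = rng.normal(12.0, 2.0, n)
income = 2.0 * educ + rng.normal(0.0, 1.0, n)
satisfaction = 1.0 * income + 3.0 * educ + rng.normal(0.0, 1.0, n)

def ols(y, *predictors):
    """OLS coefficients with an intercept prepended."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_full = ols(satisfaction, income, educ)   # income coefficient lands near the true 1.0
b_short = ols(satisfaction, income)        # income coefficient absorbs education's effect
print(b_full[1], b_short[1])
```

Omitting education pushes the income coefficient well above its true value, exactly as the two conditions predict.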
Consequences of Irrelevant Variables
Reduced Precision of Estimates
Including variables that have no real relationship with the dependent variable won't bias your coefficients, but it will hurt their precision. Each unnecessary variable eats up a degree of freedom and inflates the standard errors of your other estimates.
Larger standard errors mean wider confidence intervals and lower t-statistics, making it harder to detect effects that genuinely exist. Think of it as adding noise to your model.
For instance, including "number of pets owned" in a housing price model probably has no real effect on price. But its presence increases the standard errors on variables that do matter, like square footage and location.
Multicollinearity Issues
Irrelevant variables can also introduce or worsen multicollinearity, which is high correlation among independent variables. When predictors are highly correlated with each other, the model struggles to separate their individual effects.
Consequences of multicollinearity include:
- Unstable coefficient estimates that swing dramatically with small data changes
- Inflated standard errors
- Difficulty interpreting which variable is driving the result
For example, including both the number of bedrooms and the number of bathrooms in a housing price model can create multicollinearity if those two variables move closely together. The model can't easily tell which one is doing the work.
Strategies for Variable Selection
Prior Knowledge and Theory
Your first and most important tool is economic theory. Before running any statistical procedure, think about what variables should matter based on established models and prior empirical work.
For example, when modeling economic growth, growth theory points you toward investment rates, human capital measures, and technological progress. Starting from theory keeps you grounded and helps you justify your choices to readers.
Theory also helps you avoid two common mistakes: including variables just because they're available, and omitting variables you know are important simply because they're hard to measure.

Stepwise Selection Methods
Stepwise methods are data-driven approaches that add or remove variables based on statistical criteria. They're useful as a complement to theory, but they have well-known limitations (they can overfit, and results can depend on the order variables enter the model). Use them with caution.
Forward Selection
- Start with an empty model (no predictors).
- Test each candidate variable individually and add the one that improves model fit the most (e.g., the largest increase in R² or the lowest p-value).
- With that variable in the model, test each remaining candidate and add the next best one.
- Stop when no remaining variable significantly improves the model, or when you hit a predetermined stopping rule.
Example: When predicting consumer spending, forward selection might first add income, then age, then education, stopping once additional variables no longer contribute meaningfully.
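The greedy loop is easy to implement directly. A minimal sketch (NumPy only; the 1%-of-SSE stopping rule and the variable names are arbitrary choices for illustration):

```python
import numpy as np

def sse(y, X):
    """Sum of squared OLS residuals for design matrix X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

def forward_select(y, cands, names, min_gain=0.01):
    """Greedily add the column that most reduces SSE; stop when the best
    remaining candidate improves SSE by less than min_gain (relative)."""
    n = len(y)
    chosen, remaining = [], list(range(cands.shape[1]))
    cols = [np.ones(n)]                      # start from the intercept-only model
    best = sse(y, np.column_stack(cols))
    while remaining:
        trial = {j: sse(y, np.column_stack(cols + [cands[:, j]])) for j in remaining}
        j = min(trial, key=trial.get)
        if trial[j] > best * (1.0 - min_gain):   # improvement too small: stop
            break
        chosen.append(j)
        remaining.remove(j)
        cols.append(cands[:, j])
        best = trial[j]
    return [names[j] for j in chosen]

# Illustrative data: spending driven mostly by income, then age, then education.
rng = np.random.default_rng(7)
n = 2_000
X = rng.normal(size=(n, 4))                  # income, age, education, pure noise
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(size=n)
print(forward_select(y, X, ["income", "age", "education", "noise"]))
```

On this data the procedure picks income first and never adds the noise column, mirroring the consumer-spending example above.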
Backward Elimination
- Start with a full model containing all candidate predictors.
- Identify the variable with the highest p-value (least statistically significant).
- If that p-value exceeds your threshold (e.g., 0.10), remove the variable.
- Re-estimate the model and repeat until all remaining variables are statistically significant.
This approach is often preferred when you have a manageable number of predictors, since it starts by considering all variables together and accounts for their joint effects.
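The same loop fits in a few lines of code. A sketch (NumPy only; two-sided p-values use a normal approximation to the t distribution, which is reasonable at the sample size used here, and the housing-variable names are invented):

```python
import math
import numpy as np

def ols_p_values(y, X):
    """Coefficients and two-sided p-values (normal approximation to the t dist)."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = float(resid @ resid) / (n - k)
    se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
    t = beta / se
    p = np.array([1.0 - math.erf(abs(ti) / math.sqrt(2.0)) for ti in t])
    return beta, p

def backward_eliminate(y, cands, names, alpha=0.10):
    """Drop the least significant predictor until all p-values are below alpha."""
    keep = list(range(cands.shape[1]))
    while keep:
        X = np.column_stack([np.ones(len(y))] + [cands[:, j] for j in keep])
        _, p = ols_p_values(y, X)
        p = p[1:]                            # ignore the intercept
        worst = int(np.argmax(p))
        if p[worst] <= alpha:
            break
        keep.pop(worst)
    return [names[j] for j in keep]

# Illustrative data: two real predictors plus one pure-noise candidate.
rng = np.random.default_rng(3)
n = 1_000
X = rng.normal(size=(n, 3))
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(size=n)
print(backward_eliminate(y, X, ["sqft", "location", "pets"]))
```

The two genuine predictors survive every round because their p-values are effectively zero.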
Stepwise Regression
Stepwise regression combines both approaches. At each iteration, it can add the most significant variable and remove any variable that has become insignificant after the new addition. This flexibility helps because adding a new variable can change the significance of variables already in the model.
The process continues until no variables qualify for addition or removal based on your chosen significance thresholds.
Information Criteria Approaches
Information criteria offer a more principled way to compare models than stepwise p-value testing. They balance goodness of fit against model complexity in a single number.
Akaike Information Criterion (AIC)
AIC estimates the relative information lost by a given model. You calculate it for each candidate model and choose the one with the lowest AIC. The formula penalizes additional parameters, so adding a variable only helps your AIC if it improves fit enough to offset the complexity penalty.
AIC works well for prediction-focused modeling, but it can sometimes favor slightly more complex models.
Bayesian Information Criterion (BIC)
BIC is similar to AIC but applies a stronger penalty for additional parameters, especially as sample size grows. This means BIC tends to select simpler models than AIC.
AIC vs. BIC: Both use lower values to indicate better models. AIC is more lenient with extra variables; BIC punishes complexity more heavily. If you're prioritizing parsimony, lean toward BIC. If you're prioritizing predictive accuracy, AIC may be more appropriate.
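Under Gaussian errors, both criteria can be computed from the residual sum of squares, dropping constants that are identical across models. A sketch, assuming k counts every estimated coefficient including the intercept:

```python
import numpy as np

def aic_bic(y, X):
    """Gaussian AIC and BIC, up to an additive constant shared by all models."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = float(np.sum((y - X @ beta) ** 2))
    fit = n * np.log(sse / n)          # -2 * log-likelihood, up to a constant
    return fit + 2 * k, fit + k * np.log(n)

# Compare the true model against one that omits a genuinely relevant variable.
rng = np.random.default_rng(1)
n = 300
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.5 * x1 + 1.0 * x2 + rng.normal(size=n)

full = np.column_stack([np.ones(n), x1, x2])
short = np.column_stack([np.ones(n), x1])      # underfit: drops x2
aic_f, bic_f = aic_bic(y, full)
aic_s, bic_s = aic_bic(y, short)
print(aic_f < aic_s, bic_f < bic_s)
```

Both criteria prefer the full model here because dropping x2 worsens fit far more than the complexity penalty saves; note that BIC's heavier penalty is visible in the fixed gap k·(ln n − 2) between the two numbers for any one model.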
Regularization Techniques
Regularization methods modify the OLS estimation by adding a penalty for large coefficients. This shrinks estimates and can improve both stability and prediction.
Ridge Regression
Ridge regression adds a penalty proportional to the squared magnitude of the coefficients (the L2 norm) to the OLS objective function. This shrinks all coefficients toward zero but never sets any exactly to zero.
Ridge is especially helpful when you have multicollinearity. By shrinking correlated coefficients, it stabilizes estimates that would otherwise be erratic under standard OLS.
Lasso Regression
Lasso (Least Absolute Shrinkage and Selection Operator) uses a penalty proportional to the absolute value of the coefficients (the L1 norm). The key difference from ridge: lasso can shrink some coefficients all the way to exactly zero, effectively performing variable selection automatically.
This makes lasso particularly useful when you suspect many of your candidate predictors are irrelevant and you want the model to identify the important ones for you.
Ridge vs. Lasso: Ridge shrinks coefficients but keeps all variables. Lasso shrinks and eliminates variables. If you want a sparse model with fewer predictors, lasso is the better choice. If you want to keep all predictors but reduce instability from multicollinearity, use ridge.
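The contrast is easy to see in a toy implementation. The sketch below (NumPy only; it assumes standardized predictors with y centered so no intercept is needed, and the penalty lam is picked by hand rather than cross-validated) uses the closed-form ridge solution and cyclic coordinate descent for lasso:

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge: solve (X'X + lam*I) b = X'y."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

def lasso(X, y, lam, sweeps=200):
    """Lasso via cyclic coordinate descent with soft-thresholding."""
    beta = np.zeros(X.shape[1])
    col_ss = (X ** 2).sum(axis=0)
    for _ in range(sweeps):
        for j in range(X.shape[1]):
            r = y - X @ beta + X[:, j] * beta[j]      # partial residual
            z = X[:, j] @ r
            beta[j] = np.sign(z) * max(abs(z) - lam, 0.0) / col_ss[j]
    return beta

# Three candidate predictors; only the first actually matters.
rng = np.random.default_rng(5)
n = 400
X = rng.normal(size=(n, 3))
y = 2.0 * X[:, 0] + rng.normal(size=n)

b_ridge = ridge(X, y, lam=10.0)
b_lasso = lasso(X, y, lam=80.0)
print(b_ridge, b_lasso)
```

Ridge leaves small nonzero coefficients on the irrelevant predictors, while lasso zeroes them out exactly, which is the sparsity property described above.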
Cross-Validation for Model Selection
Cross-validation helps you evaluate how well a model will perform on data it hasn't seen. This is critical because a model can fit your current data well but predict poorly on new observations (overfitting).
The most common approach is k-fold cross-validation:
- Split your data into k equally sized subsets (folds). A common choice is k = 5 or k = 10.
- Train the model on k - 1 folds and test it on the remaining fold.
- Repeat this process k times, each time holding out a different fold.
- Average the prediction error across all k rounds.
The model (or variable selection method) with the lowest average prediction error is preferred. Cross-validation is especially valuable when comparing models built using different selection strategies.
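The whole procedure fits in a short function. A sketch for OLS models (NumPy only; fold assignment is randomized with a fixed seed, and the comparison data are simulated for illustration):

```python
import numpy as np

def kfold_mse(X, y, k=5, seed=0):
    """Average held-out mean squared error over k folds for an OLS model."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)   # step 1: split
    errs = []
    for i in range(k):                                   # steps 2-3: rotate folds
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        errs.append(np.mean((y[test] - X[test] @ beta) ** 2))
    return float(np.mean(errs))                          # step 4: average

# Compare a model that uses the real predictor against an intercept-only model.
rng = np.random.default_rng(11)
n = 500
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)

with_x = np.column_stack([np.ones(n), x])
intercept_only = np.ones((n, 1))
print(kfold_mse(with_x, y), kfold_mse(intercept_only, y))
```

The model containing the genuine predictor wins on out-of-fold error, which is the comparison cross-validation is designed to make.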
Challenges in Variable Selection
Collinearity Among Predictors
Even after careful selection, your chosen variables may still be correlated with each other. Collinearity doesn't violate OLS assumptions, but it makes individual coefficient estimates unreliable and sensitive to small changes in the data.
You can detect collinearity by examining correlation matrices or computing variance inflation factors (VIFs). A common rule of thumb is that VIF values above 5 or 10 suggest problematic collinearity.
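Each VIF is just 1 / (1 − R²_j), where R²_j comes from regressing predictor j on all the other predictors. A sketch (NumPy only; the housing variable names are invented for illustration):

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of predictor matrix X."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        xj = X[:, j]
        # Regress column j on an intercept plus all remaining columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, xj, rcond=None)
        resid = xj - others @ beta
        r2 = 1.0 - float(resid @ resid) / float(np.sum((xj - xj.mean()) ** 2))
        out[j] = 1.0 / (1.0 - r2)
    return out

# Two nearly collinear predictors plus one independent predictor.
rng = np.random.default_rng(2)
n = 500
bedrooms = rng.normal(size=n)
bathrooms = bedrooms + 0.1 * rng.normal(size=n)
sqft = rng.normal(size=n)
print(vif(np.column_stack([bedrooms, bathrooms, sqft])))
```

The collinear pair shows VIFs far above the 5-to-10 rule of thumb, while the independent predictor stays near 1.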

High-Dimensional Data
When the number of candidate predictors is large relative to your sample size, traditional methods like stepwise regression tend to break down. The risk of overfitting increases dramatically because the model has too many opportunities to fit noise rather than signal.
Regularization techniques (ridge, lasso) and cross-validation become especially important in these settings. In extreme cases, such as datasets where predictors outnumber observations, standard OLS can't even be estimated, and regularization is required.
Interpreting Selected Models
Coefficient Interpretation
Once you've selected your variables, interpreting the coefficients follows the standard rules. In a linear model, each coefficient represents the expected change in y for a one-unit change in that predictor, holding all other included variables constant.
For example, if a salary model yields a coefficient of 5,000 on years of education, that means each additional year of education is associated with a $5,000 increase in salary, controlling for the other variables in the model (like experience).
Always pay attention to units and scale. A coefficient of 0.002 might look tiny, but if the variable is measured in millions of dollars, the effect is substantial.
Model Fit and Predictive Power
Common measures for evaluating your selected model:
- R² (coefficient of determination): The proportion of variation in y explained by the model. An R² of 0.70 means 70% of the variation is accounted for.
- Adjusted R²: Adjusts for the number of predictors. Unlike regular R², it can decrease when you add irrelevant variables, making it more useful for model comparison.
- RMSE (root mean squared error): Measures average prediction error in the same units as y. Lower is better.
For assessing predictive power specifically, rely on out-of-sample methods like cross-validation rather than in-sample fit statistics, which can be misleadingly optimistic.
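All three measures come straight from the residuals. A sketch, assuming k counts every estimated coefficient including the intercept:

```python
import numpy as np

def fit_metrics(y, X):
    """Return R-squared, adjusted R-squared, and RMSE for an OLS fit."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sse = float(resid @ resid)
    tss = float(np.sum((y - y.mean()) ** 2))
    r2 = 1.0 - sse / tss
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k)
    rmse = float(np.sqrt(sse / n))
    return r2, adj_r2, rmse

# One real predictor, then the same model padded with five irrelevant columns.
rng = np.random.default_rng(9)
n = 200
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
X_junk = np.column_stack([X, rng.normal(size=(n, 5))])

r2, adj, rmse = fit_metrics(y, X)
r2_j, adj_j, _ = fit_metrics(y, X_junk)
print(r2, adj, r2_j, adj_j)
```

Note that plain R² never falls when columns are added, which is exactly why adjusted R² (and out-of-sample checks) are the better comparison tools.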
Best Practices in Variable Selection
Balancing Model Complexity and Interpretability
Simpler models are easier to explain, communicate, and defend. More complex models may fit the data better but can obscure the key relationships you're trying to understand.
The right balance depends on your goal. If you're writing a policy brief, a clean model with a few well-justified predictors is usually more persuasive. If you're building a pure forecasting tool, predictive accuracy may matter more than interpretability.
Reporting the Variable Selection Process
Transparency matters. When you write up your results, clearly describe:
- What candidate variables you started with and why
- Which selection method(s) you used
- Whether your choices were driven by theory, data, or both
- How sensitive your results are to alternative specifications
Running sensitivity analyses with different variable sets or selection methods strengthens your findings. If your main conclusions hold across multiple specifications, readers can be more confident in your results.