
🧠 Machine Learning Engineering

Feature Selection Techniques


Why This Matters

Feature selection sits at the heart of building models that actually work in production. You're being tested on your ability to recognize when to apply each technique, why certain methods outperform others in specific contexts, and how feature selection connects to broader concepts like the bias-variance tradeoff, model interpretability, and computational efficiency. These aren't just preprocessing steps—they're strategic decisions that determine whether your model generalizes or fails spectacularly on new data.

The techniques below demonstrate core principles: statistical dependence, regularization theory, information theory, and iterative optimization. Don't just memorize what each method does—know what problem it solves, what assumptions it makes, and when it breaks down. That's what separates engineers who can debug pipelines from those who just copy-paste from tutorials.


Filter Methods: Statistical Screening Before Modeling

Filter methods evaluate features independently of any model, using statistical properties to rank or eliminate candidates. They're fast and model-agnostic, but they ignore feature interactions and can miss complex relationships.

Variance Threshold

  • Removes features with variance below a set cutoff—if a feature barely changes across samples, it can't help distinguish outcomes
  • Zero computational cost relative to model-based methods—operates purely on summary statistics, making it ideal as a first-pass filter
  • Assumes constant features are uninformative—fails when rare values carry signal (e.g., fraud detection where the interesting class is sparse)
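
A minimal sketch of this cutoff-based screening, using scikit-learn's VarianceThreshold on a toy matrix; the 0.01 cutoff is purely illustrative, not a recommended default:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy matrix: the last column is nearly constant across samples
X = np.array([
    [1.0, 5.2, 0.0],
    [2.0, 3.1, 0.0],
    [3.0, 4.8, 0.0],
    [4.0, 5.0, 0.1],
])

# Drop features whose variance falls below the (illustrative) cutoff
selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X)

print(selector.variances_)     # per-feature variances
print(selector.get_support())  # boolean mask of kept features
print(X_reduced.shape)         # (4, 2): the near-constant column is gone
```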

Correlation-based Feature Selection

  • Ranks features by their linear relationship to the target—typically using Pearson's $r$ for continuous variables
  • Penalizes inter-feature correlation to reduce multicollinearity, selecting features that are predictive but not redundant
  • Blind to non-linear relationships—a feature with zero linear correlation might still be highly predictive through interactions or transformations
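
A sketch of correlation-based screening with pandas on synthetic data, where x4 is deliberately a near-copy of x1 to show the redundancy check; all names and noise levels here are made up for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "x3": rng.normal(size=n),
})
# x4 is nearly redundant with x1; the target depends linearly on x1
df["x4"] = 0.9 * df["x1"] + rng.normal(scale=0.1, size=n)
y = 2.0 * df["x1"] + rng.normal(scale=0.5, size=n)

# Rank features by absolute Pearson correlation with the target
target_corr = df.corrwith(y).abs().sort_values(ascending=False)
print(target_corr)

# Inspect inter-feature correlation to spot redundancy (x1 vs. x4 here)
print(df.corr().round(2))
```

A correlation-based selector would keep x1 or x4, but not both, since their mutual correlation is near 1.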

Chi-squared Test

  • Tests independence between categorical features and categorical targets—computes $\chi^2 = \sum \frac{(O - E)^2}{E}$ where $O$ is the observed and $E$ is the expected frequency
  • Requires non-negative feature values—commonly applied after discretization or to count-based features
  • Only valid for classification tasks with categorical inputs—won't help you with continuous predictors or regression problems
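
A minimal sketch with scikit-learn's chi2 scorer inside SelectKBest, on synthetic count-style features; the Poisson setup and k=1 are illustrative:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(42)
n = 300

# Binary target plus non-negative, count-style features
y = rng.integers(0, 2, size=n)
informative = rng.poisson(lam=2 + 3 * y)   # count rate shifts with the class
noise = rng.poisson(lam=3.0, size=(n, 2))  # counts unrelated to the class
X = np.column_stack([informative, noise])

# Keep the k features with the largest chi-squared statistic
selector = SelectKBest(score_func=chi2, k=1)
selector.fit(X, y)

print(selector.scores_)        # chi-squared statistic per feature
print(selector.pvalues_)       # associated p-values
print(selector.get_support())  # the informative count column should be kept
```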

Mutual Information

  • Quantifies information shared between feature and target—measured in bits or nats depending on the log base, capturing both linear and non-linear dependencies
  • Model-free and assumption-free—doesn't require normality, linearity, or any specific distribution
  • Computationally expensive for continuous variables—requires density estimation or binning, and results depend heavily on hyperparameter choices
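
A sketch with scikit-learn's mutual_info_classif on synthetic data, contrasting a linear and a purely quadratic dependency; the estimator is nearest-neighbor based, so the scores shift with n_neighbors and sample size:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(7)
n = 500

x_linear = rng.normal(size=n)
x_nonlinear = rng.uniform(-3, 3, size=n)
x_noise = rng.normal(size=n)

# Target depends on x_linear directly, and on x_nonlinear only through its square,
# so x_nonlinear has roughly zero Pearson correlation with y but nonzero MI
y = (x_linear + x_nonlinear**2 + rng.normal(scale=0.5, size=n) > 2).astype(int)

X = np.column_stack([x_linear, x_nonlinear, x_noise])
mi = mutual_info_classif(X, y, n_neighbors=3, random_state=0)
print(dict(zip(["x_linear", "x_nonlinear", "x_noise"], mi.round(3))))
```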

Compare: Correlation vs. Mutual Information—both measure feature-target relationships, but correlation only captures linear dependence while mutual information detects any statistical relationship. If you suspect non-linear patterns, mutual information is your diagnostic tool; for quick screening with interpretable coefficients, correlation wins.


Wrapper Methods: Let the Model Decide

Wrapper methods use actual model performance to evaluate feature subsets. They're powerful but computationally expensive, essentially treating feature selection as a search problem.

Forward Feature Selection

  • Starts empty and greedily adds the best feature at each step—evaluates model performance with each candidate addition
  • Stops when adding features no longer improves validation metrics—or when you hit a predefined feature budget
  • Computationally scales as $O(n \cdot k)$ where $n$ is the total number of features and $k$ is the number selected—cheaper than exhaustive search but still expensive for wide datasets (see the sketch below)
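
A minimal sketch with scikit-learn's SequentialFeatureSelector; the breast-cancer dataset, the scaler-plus-logistic-regression pipeline, and the budget of five features are all illustrative choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Greedy forward search: add the single best feature at each step,
# judged by cross-validated accuracy of the wrapped model
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
sfs = SequentialFeatureSelector(
    model,
    n_features_to_select=5,   # stop at a predefined feature budget
    direction="forward",
    scoring="accuracy",
    cv=5,
)
sfs.fit(X, y)
print(sfs.get_support(indices=True))  # indices of the five selected features
```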

Backward Feature Elimination

  • Starts with all features and removes the weakest iteratively—measures performance drop from each removal
  • Better at catching feature interactions than forward selection since it starts with the full context
  • Infeasible when $n \gg m$ (features exceed samples)—you can't fit a model with all features to begin with (the sketch below assumes a feasible, low-dimensional case)
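
The same SequentialFeatureSelector API covers backward elimination by flipping the search direction; a sketch mirroring the forward example, with the same illustrative dataset and feature budget:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Start from all 30 features and repeatedly drop the one whose removal
# hurts cross-validated accuracy the least, until only 5 remain
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
sbs = SequentialFeatureSelector(
    model,
    n_features_to_select=5,
    direction="backward",
    scoring="accuracy",
    cv=5,
)
sbs.fit(X, y)
print(sbs.get_support(indices=True))
```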

Recursive Feature Elimination (RFE)

  • Trains a model, ranks features by importance, and prunes the bottom tier—repeats until reaching target feature count
  • Leverages model-specific importance metrics—coefficient magnitudes for linear models, impurity reduction for trees
  • Requires a model with built-in feature ranking—works naturally with linear SVMs and linear or logistic regression, which expose coefficients, but less so with neural networks
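
A minimal RFE sketch pairing scikit-learn's RFE with a linear-kernel SVM, which exposes the coef_ attribute RFE uses for ranking; the dataset, the target of ten features, and the step size are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)  # scaling keeps SVM coefficients comparable

# Train, rank features by |coef_|, prune the weakest, and repeat
estimator = SVC(kernel="linear", C=1.0)
rfe = RFE(estimator, n_features_to_select=10, step=1)
rfe.fit(X_scaled, y)

print(rfe.support_)   # boolean mask of retained features
print(rfe.ranking_)   # 1 = selected; higher numbers were pruned earlier
```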

Compare: Forward Selection vs. Backward Elimination—forward is cheaper and works when you can't fit all features, backward better preserves interaction effects. In interview settings, know that backward elimination with cross-validation is more robust but forward selection is your fallback for ultra-high-dimensional data.


Embedded Methods: Selection During Training

Embedded methods perform feature selection as part of the model training process itself. They balance the efficiency of filters with the performance awareness of wrappers.

Lasso Regularization

  • Adds an $L_1$ penalty $\lambda \sum_i |w_i|$ to the loss function—this geometry encourages sparse solutions where some weights hit exactly zero
  • Automatic feature selection through coefficient shrinkage—features with zero coefficients are effectively removed
  • Struggles with correlated features—tends to arbitrarily select one from a correlated group, which hurts interpretability when you care about all relevant predictors
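
A sketch of Lasso-based selection with scikit-learn's LassoCV, which chooses the regularization strength by cross-validation; the diabetes dataset and the standardization step are illustrative choices:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

data = load_diabetes()
X, y, names = data.data, data.target, data.feature_names

# Standardize so the L1 penalty treats all coefficients on the same scale
X_scaled = StandardScaler().fit_transform(X)

# Cross-validated choice of the penalty strength (called alpha in scikit-learn)
lasso = LassoCV(cv=5, random_state=0).fit(X_scaled, y)

selected = [name for name, w in zip(names, lasso.coef_) if w != 0]
print("alpha:", lasso.alpha_)
print("kept:", selected)  # features whose coefficients were not shrunk to zero
```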

Random Forest Feature Importance

  • Ranks features by mean decrease in impurity (Gini or entropy) averaged across all trees and splits
  • Handles non-linear relationships and interactions natively—no assumptions about feature distributions
  • Biased toward high-cardinality features—categorical variables with many levels get artificially inflated importance scores; use permutation importance for unbiased estimates
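
A sketch contrasting impurity-based importances with permutation importance on held-out data, as the last bullet suggests; the dataset, tree count, and number of permutation repeats are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0
)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Impurity-based importances: fast, but biased toward high-cardinality features
impurity_top = sorted(
    zip(data.feature_names, rf.feature_importances_), key=lambda t: -t[1]
)[:5]
print(impurity_top)

# Permutation importance on held-out data is a less biased cross-check
perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
perm_top = sorted(
    zip(data.feature_names, perm.importances_mean), key=lambda t: -t[1]
)[:5]
print(perm_top)
```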

Compare: Lasso vs. Random Forest Importance—Lasso gives you a sparse linear model in one shot, while RF importance tells you what matters without forcing linearity. For interpretable models with automatic selection, Lasso wins; for understanding complex feature contributions before choosing a final model, RF importance is your exploratory tool.


Dimensionality Reduction: Transform, Don't Select

Dimensionality reduction creates new features rather than selecting existing ones. It's technically distinct from feature selection but solves similar problems—reducing input dimensionality while preserving signal.

Principal Component Analysis (PCA)

  • Projects data onto orthogonal axes that maximize variance—the first principal component captures the direction of greatest spread, and so on
  • Creates uncorrelated features by construction—eliminates multicollinearity entirely, which stabilizes many downstream models
  • Destroys interpretability—principal components are linear combinations of original features, making it impossible to say "feature X drove this prediction"
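
A minimal PCA sketch with scikit-learn, keeping enough components to cover roughly 95% of the variance; the dataset and the 95% target are illustrative:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)

# Standardize first, otherwise high-variance raw features dominate the components
X_scaled = StandardScaler().fit_transform(X)

# Keep as many orthogonal components as needed to explain ~95% of the variance
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)

print(X_scaled.shape, "->", X_pca.shape)       # 30 features -> far fewer components
print(pca.explained_variance_ratio_.round(3))  # variance captured per component

# The new features are uncorrelated by construction
corr = np.corrcoef(X_pca, rowvar=False)
print(np.allclose(corr, np.eye(X_pca.shape[1]), atol=1e-6))  # True
```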

Compare: PCA vs. Lasso—both reduce effective dimensionality, but PCA transforms features while Lasso selects them. Use PCA when you need decorrelated inputs and don't care about feature interpretability; use Lasso when stakeholders need to understand which original variables matter.


Quick Reference Table

Concept | Best Examples
Fast statistical screening | Variance Threshold, Correlation, Chi-squared
Non-linear relationship detection | Mutual Information, Random Forest Importance
Automatic sparsity during training | Lasso Regularization
Model-performance-driven search | RFE, Forward Selection, Backward Elimination
Handling high-cardinality categoricals | Chi-squared, Mutual Information
Multicollinearity reduction | Correlation-based Selection, PCA
Works with any model type | Filter methods (Variance, Correlation, MI)
Requires interpretable features | Lasso, Correlation-based (avoid PCA)

Self-Check Questions

  1. You have a dataset with 10,000 features and 500 samples. Which selection strategies are even feasible here, why does backward elimination fail outright, and what caveats apply to Lasso in this regime?

  2. Compare mutual information and Pearson correlation: if a feature has $r = 0$ but high mutual information with the target, what does that tell you about the relationship?

  3. A stakeholder asks which features drive your model's predictions. You used PCA for dimensionality reduction. What's the problem, and what alternative would you recommend?

  4. When would you choose RFE over Lasso regularization? Consider computational cost, model flexibility, and the type of importance signal each provides.

  5. You notice Random Forest feature importance ranks a categorical variable with 50 levels as most important, but domain experts are skeptical. What's likely happening, and how would you validate the result?