
🧠 Machine Learning Engineering

Feature Selection Techniques


Why This Matters

Feature selection sits at the heart of building models that actually work in production. You're being tested on your ability to recognize when to apply each technique, why certain methods outperform others in specific contexts, and how feature selection connects to broader concepts like the bias-variance tradeoff, model interpretability, and computational efficiency. These aren't just preprocessing steps—they're strategic decisions that determine whether your model generalizes or fails spectacularly on new data.

The techniques below demonstrate core principles: statistical dependence, regularization theory, information theory, and iterative optimization. Don't just memorize what each method does—know what problem it solves, what assumptions it makes, and when it breaks down. That's what separates engineers who can debug pipelines from those who just copy-paste from tutorials.


Filter Methods: Statistical Screening Before Modeling

Filter methods evaluate features independently of any model, using statistical properties to rank or eliminate candidates. They're fast and model-agnostic, but they ignore feature interactions and can miss complex relationships.

Variance Threshold

  • Removes features with variance below a set cutoff—if a feature barely changes across samples, it can't help distinguish outcomes
  • Zero computational cost relative to model-based methods—operates purely on summary statistics, making it ideal as a first-pass filter
  • Assumes constant features are uninformative—fails when rare values carry signal (e.g., fraud detection where the interesting class is sparse)
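
A minimal sketch of this cutoff-based screening, using scikit-learn's VarianceThreshold on a toy matrix; the 0.01 cutoff is purely illustrative, not a recommended default:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy matrix: the last column is nearly constant across samples
X = np.array([
    [1.0, 5.2, 0.0],
    [2.0, 3.1, 0.0],
    [3.0, 4.8, 0.0],
    [4.0, 5.0, 0.1],
])

# Drop features whose variance falls below the (illustrative) cutoff
selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X)

print(selector.variances_)     # per-feature variances
print(selector.get_support())  # boolean mask of kept features
print(X_reduced.shape)         # (4, 2): the near-constant column is gone
```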

Correlation-based Feature Selection

  • Ranks features by their linear relationship to the target—typically using Pearson's $r$ for continuous variables
  • Penalizes inter-feature correlation to reduce multicollinearity, selecting features that are predictive but not redundant
  • Blind to non-linear relationships—a feature with zero linear correlation might still be highly predictive through interactions or transformations
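
A sketch of correlation-based screening with pandas on synthetic data, where x4 is deliberately a near-copy of x1 to show the redundancy check; all names and noise levels here are made up for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "x3": rng.normal(size=n),
})
# x4 is nearly redundant with x1; the target depends linearly on x1
df["x4"] = 0.9 * df["x1"] + rng.normal(scale=0.1, size=n)
y = 2.0 * df["x1"] + rng.normal(scale=0.5, size=n)

# Rank features by absolute Pearson correlation with the target
target_corr = df.corrwith(y).abs().sort_values(ascending=False)
print(target_corr)

# Inspect inter-feature correlation to spot redundancy (x1 vs. x4 here)
print(df.corr().round(2))
```

A correlation-based selector would keep x1 or x4, but not both, since their mutual correlation is near 1.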

Chi-squared Test

  • Tests independence between categorical features and categorical targets—computes $\chi^2 = \sum \frac{(O - E)^2}{E}$ where $O$ is the observed and $E$ is the expected frequency
  • Requires non-negative feature values—commonly applied after discretization or to count-based features
  • Only valid for classification tasks with categorical inputs—won't help you with continuous predictors or regression problems
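
A minimal sketch with scikit-learn's chi2 scorer inside SelectKBest, on synthetic count-style features; the Poisson setup and k=1 are illustrative:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(42)
n = 300

# Binary target plus non-negative, count-style features
y = rng.integers(0, 2, size=n)
informative = rng.poisson(lam=2 + 3 * y)   # count rate shifts with the class
noise = rng.poisson(lam=3.0, size=(n, 2))  # counts unrelated to the class
X = np.column_stack([informative, noise])

# Keep the k features with the largest chi-squared statistic
selector = SelectKBest(score_func=chi2, k=1)
selector.fit(X, y)

print(selector.scores_)        # chi-squared statistic per feature
print(selector.pvalues_)       # associated p-values
print(selector.get_support())  # the informative count column should be kept
```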

Mutual Information

  • Quantifies information shared between feature and target—measured in bits or nats depending on the log base, capturing both linear and non-linear dependencies
  • Model-free and assumption-free—doesn't require normality, linearity, or any specific distribution
  • Computationally expensive for continuous variables—requires density estimation or binning, and results depend heavily on hyperparameter choices
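
A sketch with scikit-learn's mutual_info_classif on synthetic data, contrasting a linear and a purely quadratic dependency; the estimator is nearest-neighbor based, so the scores shift with n_neighbors and sample size:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(7)
n = 500

x_linear = rng.normal(size=n)
x_nonlinear = rng.uniform(-3, 3, size=n)
x_noise = rng.normal(size=n)

# Target depends on x_linear directly, and on x_nonlinear only through its square,
# so x_nonlinear has roughly zero Pearson correlation with y but nonzero MI
y = (x_linear + x_nonlinear**2 + rng.normal(scale=0.5, size=n) > 2).astype(int)

X = np.column_stack([x_linear, x_nonlinear, x_noise])
mi = mutual_info_classif(X, y, n_neighbors=3, random_state=0)
print(dict(zip(["x_linear", "x_nonlinear", "x_noise"], mi.round(3))))
```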

Compare: Correlation vs. Mutual Information—both measure feature-target relationships, but correlation only captures linear dependence while mutual information detects any statistical relationship. If you suspect non-linear patterns, mutual information is your diagnostic tool; for quick screening with interpretable coefficients, correlation wins.


Wrapper Methods: Let the Model Decide

Wrapper methods use actual model performance to evaluate feature subsets. They're powerful but computationally expensive, essentially treating feature selection as a search problem.

Forward Feature Selection

  • Starts empty and greedily adds the best feature at each step—evaluates model performance with each candidate addition
  • Stops when adding features no longer improves validation metrics—or when you hit a predefined feature budget
  • Computationally scales as $O(n \cdot k)$ where $n$ is the total number of features and $k$ is the number selected—cheaper than exhaustive search but still expensive for wide datasets (see the sketch below)
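
A minimal sketch with scikit-learn's SequentialFeatureSelector; the breast-cancer dataset, the scaler-plus-logistic-regression pipeline, and the budget of five features are all illustrative choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Greedy forward search: add the single best feature at each step,
# judged by cross-validated accuracy of the wrapped model
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
sfs = SequentialFeatureSelector(
    model,
    n_features_to_select=5,   # stop at a predefined feature budget
    direction="forward",
    scoring="accuracy",
    cv=5,
)
sfs.fit(X, y)
print(sfs.get_support(indices=True))  # indices of the five selected features
```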

Backward Feature Elimination

  • Starts with all features and removes the weakest iteratively—measures performance drop from each removal
  • Better at catching feature interactions than forward selection since it starts with the full context
  • Infeasible when $n \gg m$ (features exceed samples)—you can't fit a model with all features to begin with (the sketch below assumes a feasible, low-dimensional case)
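
The same SequentialFeatureSelector API covers backward elimination by flipping the search direction; a sketch mirroring the forward example, with the same illustrative dataset and feature budget:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Start from all 30 features and repeatedly drop the one whose removal
# hurts cross-validated accuracy the least, until only 5 remain
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
sbs = SequentialFeatureSelector(
    model,
    n_features_to_select=5,
    direction="backward",
    scoring="accuracy",
    cv=5,
)
sbs.fit(X, y)
print(sbs.get_support(indices=True))
```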

Recursive Feature Elimination (RFE)

  • Trains a model, ranks features by importance, and prunes the bottom tier—repeats until reaching target feature count
  • Leverages model-specific importance metrics—coefficient magnitudes for linear models, impurity reduction for trees
  • Requires a model with built-in feature ranking—works naturally with linear SVMs and linear or logistic regression, which expose coefficients, but less so with neural networks
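
A minimal RFE sketch pairing scikit-learn's RFE with a linear-kernel SVM, which exposes the coef_ attribute RFE uses for ranking; the dataset, the target of ten features, and the step size are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)  # scaling keeps SVM coefficients comparable

# Train, rank features by |coef_|, prune the weakest, and repeat
estimator = SVC(kernel="linear", C=1.0)
rfe = RFE(estimator, n_features_to_select=10, step=1)
rfe.fit(X_scaled, y)

print(rfe.support_)   # boolean mask of retained features
print(rfe.ranking_)   # 1 = selected; higher numbers were pruned earlier
```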

Compare: Forward Selection vs. Backward Elimination—forward is cheaper and works when you can't fit all features, backward better preserves interaction effects. In interview settings, know that backward elimination with cross-validation is more robust but forward selection is your fallback for ultra-high-dimensional data.


Embedded Methods: Selection During Training

Embedded methods perform feature selection as part of the model training process itself. They balance the efficiency of filters with the performance awareness of wrappers.

Lasso Regularization

  • Adds an $L_1$ penalty $\lambda \sum_i |w_i|$ to the loss function—this geometry encourages sparse solutions where some weights hit exactly zero
  • Automatic feature selection through coefficient shrinkage—features with zero coefficients are effectively removed
  • Struggles with correlated features—tends to arbitrarily select one from a correlated group, which hurts interpretability when you care about all relevant predictors
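
A sketch of Lasso-based selection with scikit-learn's LassoCV, which chooses the regularization strength by cross-validation; the diabetes dataset and the standardization step are illustrative choices:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

data = load_diabetes()
X, y, names = data.data, data.target, data.feature_names

# Standardize so the L1 penalty treats all coefficients on the same scale
X_scaled = StandardScaler().fit_transform(X)

# Cross-validated choice of the penalty strength (called alpha in scikit-learn)
lasso = LassoCV(cv=5, random_state=0).fit(X_scaled, y)

selected = [name for name, w in zip(names, lasso.coef_) if w != 0]
print("alpha:", lasso.alpha_)
print("kept:", selected)  # features whose coefficients were not shrunk to zero
```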

Random Forest Feature Importance

  • Ranks features by mean decrease in impurity (Gini or entropy) averaged across all trees and splits
  • Handles non-linear relationships and interactions natively—no assumptions about feature distributions
  • Biased toward high-cardinality features—categorical variables with many levels get artificially inflated importance scores; use permutation importance for unbiased estimates
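
A sketch contrasting impurity-based importances with permutation importance on held-out data, as the last bullet suggests; the dataset, tree count, and number of permutation repeats are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0
)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Impurity-based importances: fast, but biased toward high-cardinality features
impurity_top = sorted(
    zip(data.feature_names, rf.feature_importances_), key=lambda t: -t[1]
)[:5]
print(impurity_top)

# Permutation importance on held-out data is a less biased cross-check
perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
perm_top = sorted(
    zip(data.feature_names, perm.importances_mean), key=lambda t: -t[1]
)[:5]
print(perm_top)
```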

Compare: Lasso vs. Random Forest Importance—Lasso gives you a sparse linear model in one shot, while RF importance tells you what matters without forcing linearity. For interpretable models with automatic selection, Lasso wins; for understanding complex feature contributions before choosing a final model, RF importance is your exploratory tool.


Dimensionality Reduction: Transform, Don't Select

Dimensionality reduction creates new features rather than selecting existing ones. It's technically distinct from feature selection but solves similar problems—reducing input dimensionality while preserving signal.

Principal Component Analysis (PCA)

  • Projects data onto orthogonal axes that maximize variance—the first principal component captures the direction of greatest spread, and so on
  • Creates uncorrelated features by construction—eliminates multicollinearity entirely, which stabilizes many downstream models
  • Destroys interpretability—principal components are linear combinations of original features, making it impossible to say "feature X drove this prediction"
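
A minimal PCA sketch with scikit-learn, keeping enough components to cover roughly 95% of the variance; the dataset and the 95% target are illustrative:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)

# Standardize first, otherwise high-variance raw features dominate the components
X_scaled = StandardScaler().fit_transform(X)

# Keep as many orthogonal components as needed to explain ~95% of the variance
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)

print(X_scaled.shape, "->", X_pca.shape)       # 30 features -> far fewer components
print(pca.explained_variance_ratio_.round(3))  # variance captured per component

# The new features are uncorrelated by construction
corr = np.corrcoef(X_pca, rowvar=False)
print(np.allclose(corr, np.eye(X_pca.shape[1]), atol=1e-6))  # True
```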

Compare: PCA vs. Lasso—both reduce effective dimensionality, but PCA transforms features while Lasso selects them. Use PCA when you need decorrelated inputs and don't care about feature interpretability; use Lasso when stakeholders need to understand which original variables matter.


Quick Reference Table

Concept | Best Examples
Fast statistical screening | Variance Threshold, Correlation, Chi-squared
Non-linear relationship detection | Mutual Information, Random Forest Importance
Automatic sparsity during training | Lasso Regularization
Model-performance-driven search | RFE, Forward Selection, Backward Elimination
Handling high-cardinality categoricals | Chi-squared, Mutual Information
Multicollinearity reduction | Correlation-based Selection, PCA
Works with any model type | Filter methods (Variance, Correlation, MI)
Requires interpretable features | Lasso, Correlation-based (avoid PCA)

Self-Check Questions

  1. You have a dataset with 10,000 features and 500 samples. Which selection strategies are even feasible here, why does backward elimination fail outright, and what caveats apply to Lasso in this regime?

  2. Compare mutual information and Pearson correlation: if a feature has $r = 0$ but high mutual information with the target, what does that tell you about the relationship?

  3. A stakeholder asks which features drive your model's predictions. You used PCA for dimensionality reduction. What's the problem, and what alternative would you recommend?

  4. When would you choose RFE over Lasso regularization? Consider computational cost, model flexibility, and the type of importance signal each provides.

  5. You notice Random Forest feature importance ranks a categorical variable with 50 levels as most important, but domain experts are skeptical. What's likely happening, and how would you validate the result?