🤝 Collaborative Data Science

Feature Selection Methods

Why This Matters

Feature selection sits at the heart of building models that are both powerful and interpretable—two qualities that often seem at odds with each other. In collaborative data science workflows, choosing the right features isn't just about squeezing out better predictions; it's about creating analyses that your teammates can understand, reproduce, and build upon. When you reduce dimensionality thoughtfully, you're also reducing the risk of overfitting, multicollinearity, and computational bloat that can derail reproducible research.

You're being tested on your ability to match methods to data types and goals. Can you distinguish between filter methods that work independently of any model and wrapper methods that evaluate features through model performance? Do you know when regularization handles feature selection implicitly versus when you need explicit elimination? Don't just memorize method names—know what problem each method solves and what assumptions it makes about your data.


Filter Methods: Score Features Before Modeling

Filter methods evaluate features based on statistical properties before any model is trained. They're fast, scalable, and model-agnostic—making them ideal first passes in a reproducible pipeline.

Variance Threshold

  • Removes low-variance features—if a feature barely changes across observations, it can't help distinguish outcomes
  • Negligible computational cost compared to model-based methods; applies a simple statistical cutoff
  • Best used as preprocessing before applying more sophisticated selection techniques
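
As a quick illustration, here is a minimal sketch of a variance-threshold filter with scikit-learn; the synthetic data and the 0.01 cutoff are illustrative assumptions, not recommended defaults.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Illustrative data: 100 rows, 4 features; the last column is constant.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(size=100),          # informative variation
    rng.normal(size=100),
    rng.integers(0, 2, size=100),  # binary feature
    np.full(100, 5.0),             # constant column -> zero variance
])

# Drop any feature whose variance falls below the (illustrative) cutoff.
selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X)

print("kept columns:", selector.get_support(indices=True))
print("shape before/after:", X.shape, X_reduced.shape)
```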

Correlation-based Feature Selection

  • Measures linear relationships between each feature and the target using Pearson's r or similar metrics
  • Filters for high target correlation, low feature-feature correlation—reducing redundancy while preserving signal
  • Directly addresses multicollinearity, which can destabilize coefficient estimates in linear models
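
One way to implement this filter by hand with pandas; the 0.2 target-correlation and 0.9 redundancy cutoffs below are illustrative assumptions, not standard values.

```python
import pandas as pd
from sklearn.datasets import load_diabetes

# Load a small regression dataset as a DataFrame.
data = load_diabetes(as_frame=True)
df, y = data.data, data.target

# Keep features with reasonably strong absolute correlation to the target...
target_corr = df.corrwith(y).abs()
candidates = target_corr[target_corr > 0.2].index.tolist()

# ...then drop one feature from any highly correlated candidate pair.
corr_matrix = df[candidates].corr().abs()
selected = []
for col in candidates:
    if all(corr_matrix.loc[col, kept] < 0.9 for kept in selected):
        selected.append(col)

print("selected features:", selected)
```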

Chi-squared Test

  • Tests independence between categorical features and categorical targets using the χ² statistic
  • Requires non-negative values, so it's typically applied to count data or one-hot encoded features
  • Standard choice for text classification and other problems with discrete feature spaces
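
A minimal sketch with scikit-learn's SelectKBest and the χ² score function; the Iris data (non-negative measurements, so χ² applies) and k=2 are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# Iris features are non-negative measurements, so chi2 can be applied directly.
X, y = load_iris(return_X_y=True)

# Score each feature against the class labels and keep the top k (illustrative k=2).
selector = SelectKBest(score_func=chi2, k=2)
X_new = selector.fit_transform(X, y)

print("chi2 scores:", selector.scores_)
print("kept feature indices:", selector.get_support(indices=True))
```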

Mutual Information

  • Quantifies the information gained about the target from knowing a feature's value: I(X; Y)
  • Captures nonlinear relationships that correlation-based methods miss entirely
  • Works for both classification and regression, making it versatile across problem types
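
A small sketch using scikit-learn's mutual information estimator; the synthetic quadratic relationship is an assumption chosen to show a dependence that correlation would miss.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 500
x_linear = rng.normal(size=n)
x_nonlinear = rng.uniform(-3, 3, size=n)
x_noise = rng.normal(size=n)

# Target depends linearly on x_linear and quadratically on x_nonlinear.
y = (x_linear + x_nonlinear**2 > 1.5).astype(int)
X = np.column_stack([x_linear, x_nonlinear, x_noise])

# Mutual information picks up the quadratic dependence; Pearson correlation
# with x_nonlinear would be near zero because the relationship is symmetric.
mi = mutual_info_classif(X, y, random_state=0)
print(dict(zip(["linear", "nonlinear", "noise"], mi.round(3))))
```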

Compare: Correlation vs. Mutual Information—both measure feature-target relationships, but correlation only detects linear patterns while mutual information captures any statistical dependence. If an FRQ asks about nonlinear feature relationships, mutual information is your go-to example.


Wrapper Methods: Let the Model Decide

Wrapper methods use a specific model's performance to evaluate feature subsets. They're more computationally expensive but often yield better results because they account for feature interactions that filter methods ignore.

Forward Feature Selection

  • Starts empty, adds greedily—each iteration includes the feature that most improves model performance
  • Evaluates O(n · k) models, where n is the total number of features and k is the number selected, making it expensive for large feature sets
  • Produces minimal feature sets, useful when interpretability and simplicity are priorities
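
A hedged sketch using scikit-learn's SequentialFeatureSelector in forward mode; the logistic-regression estimator and the target of 3 features are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Greedy forward search: start with no features, add the one that most
# improves cross-validated score, and repeat until 3 features are selected.
estimator = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
sfs = SequentialFeatureSelector(
    estimator,
    n_features_to_select=3,   # illustrative target size
    direction="forward",
    cv=5,
)
sfs.fit(X, y)

print("selected feature indices:", sfs.get_support(indices=True))
```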

Backward Feature Elimination

  • Starts full, removes iteratively—drops the feature whose removal least hurts (or most helps) performance
  • Computationally heavier than forward selection since it begins with all features in the model
  • Better at preserving feature interactions that might be missed when building up from nothing
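
The same scikit-learn API covers backward elimination by flipping the direction argument; this is again only a sketch, with the wine data and the stopping point of 5 features as illustrative assumptions.

```python
from sklearn.datasets import load_wine
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Start from all 13 features and repeatedly drop the one whose removal
# hurts the cross-validated score the least, until 5 remain.
estimator = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
sbs = SequentialFeatureSelector(
    estimator,
    n_features_to_select=5,    # illustrative stopping point
    direction="backward",
    cv=5,
)
sbs.fit(X, y)

print("surviving feature indices:", sbs.get_support(indices=True))
```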

Recursive Feature Elimination (RFE)

  • Ranks features by model coefficients or importance scores, then removes the weakest iteratively
  • Requires a base estimator like linear regression, SVM, or random forest that provides feature rankings
  • Cross-validated variants (RFECV) automatically determine the optimal number of features
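
A minimal RFECV sketch; the logistic-regression base estimator, the synthetic data, and 5-fold cross-validation are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic data: 20 features, only 5 of which are informative.
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=5, random_state=0
)

# RFECV repeatedly drops the feature with the smallest absolute coefficient
# and uses cross-validation to decide how many features to keep.
rfecv = RFECV(estimator=LogisticRegression(max_iter=1000), step=1, cv=5)
rfecv.fit(X, y)

print("optimal number of features:", rfecv.n_features_)
print("feature ranking (1 = selected):", rfecv.ranking_)
```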

Compare: Forward Selection vs. Backward Elimination—forward is faster and finds sparse solutions; backward better preserves complex interactions but scales poorly. Choose forward when you suspect few features matter; choose backward when you're unsure what to cut.


Embedded Methods: Selection During Training

Embedded methods perform feature selection as part of the model fitting process. They balance the efficiency of filters with the model-awareness of wrappers.

Lasso (L1 Regularization)

  • Adds an L1 penalty to the loss function, Loss + λ Σ|wᵢ|, which encourages sparse coefficients
  • Drives coefficients exactly to zero, effectively removing features from the model automatically
  • Controlled by the regularization strength λ; higher values yield sparser models with fewer features
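
A hedged sketch using LassoCV, which tunes the penalty strength (called alpha in scikit-learn) by cross-validation; the diabetes data is an illustrative choice.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
feature_names = load_diabetes().feature_names

# Standardize so the L1 penalty treats all coefficients on the same scale.
X_std = StandardScaler().fit_transform(X)

# LassoCV searches over alpha (the lambda in the penalty term) with cross-validation.
lasso = LassoCV(cv=5, random_state=0).fit(X_std, y)

kept = [name for name, w in zip(feature_names, lasso.coef_) if not np.isclose(w, 0.0)]
print("chosen alpha:", round(lasso.alpha_, 4))
print("non-zero features:", kept)
```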

Random Forest Feature Importance

  • Measures importance via mean decrease in impurity (Gini or entropy) across all trees in the ensemble
  • Handles mixed feature types (continuous, categorical, ordinal) without preprocessing
  • Permutation importance variant offers a more robust alternative by measuring accuracy drop when features are shuffled
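
A sketch showing both importance flavors on one model; the dataset, tree count, and number of permutation repeats are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Impurity-based importance: mean decrease in Gini impurity across all trees.
impurity_importance = forest.feature_importances_

# Permutation importance: drop in held-out accuracy when each feature is shuffled.
perm = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=0)

print("top-3 by impurity:", impurity_importance.argsort()[::-1][:3])
print("top-3 by permutation:", perm.importances_mean.argsort()[::-1][:3])
```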

Compare: Lasso vs. Random Forest Importance—Lasso performs selection and estimation simultaneously in a linear framework; Random Forest provides importance rankings that require a separate threshold decision. Lasso assumes linear relationships; Random Forest captures nonlinear patterns.


Dimensionality Reduction: Transform, Don't Select

Unlike true feature selection, dimensionality reduction creates new features that compress information from the originals. The distinction matters for interpretability.

Principal Component Analysis (PCA)

  • Projects data onto orthogonal axes that maximize variance: the first principal component captures the most, then the second, and so on
  • Components are linear combinations of original features, meaning individual feature interpretations are lost
  • Requires standardization since PCA is sensitive to feature scales; always center and scale first
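
A minimal sketch that standardizes before projecting, as the bullets above require; the wine data and the two-component choice are illustrative assumptions.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = load_wine(return_X_y=True)

# Center and scale, then project onto the top 2 variance-maximizing axes.
pipeline = make_pipeline(StandardScaler(), PCA(n_components=2))
X_2d = pipeline.fit_transform(X)

pca = pipeline.named_steps["pca"]
print("explained variance ratio:", pca.explained_variance_ratio_.round(3))
print("shape before/after:", X.shape, X_2d.shape)
```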

Compare: PCA vs. Lasso—PCA transforms features into new uncorrelated components, sacrificing interpretability for variance retention. Lasso keeps original features but zeros out the unimportant ones. Use PCA for compression and visualization; use Lasso when you need to know which original features matter.


Quick Reference Table

Concept | Best Examples
Filter methods (model-agnostic) | Variance Threshold, Correlation, Chi-squared, Mutual Information
Wrapper methods (model-dependent) | Forward Selection, Backward Elimination, RFE
Embedded methods (during training) | Lasso, Random Forest Importance
Handles nonlinear relationships | Mutual Information, Random Forest Importance
Best for categorical features | Chi-squared Test
Produces sparse linear models | Lasso (L1 Regularization)
Dimensionality reduction (not selection) | PCA
Computationally cheapest | Variance Threshold, Correlation

Self-Check Questions

  1. Which two methods can capture nonlinear feature-target relationships, and how do their outputs differ?

  2. You have a dataset with 500 features and need to build an interpretable linear model. Compare the tradeoffs between using Lasso versus RFE with a linear base estimator.

  3. A colleague argues that PCA "selected" the 10 most important features. What's wrong with this framing, and how would you explain the distinction?

  4. For a classification problem with categorical predictors, which filter method is most appropriate and what assumption does it require about the data?

  5. Compare forward feature selection and backward feature elimination: under what circumstances would each be preferred, and what computational considerations apply?