Feature selection sits at the heart of building models that are both powerful and interpretable—two qualities that often seem at odds with each other. In collaborative data science workflows, choosing the right features isn't just about squeezing out better predictions; it's about creating analyses that your teammates can understand, reproduce, and build upon. When you reduce dimensionality thoughtfully, you're also reducing the risk of overfitting, multicollinearity, and computational bloat that can derail reproducible research.
You're being tested on your ability to match methods to data types and goals. Can you distinguish between filter methods that work independently of any model and wrapper methods that evaluate features through model performance? Do you know when regularization handles feature selection implicitly versus when you need explicit elimination? Don't just memorize method names—know what problem each method solves and what assumptions it makes about your data.
Filter methods evaluate features based on statistical properties before any model is trained. They're fast, scalable, and model-agnostic—making them ideal first passes in a reproducible pipeline.
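Here is a minimal sketch of a filter-only pass, assuming scikit-learn and a toy classification dataset; the thresholds and feature counts are illustrative choices, not prescriptions.

```python
# Filter-method pass: no model is trained on the target beyond the scoring statistic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import VarianceThreshold, SelectKBest, mutual_info_classif

X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)

# Step 1: drop near-constant features (uses only the feature matrix itself).
X_var = VarianceThreshold(threshold=0.01).fit_transform(X)

# Step 2: keep the k features with the highest mutual information with the target.
selector = SelectKBest(score_func=mutual_info_classif, k=5)
X_filtered = selector.fit_transform(X_var, y)

print(X.shape, "->", X_filtered.shape)  # e.g. (300, 20) -> (300, 5)
```

Because nothing here depends on a downstream estimator, the same filtered matrix can feed any model your team tries next, which is what makes filters a convenient first stage in a shared pipeline.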
Compare: Correlation vs. Mutual Information—both measure feature-target relationships, but correlation only detects linear patterns while mutual information captures any statistical dependence. If an FRQ asks about nonlinear feature relationships, mutual information is your go-to example.
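A quick sketch of that distinction on a purely quadratic relationship, assuming numpy, scipy, and scikit-learn are available; the noise level and sample size are arbitrary.

```python
# Quadratic dependence: strong relationship, but no linear trend.
import numpy as np
from scipy.stats import pearsonr
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=1000)
y = x ** 2 + rng.normal(scale=0.1, size=1000)

r, _ = pearsonr(x, y)                                      # linear measure
mi = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0]  # general dependence

print(f"Pearson r        ~ {r:.2f}")   # near 0: correlation misses the pattern
print(f"Mutual information ~ {mi:.2f}")  # clearly > 0: dependence detected
```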
Wrapper methods use a specific model's performance to evaluate feature subsets. They're more computationally expensive but often yield better results because they account for feature interactions that filter methods ignore.
Compare: Forward Selection vs. Backward Elimination—forward is faster and finds sparse solutions; backward better preserves complex interactions but scales poorly. Choose forward when you suspect few features matter; choose backward when you're unsure what to cut.
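A hedged sketch of both directions using scikit-learn's SequentialFeatureSelector; the linear base estimator, cross-validation setup, and target subset size are illustrative assumptions.

```python
# Wrapper selection: each candidate subset is scored by refitting the model.
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)
model = LinearRegression()

# Forward: start empty, add the feature that most improves CV score each round.
forward = SequentialFeatureSelector(model, n_features_to_select=3,
                                    direction="forward", cv=5).fit(X, y)

# Backward: start with everything, drop the least useful feature each round.
backward = SequentialFeatureSelector(model, n_features_to_select=3,
                                     direction="backward", cv=5).fit(X, y)

print("Forward keeps :", forward.get_support(indices=True))
print("Backward keeps:", backward.get_support(indices=True))
```

The two directions can disagree on which subset to keep, which is exactly the interaction-versus-cost tradeoff described above: backward sees every feature in context from the start but pays for it with many large model fits.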
Embedded methods perform feature selection as part of the model fitting process. They balance the efficiency of filters with the model-awareness of wrappers.
Compare: Lasso vs. Random Forest Importance—Lasso performs selection and estimation simultaneously in a linear framework; Random Forest provides importance rankings that require a separate threshold decision. Lasso assumes linear relationships; Random Forest captures nonlinear patterns.
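A minimal sketch contrasting the two embedded approaches on synthetic data; the alpha value, forest size, and top-4 cutoff are illustrative assumptions, not recommendations.

```python
# Lasso zeroes coefficients during fitting; a forest only ranks features.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=15, n_informative=4,
                       noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)  # L1 penalties are scale-sensitive

lasso = Lasso(alpha=1.0).fit(X, y)
kept_by_lasso = np.flatnonzero(lasso.coef_)              # selection is built in

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
ranked = np.argsort(forest.feature_importances_)[::-1]   # ranking only
kept_by_forest = ranked[:4]                              # the threshold is your call

print("Lasso keeps  :", kept_by_lasso)
print("Forest top-4 :", kept_by_forest)
```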
Unlike true feature selection, dimensionality reduction creates new features that compress information from the originals. The distinction matters for interpretability.
Compare: PCA vs. Lasso—PCA transforms features into new uncorrelated components, sacrificing interpretability for variance retention. Lasso keeps original features but zeros out the unimportant ones. Use PCA for compression and visualization; use Lasso when you need to know which original features matter.
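A short sketch of the transformation-versus-selection distinction, assuming scikit-learn; component and feature counts are arbitrary.

```python
# PCA returns new axes that mix every original column; Lasso keeps a subset of the originals.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=10, n_informative=3, random_state=1)
X = StandardScaler().fit_transform(X)

pca = PCA(n_components=3).fit(X)
# Each component has a loading on all 10 original features -- nothing was "selected".
print("PCA loadings shape  :", pca.components_.shape)   # (3, 10)

lasso = Lasso(alpha=1.0).fit(X, y)
# Surviving coefficients correspond to original, nameable features.
print("Lasso keeps features:", np.flatnonzero(lasso.coef_))
```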
| Concept | Best Examples |
|---|---|
| Filter methods (model-agnostic) | Variance Threshold, Correlation, Chi-squared, Mutual Information |
| Wrapper methods (model-dependent) | Forward Selection, Backward Elimination, RFE |
| Embedded methods (during training) | Lasso, Random Forest Importance |
| Handles nonlinear relationships | Mutual Information, Random Forest Importance |
| Best for categorical features | Chi-squared Test |
| Produces sparse linear models | Lasso (L1 Regularization) |
| Dimensionality reduction (not selection) | PCA |
| Computationally cheapest | Variance Threshold, Correlation |
1. Which two methods can capture nonlinear feature-target relationships, and how do their outputs differ?
2. You have a dataset with 500 features and need to build an interpretable linear model. Compare the tradeoffs between using Lasso versus RFE with a linear base estimator.
3. A colleague argues that PCA "selected" the 10 most important features. What's wrong with this framing, and how would you explain the distinction?
4. For a classification problem with categorical predictors, which filter method is most appropriate, and what assumption does it require about the data?
5. Compare forward feature selection and backward feature elimination: under what circumstances would each be preferred, and what computational considerations apply?