Feature selection sits at the heart of building models that actually work in production. You're being tested on your ability to recognize when to apply each technique, why certain methods outperform others in specific contexts, and how feature selection connects to broader concepts like the bias-variance tradeoff, model interpretability, and computational efficiency. These aren't just preprocessing steps—they're strategic decisions that determine whether your model generalizes or fails spectacularly on new data.
The techniques below demonstrate core principles: statistical dependence, regularization theory, information theory, and iterative optimization. Don't just memorize what each method does—know what problem it solves, what assumptions it makes, and when it breaks down. That's what separates engineers who can debug pipelines from those who just copy-paste from tutorials.
Filter methods evaluate features independently of any model, using statistical properties to rank or eliminate candidates. They're fast and model-agnostic, but they ignore feature interactions and can miss complex relationships.
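As a concrete illustration, here is a minimal filter-pipeline sketch using scikit-learn's VarianceThreshold and SelectKBest on synthetic data; the variance threshold, the k=10 cut-off, and the dataset itself are arbitrary choices for demonstration, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif

# Synthetic data: 500 samples, 50 features, only 5 of which carry signal.
X, y = make_classification(n_samples=500, n_features=50, n_informative=5, random_state=0)

# Step 1: drop near-constant features -- no model, no target needed.
X_var = VarianceThreshold(threshold=0.01).fit_transform(X)

# Step 2: rank the survivors by a univariate F-statistic and keep the top 10.
selector = SelectKBest(score_func=f_classif, k=10)
X_filtered = selector.fit_transform(X_var, y)

print(X.shape, "->", X_filtered.shape)  # (500, 50) -> (500, 10)
```

Because neither step fits a predictive model, this screening scales to very wide datasets, but it will happily discard a feature that only matters in combination with another.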
Compare: Correlation vs. Mutual Information—both measure feature-target relationships, but correlation only captures linear dependence while mutual information detects any statistical relationship. If you suspect non-linear patterns, mutual information is your diagnostic tool; for quick screening with interpretable coefficients, correlation wins.
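To see the difference, consider a purely quadratic relationship: the linear trend cancels out, so Pearson correlation sits near zero, while mutual information still flags the dependence. A minimal sketch on synthetic data with illustrative settings:

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=1000)
y = x**2 + rng.normal(scale=0.1, size=1000)  # non-linear, non-monotonic target

r, _ = pearsonr(x, y)
mi = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0]

print(f"Pearson r:   {r:.3f}")   # close to 0 -- no linear trend to detect
print(f"Mutual info: {mi:.3f}")  # clearly positive -- dependence detected
```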
Wrapper methods use actual model performance to evaluate feature subsets. They're powerful but computationally expensive, essentially treating feature selection as a search problem.
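One common way this search plays out in code is recursive feature elimination with cross-validation. The sketch below uses scikit-learn's RFECV with a logistic regression on synthetic data; the step size, CV folds, and scorer are just example settings.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=30, n_informative=5, random_state=0)

# Repeatedly refit the model, drop the weakest feature, and score each
# candidate subset with 5-fold cross-validation -- model performance drives the search.
search = RFECV(
    estimator=LogisticRegression(max_iter=1000),
    step=1,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

print("Features kept:", search.n_features_)
```

The cost is the giveaway: every candidate subset means another round of model fits.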
Compare: Forward Selection vs. Backward Elimination—forward is cheaper and works when you can't fit all features at once, while backward better preserves interaction effects. In interview settings, know that backward elimination with cross-validation is more robust, but forward selection is your fallback for ultra-high-dimensional data.
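With scikit-learn the two searches differ by a single argument, which makes the comparison easy to demo; the target of 5 selected features and the synthetic dataset below are arbitrary illustrations.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)
model = LogisticRegression(max_iter=1000)

# Forward: start empty, greedily add the feature that improves the CV score most.
forward = SequentialFeatureSelector(
    model, n_features_to_select=5, direction="forward", cv=5
).fit(X, y)

# Backward: start with everything, greedily drop the least useful feature.
backward = SequentialFeatureSelector(
    model, n_features_to_select=5, direction="backward", cv=5
).fit(X, y)

print("Forward picks: ", forward.get_support(indices=True))
print("Backward picks:", backward.get_support(indices=True))
```

Note that backward elimination must fit the full 20-feature model in its early steps, which is exactly why it becomes impractical once features vastly outnumber samples.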
Embedded methods perform feature selection as part of the model training process itself. They balance the efficiency of filters with the performance awareness of wrappers.
Compare: Lasso vs. Random Forest Importance—Lasso gives you a sparse linear model in one shot, while RF importance tells you what matters without forcing linearity. For interpretable models with automatic selection, Lasso wins; for understanding complex feature contributions before choosing a final model, RF importance is your exploratory tool.
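A side-by-side sketch on synthetic regression data; the Lasso alpha and the forest size are illustrative, not tuned values.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=500, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)
X = StandardScaler().fit_transform(X)  # L1 penalties assume comparable feature scales

# Lasso: selection happens during training -- weak coefficients shrink to exactly zero.
lasso = Lasso(alpha=1.0).fit(X, y)
print("Lasso keeps:", np.flatnonzero(lasso.coef_))

# Random forest: no sparsity, but an importance score per feature, linear or not.
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
print("RF top 5:   ", np.argsort(rf.feature_importances_)[::-1][:5])
```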
Dimensionality reduction creates new features rather than selecting existing ones. It's technically distinct from feature selection but solves similar problems—reducing input dimensionality while preserving signal.
Compare: PCA vs. Lasso—both reduce effective dimensionality, but PCA transforms features while Lasso selects them. Use PCA when you need decorrelated inputs and don't care about feature interpretability; use Lasso when stakeholders need to understand which original variables matter.
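A short PCA sketch makes the trade-off visible: each reduced feature is a linear mixture of every original column, which is exactly what makes it hard to explain to stakeholders. The 95% variance target and the synthetic low-rank data are arbitrary choices for demonstration.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Low effective rank -> correlated features, the case where PCA helps most.
X, _ = make_regression(n_samples=500, n_features=20, effective_rank=5, random_state=0)
X_scaled = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scale

pca = PCA(n_components=0.95)  # keep enough components to explain 95% of the variance
X_reduced = pca.fit_transform(X_scaled)

print("Original dims:", X_scaled.shape[1])
print("Reduced dims: ", X_reduced.shape[1])
print("Each component mixes all originals:", pca.components_.shape)
```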
| Scenario | Best Techniques |
|---|---|
| Fast statistical screening | Variance Threshold, Correlation, Chi-squared |
| Non-linear relationship detection | Mutual Information, Random Forest Importance |
| Automatic sparsity during training | Lasso Regularization |
| Model-performance-driven search | RFE, Forward Selection, Backward Elimination |
| Handling high-cardinality categoricals | Chi-squared, Mutual Information |
| Multicollinearity reduction | Correlation-based Selection, PCA |
| Works with any model type | Filter methods (Variance, Correlation, MI) |
| Requires interpretable features | Lasso, Correlation-based (avoid PCA) |
You have a dataset with 10,000 features and 500 samples. Which selection strategies are even feasible here, and why does backward elimination break down in this regime while filter methods and Lasso remain practical?
Compare mutual information and Pearson correlation: if a feature has near-zero correlation but high mutual information with the target, what does that tell you about the relationship?
A stakeholder asks which features drive your model's predictions. You used PCA for dimensionality reduction. What's the problem, and what alternative would you recommend?
When would you choose RFE over Lasso regularization? Consider computational cost, model flexibility, and the type of importance signal each provides.
You notice Random Forest feature importance ranks a categorical variable with 50 levels as most important, but domain experts are skeptical. What's likely happening, and how would you validate the result?