
👩‍💻 Foundations of Data Science

Key Classification Models


Why This Matters

Classification is the backbone of predictive analytics—it's how we teach machines to make decisions. Whether you're building a spam filter, diagnosing diseases from medical images, or predicting customer churn, you're using classification. The models in this guide represent fundamentally different approaches to the same problem: how do we draw boundaries between categories in data? Understanding these differences is what separates someone who can run code from someone who can choose the right tool for the job.

You're being tested on more than just knowing what each algorithm does. Exam questions will ask you to compare trade-offs, explain when to use which model, and identify why certain algorithms fail in specific scenarios. Don't just memorize definitions—know what problem each model solves best, what assumptions it makes, and how it handles the messy realities of real-world data. Master the why, and the what follows naturally.


Linear & Probabilistic Models

These models assume relatively simple mathematical relationships between features and outcomes. They're fast, interpretable, and often your first line of attack—but they struggle when the true decision boundary is complex or when their assumptions don't hold.

Logistic Regression

  • Predicts probabilities for binary outcomes using the logistic (sigmoid) function: $P(y=1) = \frac{1}{1 + e^{-z}}$, where $z$ is a linear combination of features
  • Coefficients represent log-odds changes—a one-unit increase in a predictor shifts the log-odds of the outcome by that coefficient's value
  • Assumes linear decision boundaries in feature space; extend to multiclass problems using one-vs-all or softmax regression (see the sketch after this list)
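
A minimal sketch of these ideas in scikit-learn, assuming a synthetic dataset from make_classification and default hyperparameters. The point is that the model's predicted probability is just the sigmoid applied to a linear combination of the features:

```python
# A rough sketch (not the guide's own code): recovering predict_proba from the
# sigmoid by hand. The synthetic dataset and default hyperparameters are assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
clf = LogisticRegression().fit(X, y)

# Each coefficient is the change in log-odds per one-unit increase in that feature
print("log-odds coefficients:", clf.coef_.ravel())

# Reproduce the model's probability for the first example directly from the sigmoid
z = X[0] @ clf.coef_.ravel() + clf.intercept_[0]   # z: linear combination of features
p_manual = 1.0 / (1.0 + np.exp(-z))                # P(y=1) = 1 / (1 + e^{-z})
print(p_manual, clf.predict_proba(X[:1])[0, 1])    # the two values should match
```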

Naive Bayes

  • Applies Bayes' theorem with a strong independence assumption—it assumes all features contribute independently to the probability of each class
  • Excels at text classification tasks like spam detection and sentiment analysis, where the independence assumption is "naive" but surprisingly effective
  • Trains extremely fast even on large datasets; variants include Gaussian (continuous data), Multinomial (count data), and Bernoulli (binary features); a short example follows below
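
A minimal sketch of the Multinomial variant on word counts; the toy messages and labels are made up for illustration and stand in for a real labeled spam corpus:

```python
# A toy sketch with invented messages; real spam filtering would use a labeled corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["win a free prize now", "meeting rescheduled to friday",
         "free cash offer click now", "lunch tomorrow with the team"]
labels = [1, 0, 1, 0]                      # 1 = spam, 0 = not spam (hypothetical labels)

vec = CountVectorizer()                    # word-count features -> Multinomial variant
X_counts = vec.fit_transform(texts)
model = MultinomialNB().fit(X_counts, labels)

# Probability that a new message is spam, under the naive independence assumption
print(model.predict_proba(vec.transform(["free prize offer"]))[0, 1])
```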

Compare: Logistic Regression vs. Naive Bayes—both produce probability estimates, but logistic regression learns feature weights directly while Naive Bayes estimates class-conditional probabilities. Use Naive Bayes when you have limited training data or need speed; choose logistic regression when features are correlated or interact, since it does not rely on the independence assumption.


Tree-Based Models

Tree-based models partition the feature space into regions using sequential splits. They're intuitive, handle mixed data types naturally, and form the foundation for some of the most powerful ensemble methods. The core trade-off: individual trees are interpretable but prone to overfitting; ensembles are powerful but less transparent.

Decision Trees

  • Splits data recursively based on feature thresholds—each internal node tests a condition, each branch represents an outcome, and each leaf assigns a class
  • Highly interpretable and visualization-friendly—you can trace exactly why a prediction was made, which matters for explainability requirements
  • Prone to overfitting on training data; control complexity through pruning, max depth limits, or minimum samples per leaf (illustrated in the sketch below)
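
A minimal sketch on the iris dataset; the depth and leaf-size limits are arbitrary illustrations of the complexity controls above, and printing the tree shows why these models are so interpretable:

```python
# A minimal sketch: a depth-limited tree printed as human-readable threshold rules.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
tree.fit(iris.data, iris.target)

# Each printed line is a threshold test at an internal node; leaves assign a class
print(export_text(tree, feature_names=list(iris.feature_names)))
```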

Random Forests

  • Combines many decision trees trained on random data subsets—each tree sees a bootstrapped sample and a random subset of features at each split
  • Reduces overfitting through averaging—individual tree errors cancel out, and majority voting produces robust classifications
  • Provides feature importance scores by measuring how much each variable contributes to reducing impurity across all trees (see the example below)
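
A minimal sketch on the breast-cancer dataset (an illustrative choice, with untuned hyperparameters) that fits a forest and lists the top impurity-based feature importances:

```python
# A minimal sketch with untuned, illustrative hyperparameters.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
rf = RandomForestClassifier(n_estimators=200,      # number of bootstrapped trees
                            max_features="sqrt",   # random feature subset at each split
                            random_state=0)
rf.fit(data.data, data.target)

# Impurity reduction attributed to each feature, averaged across all trees
ranked = sorted(zip(data.feature_names, rf.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked[:5]:
    print(f"{name}: {score:.3f}")
```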

Gradient Boosting Machines

  • Builds trees sequentially to correct previous errors—each new tree fits the residual errors of the ensemble so far, gradually improving predictions
  • Highly flexible with customizable loss functions—can optimize accuracy, log-loss, or domain-specific objectives
  • Requires careful tuning to prevent overfitting—use learning rate, tree depth limits, and regularization; popular implementations include XGBoost, LightGBM, and CatBoost (the sketch below shows the main knobs)
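
A minimal sketch using scikit-learn's GradientBoostingClassifier; XGBoost, LightGBM, and CatBoost expose analogous parameters. The dataset and hyperparameter values are illustrative assumptions, not tuned choices:

```python
# A minimal sketch of the main boosting knobs; values are illustrative, not tuned.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbm = GradientBoostingClassifier(
    n_estimators=300,    # trees added sequentially, each fitting current residual errors
    learning_rate=0.05,  # shrinks each tree's contribution to limit overfitting
    max_depth=3,         # shallow trees act as weak learners
    subsample=0.8,       # fit each tree on a random 80% of rows (stochastic boosting)
    random_state=0,
)
gbm.fit(X_train, y_train)
print("held-out accuracy:", gbm.score(X_test, y_test))
```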

Compare: Random Forests vs. Gradient Boosting—both are tree ensembles, but Random Forests build trees independently (parallel), while Gradient Boosting builds them sequentially (each correcting the last). Random Forests are more robust out-of-the-box; Gradient Boosting often achieves higher accuracy with proper tuning. If an FRQ asks about ensemble methods, distinguish between bagging (Random Forests) and boosting (GBM).


Distance & Margin-Based Models

These models make decisions based on geometric relationships in feature space—either by measuring distances to known examples or by finding optimal separating boundaries. They often require careful feature scaling and can be sensitive to irrelevant dimensions.

K-Nearest Neighbors (KNN)

  • Classifies by majority vote of the k closest training examples—no explicit model is learned; predictions depend entirely on stored data (instance-based learning)
  • Choice of k controls the bias-variance trade-off—small k captures local patterns but is noise-sensitive; large k smooths decisions but may miss detail
  • Requires feature scaling and thoughtful distance metrics—Euclidean distance is standard, but Manhattan or custom metrics may suit specific problems; computationally expensive at prediction time for large datasets (see the sketch below)
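
A minimal sketch, assuming the wine dataset and a few arbitrary values of k, that pairs scaling with KNN in a pipeline and compares cross-validated accuracy:

```python
# A minimal sketch: scaling + KNN in one pipeline, trying a few arbitrary k values.
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
for k in (1, 5, 15):
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    acc = cross_val_score(knn, X, y, cv=5).mean()
    print(f"k={k}: mean CV accuracy {acc:.3f}")   # small k: low bias, high variance
```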

Support Vector Machines (SVM)

  • Finds the hyperplane that maximizes margin between classes—the decision boundary is positioned to be as far as possible from the nearest points of each class (support vectors)
  • Handles non-linear boundaries using kernel functions—the kernel trick maps data to higher dimensions where linear separation becomes possible (common kernels: RBF, polynomial)
  • Effective in high-dimensional spaces like text and image classification, but sensitive to hyperparameter choices (C, kernel type, γ) and doesn't scale well to very large datasets (see the example below)
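
A minimal sketch of an RBF-kernel SVM on a deliberately non-linear toy dataset (make_moons); the C and gamma values here are default-like placeholders, not tuned choices:

```python
# A minimal sketch: RBF-kernel SVM with feature scaling on a non-linear toy problem.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.25, random_state=0)   # non-linear boundary
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
svm.fit(X_train, y_train)
print("held-out accuracy:", svm.score(X_test, y_test))
print("support vectors per class:", svm.named_steps["svc"].n_support_)
```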

Compare: KNN vs. SVM—both rely on geometric relationships, but KNN makes local decisions based on nearby points while SVM finds a global optimal boundary. KNN is simple but slow at prediction; SVM is complex to tune but efficient once trained. Use KNN for quick prototyping; use SVM when you need strong generalization in high-dimensional data.


Neural Network Models

Neural networks learn hierarchical representations through layers of interconnected nodes. They're the most flexible classification tools available—capable of modeling arbitrarily complex relationships, but at the cost of interpretability, data requirements, and computational resources.

Neural Networks

  • Composed of layers of neurons with learnable weights—input flows through hidden layers, with each neuron applying a weighted sum followed by a non-linear activation function (ReLU, sigmoid, softmax)
  • Excel at high-dimensional, unstructured data—state-of-the-art for image classification, speech recognition, and natural language processing where feature engineering is impractical
  • Require substantial data and compute resources—architecture choices (depth, width, activation functions) dramatically affect performance; transfer learning allows leveraging pre-trained models for new tasks (a small example follows below)
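
A minimal sketch of a small fully connected network using scikit-learn's MLPClassifier on the digits dataset; the layer sizes and iteration count are illustrative assumptions, and real image or text workloads would typically use a dedicated deep learning framework with pre-trained models:

```python
# A minimal sketch: a small multi-layer perceptron on low-resolution digit images.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(64, 32),   # two hidden layers of ReLU units
                  activation="relu",
                  max_iter=500,
                  random_state=0),
)
mlp.fit(X_train, y_train)
print("held-out accuracy:", mlp.score(X_test, y_test))
```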

Compare: Gradient Boosting vs. Neural Networks—for tabular/structured data, gradient boosting often matches or beats neural networks with less tuning. Neural networks dominate when data is unstructured (images, text, audio) or when you have massive datasets. If asked which to choose for a classification task, consider data type, dataset size, and interpretability requirements.


Quick Reference Table

| Concept | Best Examples |
| --- | --- |
| Probabilistic/Linear Models | Logistic Regression, Naive Bayes |
| Tree-Based Single Models | Decision Trees |
| Bagging Ensembles | Random Forests |
| Boosting Ensembles | Gradient Boosting Machines (XGBoost, LightGBM) |
| Instance-Based Learning | K-Nearest Neighbors |
| Margin Maximization | Support Vector Machines |
| Deep Learning | Neural Networks |
| Best for Text Classification | Naive Bayes, SVM, Neural Networks |
| Most Interpretable | Decision Trees, Logistic Regression |

Self-Check Questions

  1. Which two models both produce probability estimates but make fundamentally different assumptions about how features relate to outcomes? What is each model's core assumption?

  2. Explain the difference between bagging and boosting as ensemble strategies. Which model in this guide uses each approach, and how does this affect their sensitivity to overfitting?

  3. A colleague suggests using KNN for a dataset with 500 features and 10 million records. What two major problems would you warn them about, and which alternative model might you recommend?

  4. Compare and contrast how Decision Trees and SVMs create decision boundaries. Which is more interpretable? Which handles non-linear relationships more naturally without modification?

  5. You're building a classifier for medical images with 100,000 labeled examples. Rank these three options—Logistic Regression, Random Forests, Neural Networks—from least to most appropriate, and justify your ranking based on data type and dataset size.