
📊 Big Data Analytics and Visualization

Popular Machine Learning Models


Why This Matters

Machine learning models are the engines that power modern data science, and understanding them is essential for tackling any big data challenge. You're not just being tested on what each model does—you're being tested on when to use which model, why certain models outperform others in specific scenarios, and how different algorithms approach the same fundamental problem. The models in this guide represent distinct strategies for extracting patterns from data: some predict continuous values, others classify categories, some require labeled training data, and others discover hidden structures on their own.

The key to mastering this content is recognizing that every model makes trade-offs between interpretability and complexity, speed and accuracy, and flexibility and overfitting risk. When you encounter a problem on an exam or in practice, you need to match the model to the data characteristics and business requirements. Don't just memorize algorithms—know what type of problem each one solves and what assumptions it makes about your data.


Regression Models: Predicting Continuous Values

These models estimate numerical outcomes by learning relationships between input features and a target variable. They work by fitting mathematical functions to training data, then using those functions to predict new values.

Linear Regression

  • Models relationships using a linear equation $y = mx + b$—the simplest approach to predicting continuous outcomes like sales revenue or temperature
  • Assumes linearity between variables, which means performance degrades significantly when the true relationship is curved or complex
  • Highly sensitive to outliers—a single extreme data point can dramatically skew your prediction line, making data cleaning critical

Logistic Regression

  • Predicts probability of categorical outcomes using the logistic function to output values between 0 and 1—despite the name, it's a classification model
  • Assumes linear relationship with log-odds, expressed as $\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x$, where $p$ is the probability of the positive class
  • Extends to multiclass problems through one-vs-all or softmax approaches, making it versatile for real-world classification tasks

Compare: Linear Regression vs. Logistic Regression—both assume linear relationships and use similar mathematical foundations, but linear regression predicts continuous values while logistic regression predicts class probabilities. If an exam question asks about predicting "whether" something happens, think logistic; if it asks "how much," think linear.
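
To make the "how much" vs. "whether" distinction concrete, here is a minimal sketch, assuming scikit-learn and NumPy are available; the synthetic data and coefficient values are invented purely for illustration, not taken from this guide.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))                      # 200 samples, 3 features

# "How much": a continuous target -> linear regression
y_amount = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=200)
lin = LinearRegression().fit(X, y_amount)
print("learned slopes (the m's in y = mx + b):", lin.coef_)

# "Whether": a binary target -> logistic regression
y_happened = (X[:, 0] + X[:, 2] > 0).astype(int)
logit = LogisticRegression().fit(X, y_happened)
print("P(class = 1) for first sample:", logit.predict_proba(X[:1])[0, 1])
```

Note that both models fit linear coefficients; the logistic model simply pushes the linear combination through the logistic function to produce a probability.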


Tree-Based Models: Learning Through Decisions

Tree-based algorithms split data into increasingly specific subgroups based on feature values. They mimic human decision-making by asking a series of yes/no questions, making them highly interpretable.

Decision Trees

  • Creates flowchart-like structures that split data at each node based on feature thresholds—easy to visualize and explain to non-technical stakeholders
  • Handles both numerical and categorical data without requiring extensive preprocessing or normalization
  • Prone to overfitting when trees grow too deep; pruning techniques and depth limits help control model complexity

Random Forests

  • Combines multiple decision trees through ensemble learning—each tree votes on the prediction, reducing variance and improving accuracy
  • Trains each tree on random data subsets (bootstrap aggregating or bagging), which prevents any single tree from memorizing noise
  • Provides feature importance scores automatically, helping you identify which variables drive predictions most strongly

Compare: Decision Trees vs. Random Forests—both use the same splitting logic, but a single decision tree is interpretable and fast while Random Forests sacrifice some interpretability for dramatically better generalization. Use decision trees when you need to explain every prediction; use Random Forests when accuracy matters most.
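
A quick sketch of the single-tree vs. ensemble trade-off, assuming scikit-learn and its built-in iris dataset; the depth limit and number of trees are illustrative choices, not tuned values.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# One interpretable tree, with a depth limit to curb overfitting
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
print("tree accuracy:  ", cross_val_score(tree, X, y, cv=5).mean())

# A bagged ensemble of trees: lower variance, less interpretable
forest = RandomForestClassifier(n_estimators=200, random_state=0)
print("forest accuracy:", cross_val_score(forest, X, y, cv=5).mean())

# Feature importance scores come for free with the forest
forest.fit(X, y)
print("feature importances:", forest.feature_importances_)
```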


Distance and Boundary Models: Separating Classes in Space

These algorithms classify data by measuring distances or finding optimal boundaries in feature space. They treat each data point as a location in multi-dimensional space and make predictions based on geometric relationships.

Support Vector Machines (SVM)

  • Finds the optimal hyperplane that maximizes the margin between classes—the decision boundary sits as far as possible from both groups
  • Uses kernel functions to handle non-linear relationships by transforming data into higher dimensions where linear separation becomes possible
  • Excels in high-dimensional spaces where the number of features exceeds the number of samples, common in text classification and genomics

K-Nearest Neighbors (KNN)

  • Classifies based on majority vote of the $k$ closest training examples—a non-parametric approach that makes no assumptions about data distribution
  • Simple to implement and understand, requiring no training phase since it stores all data and computes distances at prediction time
  • Suffers from the curse of dimensionality—as features increase, distance metrics become less meaningful and performance degrades

Compare: SVM vs. KNN—both operate in feature space, but SVM learns a fixed decision boundary during training while KNN computes classifications on-the-fly using stored examples. SVM scales better to high dimensions; KNN is simpler but requires careful tuning of $k$ and distance metrics.
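
A minimal sketch contrasting the two, assuming scikit-learn; the synthetic dataset, kernel, and hyperparameter values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Both methods are margin/distance based, so feature scaling matters
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# SVM: learns a fixed boundary during fit(); the RBF kernel handles non-linearity
svm = SVC(kernel="rbf", C=1.0).fit(X_train, y_train)
print("SVM accuracy:", svm.score(X_test, y_test))

# KNN: "training" just stores the data; distances are computed at predict time
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("KNN accuracy:", knn.score(X_test, y_test))
```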


Probabilistic Models: Learning from Likelihood

Probabilistic algorithms use statistical principles to calculate the likelihood of outcomes. They apply probability theory—particularly Bayes' theorem—to make predictions based on observed evidence.

Naive Bayes

  • Applies Bayes' theorem $P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$ to calculate class probabilities given feature values
  • Assumes feature independence, meaning each predictor contributes independently to the outcome—a "naive" assumption that rarely holds but often works surprisingly well
  • Excels at text classification tasks like spam detection and sentiment analysis, where word frequencies serve as features
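
A compact sketch of the spam-style text setup described above, assuming scikit-learn; the tiny labeled corpus is invented for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy labeled corpus: 1 = spam, 0 = not spam (purely illustrative)
texts = ["win a free prize now", "meeting moved to friday",
         "free cash offer click now", "lunch with the project team"]
labels = [1, 0, 1, 0]

# Word counts as features, then Bayes' theorem with the independence assumption
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["claim your free offer"]))       # likely [1] (spam)
print(model.predict_proba(["see you at the meeting"]))
```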

Neural Networks: Learning Complex Patterns

Neural networks use interconnected layers of artificial neurons to model highly complex, non-linear relationships. They learn hierarchical representations of data through backpropagation, adjusting connection weights to minimize prediction error.

Neural Networks

  • Organized in layers of neurons—input layer receives features, hidden layers extract patterns, output layer produces predictions
  • Particularly effective for unstructured data like images, audio, and text, where traditional feature engineering struggles to capture relevant patterns
  • Requires substantial data and compute power for training; techniques like dropout and regularization help prevent overfitting on smaller datasets
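
As a small-scale illustration, here is a sketch using scikit-learn's MLPClassifier on its built-in digit images; the layer sizes and L2 penalty (alpha) are illustrative assumptions, and dropout itself lives in deep learning frameworks such as TensorFlow or PyTorch rather than in this class.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# 8x8 digit images flattened to 64 input features, 10 output classes
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Input layer (64 features) -> two hidden layers -> output layer (10 classes)
net = MLPClassifier(hidden_layer_sizes=(64, 32), alpha=1e-4,  # alpha = L2 penalty
                    max_iter=500, random_state=0)
net.fit(X_train, y_train)
print("test accuracy:", net.score(X_test, y_test))
```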

Compare: Naive Bayes vs. Neural Networks—both can tackle classification, but Naive Bayes is fast, interpretable, and works well with limited data, while Neural Networks require more resources but capture complex non-linear patterns. For text classification with clear features, try Naive Bayes first; for image recognition, neural networks are typically necessary.


Unsupervised Learning: Finding Hidden Structure

These algorithms discover patterns in data without labeled outcomes. They identify natural groupings or reduce complexity by finding the most informative representations of the data.

K-Means Clustering

  • Partitions data into $k$ clusters by iteratively assigning points to the nearest centroid and updating centroid positions until convergence
  • Requires specifying $k$ in advance—techniques like the elbow method or silhouette scores help determine the optimal number of clusters
  • Sensitive to initial centroid placement; smarter seeding (k-means++) and multiple random restarts improve consistency and result quality
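
A short sketch of the fit-and-choose-$k$ workflow, assuming scikit-learn, where k-means++ seeding is the default; the blob data and range of $k$ values are illustrative, and the printed inertia supports an elbow-style check.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 4 "true" groups
X, _ = make_blobs(n_samples=600, centers=4, random_state=0)

# Try several k values; inertia (elbow method) and silhouette both guide the choice
for k in range(2, 7):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit(X)
    print(f"k={k}  inertia={km.inertia_:.1f}  "
          f"silhouette={silhouette_score(X, km.labels_):.3f}")
```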

Principal Component Analysis (PCA)

  • Reduces dimensionality by projecting data onto principal components—the directions of maximum variance in the feature space
  • Preserves the most important information while eliminating redundant or noisy features, making subsequent analysis faster and more effective
  • Assumes linear relationships among features; non-linear alternatives like t-SNE or UMAP may capture complex structures better

Compare: K-Means vs. PCA—both are unsupervised, but they solve different problems. K-Means groups similar observations together (clustering), while PCA reduces the number of features while preserving variance (dimensionality reduction). Often used together: apply PCA first to reduce dimensions, then K-Means to cluster the simplified data.
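
A minimal sketch of the PCA-then-cluster pipeline suggested above, assuming scikit-learn; the choice of 2 components and 3 clusters is an illustrative assumption for the iris data, not a recommendation.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)       # PCA assumes comparable scales

# Step 1: project onto the directions of maximum variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)
print("variance explained:", pca.explained_variance_ratio_)

# Step 2: cluster in the reduced space
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_reduced)
print("cluster sizes:", [int((labels == c).sum()) for c in range(3)])
```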


Quick Reference Table

| Concept | Best Examples |
| --- | --- |
| Predicting continuous values | Linear Regression |
| Binary/multiclass classification | Logistic Regression, SVM, Naive Bayes, KNN |
| Ensemble methods | Random Forests |
| High interpretability | Decision Trees, Linear Regression, Naive Bayes |
| High-dimensional data | SVM, PCA, Neural Networks |
| Text classification | Naive Bayes, SVM |
| Image/speech recognition | Neural Networks |
| Unsupervised clustering | K-Means |
| Dimensionality reduction | PCA |
| Instance-based learning | KNN |

Self-Check Questions

  1. Which two models both assume linear relationships but solve fundamentally different types of problems (regression vs. classification)?

  2. You have a dataset with 10,000 features but only 500 samples. Which two models from this guide would be strong candidates, and why?

  3. Compare and contrast Decision Trees and Random Forests: what does Random Forests gain by combining multiple trees, and what does it sacrifice?

  4. A data scientist needs to segment customers into distinct groups without any predefined labels. Which model should they use, and what key parameter must they specify before running it?

  5. If you needed to explain every prediction to a regulatory compliance team in plain language, which models would you prioritize and which would you avoid? Justify your reasoning based on interpretability trade-offs.