
📊 Big Data Analytics and Visualization

Popular Machine Learning Models


Why This Matters

Machine learning models are the engines that power modern data science, and understanding them is essential for tackling any big data challenge. You're not just being tested on what each model does—you're being tested on when to use which model, why certain models outperform others in specific scenarios, and how different algorithms approach the same fundamental problem. The models in this guide represent distinct strategies for extracting patterns from data: some predict continuous values, others classify categories, some require labeled training data, and others discover hidden structures on their own.

The key to mastering this content is recognizing that every model makes trade-offs between interpretability and complexity, speed and accuracy, and flexibility and overfitting risk. When you encounter a problem on an exam or in practice, you need to match the model to the data characteristics and business requirements. Don't just memorize algorithms—know what type of problem each one solves and what assumptions it makes about your data.


Regression Models: Predicting Continuous Values

These models estimate numerical outcomes by learning relationships between input features and a target variable. They work by fitting mathematical functions to training data, then using those functions to predict new values.

Linear Regression

  • Models relationships using a linear equation $y = mx + b$—the simplest approach to predicting continuous outcomes like sales revenue or temperature
  • Assumes linearity between variables, which means performance degrades significantly when the true relationship is curved or complex
  • Highly sensitive to outliers—a single extreme data point can dramatically skew your prediction line, making data cleaning critical

Logistic Regression

  • Predicts probability of categorical outcomes using the logistic function to output values between 0 and 1—despite the name, it's a classification model
  • Assumes linear relationship with log-odds, expressed as $\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x$, where $p$ is the probability of the positive class
  • Extends to multiclass problems through one-vs-all or softmax approaches, making it versatile for real-world classification tasks

Compare: Linear Regression vs. Logistic Regression—both assume linear relationships and use similar mathematical foundations, but linear regression predicts continuous values while logistic regression predicts class probabilities. If an exam question asks about predicting "whether" something happens, think logistic; if it asks "how much," think linear.
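
To make the "how much" vs. "whether" distinction concrete, here is a minimal sketch, assuming scikit-learn and NumPy are available; the synthetic data and coefficient values are invented purely for illustration, not taken from this guide.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))                      # 200 samples, 3 features

# "How much": a continuous target -> linear regression
y_amount = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=200)
lin = LinearRegression().fit(X, y_amount)
print("learned slopes (the m's in y = mx + b):", lin.coef_)

# "Whether": a binary target -> logistic regression
y_happened = (X[:, 0] + X[:, 2] > 0).astype(int)
logit = LogisticRegression().fit(X, y_happened)
print("P(class = 1) for first sample:", logit.predict_proba(X[:1])[0, 1])
```

Note that both models fit linear coefficients; the logistic model simply pushes the linear combination through the logistic function to produce a probability.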


Tree-Based Models: Learning Through Decisions

Tree-based algorithms split data into increasingly specific subgroups based on feature values. They mimic human decision-making by asking a series of yes/no questions, making them highly interpretable.

Decision Trees

  • Creates flowchart-like structures that split data at each node based on feature thresholds—easy to visualize and explain to non-technical stakeholders
  • Handles both numerical and categorical data without requiring extensive preprocessing or normalization
  • Prone to overfitting when trees grow too deep; pruning techniques and depth limits help control model complexity

Random Forests

  • Combines multiple decision trees through ensemble learning—each tree votes on the prediction, reducing variance and improving accuracy
  • Trains each tree on random data subsets (bootstrap aggregating or bagging), which prevents any single tree from memorizing noise
  • Provides feature importance scores automatically, helping you identify which variables drive predictions most strongly

Compare: Decision Trees vs. Random Forests—both use the same splitting logic, but a single decision tree is interpretable and fast while Random Forests sacrifice some interpretability for dramatically better generalization. Use decision trees when you need to explain every prediction; use Random Forests when accuracy matters most.
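
A quick sketch of the single-tree vs. ensemble trade-off, assuming scikit-learn and its built-in iris dataset; the depth limit and number of trees are illustrative choices, not tuned values.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# One interpretable tree, with a depth limit to curb overfitting
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
print("tree accuracy:  ", cross_val_score(tree, X, y, cv=5).mean())

# A bagged ensemble of trees: lower variance, less interpretable
forest = RandomForestClassifier(n_estimators=200, random_state=0)
print("forest accuracy:", cross_val_score(forest, X, y, cv=5).mean())

# Feature importance scores come for free with the forest
forest.fit(X, y)
print("feature importances:", forest.feature_importances_)
```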


Distance and Boundary Models: Separating Classes in Space

These algorithms classify data by measuring distances or finding optimal boundaries in feature space. They treat each data point as a location in multi-dimensional space and make predictions based on geometric relationships.

Support Vector Machines (SVM)

  • Finds the optimal hyperplane that maximizes the margin between classes—the decision boundary sits as far as possible from both groups
  • Uses kernel functions to handle non-linear relationships by transforming data into higher dimensions where linear separation becomes possible
  • Excels in high-dimensional spaces where the number of features exceeds the number of samples, common in text classification and genomics

K-Nearest Neighbors (KNN)

  • Classifies based on majority vote of the $k$ closest training examples—a non-parametric approach that makes no assumptions about data distribution
  • Simple to implement and understand, requiring no training phase since it stores all data and computes distances at prediction time
  • Suffers from the curse of dimensionality—as features increase, distance metrics become less meaningful and performance degrades

Compare: SVM vs. KNN—both operate in feature space, but SVM learns a fixed decision boundary during training while KNN computes classifications on-the-fly using stored examples. SVM scales better to high dimensions; KNN is simpler but requires careful tuning of $k$ and distance metrics.
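
A minimal sketch contrasting the two, assuming scikit-learn; the synthetic dataset, kernel, and hyperparameter values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Both methods are margin/distance based, so feature scaling matters
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# SVM: learns a fixed boundary during fit(); the RBF kernel handles non-linearity
svm = SVC(kernel="rbf", C=1.0).fit(X_train, y_train)
print("SVM accuracy:", svm.score(X_test, y_test))

# KNN: "training" just stores the data; distances are computed at predict time
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("KNN accuracy:", knn.score(X_test, y_test))
```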


Probabilistic Models: Learning from Likelihood

Probabilistic algorithms use statistical principles to calculate the likelihood of outcomes. They apply probability theory—particularly Bayes' theorem—to make predictions based on observed evidence.

Naive Bayes

  • Applies Bayes' theorem $P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$ to calculate class probabilities given feature values
  • Assumes feature independence, meaning each predictor contributes independently to the outcome—a "naive" assumption that rarely holds but often works surprisingly well
  • Excels at text classification tasks like spam detection and sentiment analysis, where word frequencies serve as features
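
A compact sketch of the spam-style text setup described above, assuming scikit-learn; the tiny labeled corpus is invented for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy labeled corpus: 1 = spam, 0 = not spam (purely illustrative)
texts = ["win a free prize now", "meeting moved to friday",
         "free cash offer click now", "lunch with the project team"]
labels = [1, 0, 1, 0]

# Word counts as features, then Bayes' theorem with the independence assumption
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["claim your free offer"]))       # likely [1] (spam)
print(model.predict_proba(["see you at the meeting"]))
```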

Neural Networks: Learning Complex Patterns

Neural networks use interconnected layers of artificial neurons to model highly complex, non-linear relationships. They learn hierarchical representations of data through backpropagation, adjusting connection weights to minimize prediction error.

Neural Networks

  • Organized in layers of neurons—input layer receives features, hidden layers extract patterns, output layer produces predictions
  • Particularly effective for unstructured data like images, audio, and text, where traditional feature engineering struggles to capture relevant patterns
  • Requires substantial data and compute power for training; techniques like dropout and regularization help prevent overfitting on smaller datasets
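
As a small-scale illustration, here is a sketch using scikit-learn's MLPClassifier on its built-in digit images; the layer sizes and L2 penalty (alpha) are illustrative assumptions, and dropout itself lives in deep learning frameworks such as TensorFlow or PyTorch rather than in this class.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# 8x8 digit images flattened to 64 input features, 10 output classes
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Input layer (64 features) -> two hidden layers -> output layer (10 classes)
net = MLPClassifier(hidden_layer_sizes=(64, 32), alpha=1e-4,  # alpha = L2 penalty
                    max_iter=500, random_state=0)
net.fit(X_train, y_train)
print("test accuracy:", net.score(X_test, y_test))
```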

Compare: Naive Bayes vs. Neural Networks—both can tackle classification, but Naive Bayes is fast, interpretable, and works well with limited data, while Neural Networks require more resources but capture complex non-linear patterns. For text classification with clear features, try Naive Bayes first; for image recognition, neural networks are typically necessary.


Unsupervised Learning: Finding Hidden Structure

These algorithms discover patterns in data without labeled outcomes. They identify natural groupings or reduce complexity by finding the most informative representations of the data.

K-Means Clustering

  • Partitions data into $k$ clusters by iteratively assigning points to the nearest centroid and updating centroid positions until convergence
  • Requires specifying $k$ in advance—techniques like the elbow method or silhouette scores help determine the optimal number of clusters
  • Sensitive to initial centroid placement; smarter seeding (k-means++) and multiple random restarts improve consistency and result quality
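
A short sketch of the fit-and-choose-$k$ workflow, assuming scikit-learn, where k-means++ seeding is the default; the blob data and range of $k$ values are illustrative, and the printed inertia supports an elbow-style check.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 4 "true" groups
X, _ = make_blobs(n_samples=600, centers=4, random_state=0)

# Try several k values; inertia (elbow method) and silhouette both guide the choice
for k in range(2, 7):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit(X)
    print(f"k={k}  inertia={km.inertia_:.1f}  "
          f"silhouette={silhouette_score(X, km.labels_):.3f}")
```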

Principal Component Analysis (PCA)

  • Reduces dimensionality by projecting data onto principal components—the directions of maximum variance in the feature space
  • Preserves the most important information while eliminating redundant or noisy features, making subsequent analysis faster and more effective
  • Assumes linear relationships among features; non-linear alternatives like t-SNE or UMAP may capture complex structures better

Compare: K-Means vs. PCA—both are unsupervised, but they solve different problems. K-Means groups similar observations together (clustering), while PCA reduces the number of features while preserving variance (dimensionality reduction). Often used together: apply PCA first to reduce dimensions, then K-Means to cluster the simplified data.
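
A minimal sketch of the PCA-then-cluster pipeline suggested above, assuming scikit-learn; the choice of 2 components and 3 clusters is an illustrative assumption for the iris data, not a recommendation.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)       # PCA assumes comparable scales

# Step 1: project onto the directions of maximum variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)
print("variance explained:", pca.explained_variance_ratio_)

# Step 2: cluster in the reduced space
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_reduced)
print("cluster sizes:", [int((labels == c).sum()) for c in range(3)])
```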


Quick Reference Table

| Concept | Best Examples |
| --- | --- |
| Predicting continuous values | Linear Regression |
| Binary/multiclass classification | Logistic Regression, SVM, Naive Bayes, KNN |
| Ensemble methods | Random Forests |
| High interpretability | Decision Trees, Linear Regression, Naive Bayes |
| High-dimensional data | SVM, PCA, Neural Networks |
| Text classification | Naive Bayes, SVM |
| Image/speech recognition | Neural Networks |
| Unsupervised clustering | K-Means |
| Dimensionality reduction | PCA |
| Instance-based learning | KNN |

Self-Check Questions

  1. Which two models both assume linear relationships but solve fundamentally different types of problems (regression vs. classification)?

  2. You have a dataset with 10,000 features but only 500 samples. Which two models from this guide would be strong candidates, and why?

  3. Compare and contrast Decision Trees and Random Forests: what does Random Forests gain by combining multiple trees, and what does it sacrifice?

  4. A data scientist needs to segment customers into distinct groups without any predefined labels. Which model should they use, and what key parameter must they specify before running it?

  5. If you needed to explain every prediction to a regulatory compliance team in plain language, which models would you prioritize and which would you avoid? Justify your reasoning based on interpretability trade-offs.