⛽️Business Analytics

Key Predictive Analytics Models

Why This Matters

Predictive analytics sits at the heart of modern business decision-making, and your exams will test whether you understand when to apply each model—not just what each model does. The models in this guide represent the core toolkit for forecasting sales, classifying customers, detecting fraud, and optimizing operations. You'll encounter questions that ask you to select the right model for a given business scenario, interpret model outputs, and explain trade-offs between accuracy, interpretability, and computational cost.

These models fall into distinct categories based on their underlying mechanisms: regression for continuous outcomes, classification for categorical predictions, ensemble methods for improved accuracy, and unsupervised learning for pattern discovery. Don't just memorize definitions—know what problem type each model solves, what assumptions it makes, and when it outperforms alternatives. That conceptual understanding is what separates strong exam performance from mediocre recall.

Regression Models: Predicting Continuous & Categorical Outcomes

Regression models form the foundation of predictive analytics by establishing mathematical relationships between input variables and outcomes. The key distinction is whether you're predicting a number (linear regression) or a category (logistic regression).

Linear Regression

Models continuous outcomes using a linear equation—predicts values like sales revenue, prices, or demand quantities based on one or more independent variables
Coefficient interpretation tells you the direction and size of each predictor's effect; a coefficient of 2.5 means the predicted outcome increases by 2.5 units for each one-unit increase in that variable, holding the other predictors constant
Assumes linearity and is sensitive to outliers—always check residual plots and consider transformations when relationships aren't truly linear
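
To make the coefficient interpretation concrete, here is a minimal sketch of a one-predictor least-squares fit in plain Python. The data (ad spend versus sales) and the function name `fit_simple_linear_regression` are hypothetical, and the numbers were chosen to lie exactly on a line so the result is easy to check by hand.

```python
def fit_simple_linear_regression(x, y):
    """Ordinary least squares for one predictor: y ~ intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = covariance(x, y) / variance(x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: ad spend (in $1,000s) and sales (in units)
ad_spend = [1, 2, 3, 4, 5]
sales = [12.5, 15.0, 17.5, 20.0, 22.5]

intercept, slope = fit_simple_linear_regression(ad_spend, sales)

# Residuals (actual minus predicted) are what you would plot to check linearity
residuals = [y - (intercept + slope * x) for x, y in zip(ad_spend, sales)]
```

With this made-up data the slope comes out to 2.5 and the intercept to 10, so each extra $1,000 of ad spend is associated with 2.5 more units sold. Real data would leave non-zero residuals, and a curved pattern in them would suggest the linearity assumption is violated.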

Logistic Regression

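Classifies binary outcomes (yes/no, churn/retain) by outputting a probability between 0 and 1 using the logistic function $P(Y=1) = \frac{1}{1 + e^{-z}}$, where $z = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$ is a linear combination of the predictors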
Probability threshold determines classification—typically 0.5, but you can adjust based on business costs of false positives vs. false negatives
Widely used for customer churn prediction and marketing response modeling—interpretable coefficients make it easy to explain which factors drive outcomes
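
The logistic function and the effect of moving the threshold can be sketched in a few lines. The coefficients `b0` and `b1` and the `months_inactive` predictor below are invented for illustration, not estimates from any real churn model.

```python
import math


def logistic(z):
    """Map any real-valued score z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def classify(probability, threshold=0.5):
    """Return 1 (e.g. churn) when the probability reaches the threshold, else 0."""
    return 1 if probability >= threshold else 0


# Hypothetical fitted model: z = b0 + b1 * months_inactive
b0, b1 = -2.0, 0.8
months_inactive = 3

p_churn = logistic(b0 + b1 * months_inactive)

label_default = classify(p_churn)                # default threshold of 0.5
label_strict = classify(p_churn, threshold=0.7)  # stricter cut-off for flagging churn
```

Here the predicted churn probability is roughly 0.60, so the customer is flagged at the default 0.5 threshold but not at 0.7. Raising the threshold like this trades fewer false positives for more false negatives.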

Compare: Linear Regression vs. Logistic Regression. Both model relationships between variables, but linear regression predicts continuous values while logistic regression predicts category probabilities. If an exam question involves predicting "how much," use linear; if it's "which group," use logistic.

Tree-Based Models: Intuitive Decision Logic

Tree-based models split data into branches based on feature values, creating rule-based predictions that mirror human decision-making. Their visual, flowchart structure makes them exceptionally interpretable for business stakeholders.

Decision Trees

Splits data hierarchically based on feature thresholds—creates if-then rules like "if income > $50K AND age < 35, then high purchase probability"
Handles both categorical and continuous variables without requiring extensive preprocessing or normalization
Prone to overfitting but can be pruned to improve generalization; watch for trees that memorize training data rather than learning patterns
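
As a rough illustration of how a tree chooses a split, the sketch below scans candidate thresholds on a single feature and keeps the one with the lowest weighted Gini impurity. Gini impurity is one common splitting criterion, used here as an assumption since the guide does not name one; the income data is hypothetical.

```python
def gini(labels):
    """Gini impurity for a list of 0/1 labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2


def best_split(values, labels):
    """Try midpoints between sorted unique values; return (threshold, impurity)."""
    best_threshold, best_score = None, float("inf")
    sorted_vals = sorted(set(values))
    for lo, hi in zip(sorted_vals, sorted_vals[1:]):
        threshold = (lo + hi) / 2
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        # Weighted average impurity of the two child nodes
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


# Hypothetical data: income in $1,000s and whether the customer purchased
income_k = [30, 42, 48, 55, 63, 80]
purchased = [0, 0, 0, 1, 1, 1]

threshold, impurity = best_split(income_k, purchased)
```

On this toy data the best split is at 51.5 with zero impurity, which reads as the rule "if income > $51.5K, predict purchase." A full tree repeats this search inside each branch, which is also how it can end up memorizing noise if left unpruned.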

Random Forests

Ensemble method combining many decision trees (often hundreds): each tree makes its own prediction, and the majority vote (for classification) or the average (for regression) determines the final output
Reduces overfitting through averaging and introduces randomness by training each tree on a bootstrap sample with random feature subsets
Provides feature importance scores—critical for identifying which variables drive predictions in credit scoring, fraud detection, and customer analytics
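
The two sources of randomness, bootstrap samples and random feature choice, can be sketched with very simple one-split trees ("stumps"). This is a simplified illustration rather than a full random forest: each stump uses a single randomly chosen feature and splits at the midpoint between the class means, and the customer data, function names, and `seed` parameter are all hypothetical.

```python
import random
from collections import Counter


def train_stump(rows, labels, rng):
    """Fit a one-split 'tree' on a single randomly chosen feature."""
    feature = rng.randrange(len(rows[0]))  # random feature subset of size 1
    by_class = {0: [], 1: []}
    for row, label in zip(rows, labels):
        by_class[label].append(row[feature])
    if not by_class[0] or not by_class[1]:
        # The bootstrap sample held one class only: always predict that class
        only = 0 if by_class[0] else 1
        return lambda row: only
    mean0 = sum(by_class[0]) / len(by_class[0])
    mean1 = sum(by_class[1]) / len(by_class[1])
    threshold = (mean0 + mean1) / 2
    high_class = 1 if mean1 > mean0 else 0
    return lambda row: high_class if row[feature] > threshold else 1 - high_class


def train_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bootstrap sample: draw row indices with replacement
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(train_stump([rows[i] for i in idx], [labels[i] for i in idx], rng))
    return forest


def predict(forest, row):
    """Majority vote across all trees."""
    votes = Counter(tree(row) for tree in forest)
    return votes.most_common(1)[0][0]


# Hypothetical customers: (income in $1,000s, store visits per month) -> purchased?
rows = [(30, 1), (42, 2), (48, 1), (55, 5), (63, 4), (80, 6)]
labels = [0, 0, 0, 1, 1, 1]

forest = train_forest(rows, labels)
high_value_prediction = predict(forest, (90, 7))
low_value_prediction = predict(forest, (25, 0))
```

Because every stump sees a slightly different sample and feature, individual trees disagree near the class boundary, but the vote is stable for clear-cut cases like the two queries above.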

Compare: Decision Trees vs. Random Forests—single trees are highly interpretable but overfit easily; random forests sacrifice some interpretability for significantly better accuracy and robustness. Choose decision trees when you need to explain every rule; choose random forests when prediction accuracy matters most.

Advanced Classification Models: Handling Complex Boundaries

When simple linear boundaries won't separate your classes, these models find more sophisticated decision boundaries. They excel when relationships between features and outcomes are non-linear or when you're working in high-dimensional spaces.

Support Vector Machines (SVM)

Finds the optimal hyperplane that maximizes the margin between classes—the "street" between data points should be as wide as possible
Kernel functions handle non-linear relationships—the kernel trick maps data into higher dimensions where linear separation becomes possible
Robust in high-dimensional spaces—performs well even when features outnumber observations, common in text classification and genomics
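
The kernel trick can be checked numerically. The sketch below uses a degree-2 polynomial kernel, chosen here only as a convenient example, and shows that evaluating it in the original 2-D space gives the same number as an explicit dot product in a 3-D mapped space. The points `a` and `b` are arbitrary.

```python
import math


def dot(u, v):
    return sum(p * q for p, q in zip(u, v))


def poly_kernel(x, y):
    """Degree-2 polynomial kernel K(x, y) = (x . y)^2, computed in the original 2-D space."""
    return dot(x, y) ** 2


def feature_map(x):
    """Explicit 3-D mapping whose ordinary dot product equals poly_kernel."""
    x1, x2 = x
    return (x1 ** 2, math.sqrt(2) * x1 * x2, x2 ** 2)


a, b = (1.0, 2.0), (3.0, -1.0)

kernel_value = poly_kernel(a, b)                        # never leaves 2-D
explicit_value = dot(feature_map(a), feature_map(b))    # same quantity computed in 3-D
```

Both values equal 1 (up to floating-point rounding). The practical point is that an SVM only needs these pairwise kernel values, so it can work in the higher-dimensional space without ever constructing the mapped features.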

K-Nearest Neighbors (KNN)

Instance-based learning that classifies new points based on the majority class of their $k$ closest neighbors in feature space
No explicit training phase: the model simply stores all the data and computes distances at prediction time, which makes prediction computationally expensive for large datasets
Sensitive to the choice of $k$ and distance metric: small $k$ values overfit to noise; large $k$ values oversmooth; Euclidean distance weights all features equally, so features on larger scales dominate unless you standardize them first
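
A minimal KNN classifier fits in a few lines, which also makes the effect of $k$ easy to see. The points and labels below are hypothetical; the single "churn" point sitting near the "retain" group plays the role of noise.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)  # Euclidean distance
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical customers in a 2-D feature space
train_points = [(1, 1), (1, 2), (2, 1), (2.5, 2.5), (6, 6), (7, 6)]
train_labels = ["retain", "retain", "retain", "churn", "churn", "churn"]

query = (2.4, 2.4)
prediction_k1 = knn_predict(train_points, train_labels, query, k=1)
prediction_k3 = knn_predict(train_points, train_labels, query, k=3)
```

With `k=1` the query copies its single nearest neighbor, the noisy "churn" point; with `k=3` the two nearby "retain" points outvote it. Note that all the work happens inside `knn_predict`, which is why prediction cost grows with the size of the stored data.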

Naive Bayes

Probabilistic classifier using Bayes' theorem—calculates $P(\text{class}|\text{features})$ by assuming all features are independent given the class
Surprisingly effective despite the "naive" independence assumption—fast training and prediction make it ideal for real-time applications
Excels at text classification including spam detection, sentiment analysis, and document categorization where high-dimensional sparse data is common
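
The sketch below shows the independence assumption at work in a tiny word-count spam filter: each word contributes its own factor to the class score, and the class with the highest score wins. The four documents are made up, and the add-one (Laplace) smoothing is an assumption added so that unseen words do not zero out a class; the guide itself does not discuss smoothing.

```python
import math
from collections import Counter


def train_naive_bayes(documents, labels):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = {c: Counter() for c in class_counts}
    for words, label in zip(documents, labels):
        word_counts[label].update(words)
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab


def predict(model, words):
    """Return the class with the highest log P(class) + sum of log P(word | class)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for c in class_counts:
        log_score = math.log(class_counts[c] / total_docs)  # log prior
        total_words = sum(word_counts[c].values())
        for w in words:
            # Add-one smoothing (an assumption for this sketch)
            log_score += math.log((word_counts[c][w] + 1) / (total_words + len(vocab)))
        scores[c] = log_score
    return max(scores, key=scores.get)


# Hypothetical training messages, already split into words
documents = [
    ["free", "prize", "now"],
    ["free", "offer"],
    ["meeting", "tomorrow"],
    ["project", "offer", "meeting"],
]
labels = ["spam", "spam", "ham", "ham"]

model = train_naive_bayes(documents, labels)
prediction_promo = predict(model, ["free", "prize"])
prediction_work = predict(model, ["meeting", "project"])
```

Training is just counting, and prediction is a handful of additions per word, which is why the method stays fast even with very large vocabularies.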

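Compare: SVM vs. KNN. Both handle non-linear classification, but SVM learns a decision boundary during training while KNN makes decisions at prediction time. A trained SVM predicts quickly because only the support vectors matter, although training a kernel SVM can itself become slow on very large datasets; KNN requires no training, but its predictions slow down as the stored data grows.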

Neural Networks: Learning Complex Patterns

Neural networks use layers of interconnected nodes to learn hierarchical representations of data. Their power comes from automatically discovering relevant features rather than requiring manual feature engineering.

Neural Networks

Composed of layers of neurons with weighted connections—input layer receives features, hidden layers transform them, output layer produces predictions
Captures highly non-linear relationships through activation functions and deep architectures; excels when you have massive datasets and complex patterns
Requires significant computational resources and data—prone to overfitting on small datasets; acts as a "black box" with limited interpretability
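
To show how a hidden layer and an activation function produce a non-linear result, here is a forward pass through a tiny network with hand-picked weights that reproduces XOR, a pattern no single linear boundary can separate. The weights are set by hand for illustration; a real network would learn them from data, and ReLU is just one possible activation function.

```python
def relu(z):
    """Rectified linear unit: passes positive values, zeroes out negatives."""
    return max(0.0, z)


def forward(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """Input layer -> one hidden layer with ReLU -> single linear output."""
    hidden = [
        relu(sum(w * xi for w, xi in zip(weights, x)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return sum(w * h for w, h in zip(output_weights, hidden)) + output_bias


# Hand-picked (not learned) parameters for a 2-2-1 network
hidden_weights = [[1.0, 1.0], [1.0, 1.0]]
hidden_biases = [0.0, -1.0]
output_weights = [1.0, -2.0]
output_bias = 0.0

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = {
    x: forward(x, hidden_weights, hidden_biases, output_weights, output_bias)
    for x in inputs
}
```

The outputs are 0, 1, 1, 0 for the four inputs in order. Even in this tiny case, the individual weights do not explain themselves, which hints at why larger networks are treated as black boxes.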

Compare: Neural Networks vs. Random Forests—both handle complex non-linear relationships, but random forests provide feature importance and work well with smaller datasets, while neural networks require more data but can learn more sophisticated patterns. For business applications requiring explainability, lean toward random forests.

Time-Dependent Models: Forecasting Sequential Data

Time series models specifically handle data where observations are ordered chronologically and past values influence future outcomes. The key insight is that time introduces dependencies—today's value relates to yesterday's.

Time Series Analysis

Analyzes data collected at regular time intervals—techniques like ARIMA, exponential smoothing, and seasonal decomposition capture trends, seasonality, and cycles
Decomposes patterns into components: trend (long-term direction), seasonality (repeating patterns), and residual (random noise)
Essential for demand forecasting, stock prediction, and inventory management—businesses use these models to anticipate future values based on historical patterns
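
Of the techniques named above, simple exponential smoothing is the easiest to sketch: each smoothed value blends the newest observation with the previous smoothed value. The sales figures and the choice of `alpha=0.5` are hypothetical, and this basic form does not model trend or seasonality.

```python
def exponential_smoothing(series, alpha):
    """Simple exponential smoothing; alpha near 1 reacts faster to recent values."""
    smoothed = [series[0]]  # start from the first observation
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


# Hypothetical monthly sales
monthly_sales = [100, 110, 105, 120]

smoothed = exponential_smoothing(monthly_sales, alpha=0.5)

# In this simple form, the one-step-ahead forecast is the last smoothed value
forecast_next = smoothed[-1]
```

Here the smoothed series is 100, 105, 105, 112.5, so the forecast for the next month is 112.5. Data with a clear trend or seasonal cycle would call for extensions such as Holt-Winters smoothing or ARIMA.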

Unsupervised Learning: Discovering Hidden Structure

Unlike supervised models that predict known outcomes, unsupervised models find patterns in data without labeled examples. They're exploratory tools that reveal structure you didn't know existed.

Clustering Algorithms

Groups similar data points without predefined labels—algorithms like K-means, hierarchical clustering, and DBSCAN identify natural groupings based on feature similarity
K-means partitions data into $k$ clusters by minimizing within-cluster variance; you must specify $k$ in advance, which requires domain knowledge or techniques like the elbow method
Drives market segmentation and customer profiling—reveals distinct customer groups that can be targeted with tailored marketing strategies
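
The K-means loop alternates between two steps: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. The sketch below does this in one dimension with fixed starting centroids so the result is reproducible; the spend figures, starting centroids, and function name are hypothetical.

```python
def kmeans_1d(values, centroids, n_iterations=10):
    """Basic K-means on 1-D data; k is the number of starting centroids."""
    centroids = list(centroids)
    for _ in range(n_iterations):
        # Assignment step: each value joins the cluster of its nearest centroid
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to its cluster mean
        # (an empty cluster keeps its old centroid)
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical annual spend per customer, in dollars
annual_spend = [20, 25, 30, 200, 210, 220]

centroids, clusters = kmeans_1d(annual_spend, centroids=[0, 100])
```

With `k = 2` the algorithm settles on centroids of 25 and 210, separating low spenders from high spenders. Running it again with a different number of starting centroids and comparing within-cluster variance is the idea behind the elbow method.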

Compare: Clustering vs. Classification—classification assigns data to known categories using labeled training data; clustering discovers unknown groupings without labels. Use classification when you know what groups exist; use clustering when you're exploring what groups might exist.

Quick Reference Table

Concept	Best Examples
Predicting continuous values	Linear Regression, Neural Networks, Time Series
Binary classification	Logistic Regression, SVM, Naive Bayes
Multi-class classification	Decision Trees, Random Forests, KNN, Neural Networks
High interpretability	Linear Regression, Logistic Regression, Decision Trees
Handling non-linear relationships	SVM (with kernels), Neural Networks, Random Forests
Large dataset performance	Random Forests, Neural Networks, Naive Bayes
Discovering hidden patterns	Clustering Algorithms
Sequential/temporal data	Time Series Analysis

Self-Check Questions

A marketing team wants to predict which customers will respond to a campaign (yes/no). Which two models would be most appropriate, and what trade-off exists between them in terms of interpretability?
Compare and contrast Random Forests and Decision Trees: what problem does Random Forests solve that single Decision Trees struggle with, and what do you sacrifice by using the ensemble approach?
You have a dataset with 10,000 features but only 500 observations. Which model from this guide is best suited to this high-dimensional scenario, and why does it work well here?
A retail company needs to identify distinct customer segments for targeted marketing but has no predefined categories. Which type of model should they use, and how does this differ from classification?
An FRQ asks you to recommend a model for predicting next quarter's sales based on five years of monthly data. Which model category is designed for this problem, and what three components would you expect to identify in the data?
