
🔮Forecasting

Key Concepts in Machine Learning Algorithms for Prediction to Know


Why This Matters

Machine learning algorithms are the backbone of modern forecasting—whether you're predicting stock prices, classifying customer behavior, or modeling time-dependent phenomena. You're being tested not just on what these algorithms do, but on when to apply them, why they work, and how they compare to one another. The exam expects you to understand the trade-offs: bias vs. variance, interpretability vs. accuracy, computational cost vs. predictive power.

Don't just memorize algorithm names and definitions. Know what problem type each algorithm solves (regression vs. classification vs. time series), understand the underlying mathematical intuition, and recognize when one approach outperforms another. If you can explain why random forests reduce overfitting compared to single decision trees, or when logistic regression beats a neural network, you're thinking like a data scientist—and that's exactly what the exam rewards.


Regression-Based Methods

These algorithms model continuous relationships between variables. They assume some functional form connecting inputs to outputs and estimate parameters to minimize prediction error.

Linear Regression

  • Models continuous outcomes using a linear equation—the relationship $y = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n + \epsilon$ assumes constant, additive effects
  • Coefficients are directly interpretable as the change in $y$ for a one-unit change in each predictor, making it ideal for explanatory analysis
  • Highly sensitive to outliers because it minimizes squared error, which disproportionately weights extreme values
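
A minimal scikit-learn sketch, with synthetic data assumed purely for illustration, showing how the fitted intercept and coefficients correspond to $\beta_0, \beta_1, \dots$ in the equation above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated from y = 2 + 3*x1 - 1*x2 + noise (assumed for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 2 + 3 * X[:, 0] - 1 * X[:, 1] + rng.normal(scale=0.5, size=200)

model = LinearRegression().fit(X, y)
print(model.intercept_)  # estimate of beta_0
print(model.coef_)       # estimates of beta_1, beta_2: change in y per one-unit change in each x
```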

Logistic Regression

  • Predicts probability of binary outcomes using the logistic function $\frac{1}{1 + e^{-z}}$, where $z$ is the linear combination of predictors, to constrain outputs between 0 and 1
  • Outputs probabilities, not classes—you choose a threshold (typically 0.5) to convert predictions into categorical labels
  • Extends to multiclass problems through one-vs-all or softmax approaches, making it more versatile than the name suggests
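
A short sketch (synthetic binary-outcome data assumed) that makes the probability-then-threshold step explicit:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary outcome driven by a noisy linear score (assumed for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X[:5])[:, 1]  # P(y = 1) for the first five observations
labels = (proba >= 0.5).astype(int)     # apply the 0.5 threshold explicitly
print(proba)
print(labels)
```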

Compare: Linear Regression vs. Logistic Regression—both estimate coefficients for predictor variables, but linear regression predicts continuous values while logistic regression predicts probabilities for categorical outcomes. If an FRQ asks about predicting "whether" something happens, logistic is your answer; if it asks "how much," think linear.


Tree-Based Methods

These algorithms partition the feature space into regions using sequential decision rules. They're intuitive because they mirror human decision-making: "If X, then Y."

Decision Trees

  • Splits data using feature thresholds in a flowchart structure, making predictions based on which terminal node (leaf) a data point reaches
  • Highly interpretable and visual—you can trace exactly why the model made a specific prediction, which matters for stakeholder communication
  • Prone to overfitting when trees grow too deep; pruning (limiting depth or minimum samples per leaf) controls this tendency
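
A brief sketch (synthetic classification data assumed) contrasting an unconstrained tree with one pruned by depth and leaf-size limits:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Unconstrained tree: grows until leaves are pure, so it tends to overfit the training split
deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Pruned tree: max_depth and min_samples_leaf limit how finely it can partition the data
pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=20, random_state=0).fit(X_tr, y_tr)

print("deep:", deep.score(X_te, y_te), "pruned:", pruned.score(X_te, y_te))
```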

Random Forests

  • Combines hundreds of decision trees trained on random data subsets, averaging predictions to reduce variance and overfitting
  • Introduces randomness at two levels—bootstrap sampling of rows and random selection of features at each split—creating diverse, uncorrelated trees
  • Provides feature importance scores by measuring how much each variable reduces prediction error across all trees
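
A minimal sketch (synthetic data assumed) showing the two sources of randomness as hyperparameters and the resulting feature importance scores:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, n_informative=3, random_state=0)

# bootstrap=True resamples rows per tree; max_features="sqrt" samples features at each split
forest = RandomForestClassifier(n_estimators=200, bootstrap=True, max_features="sqrt",
                                random_state=0).fit(X, y)
print(forest.feature_importances_)  # average impurity reduction attributed to each feature
```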

Gradient Boosting Machines (XGBoost)

  • Builds trees sequentially where each new tree corrects the errors (residuals) of the combined previous trees
  • Dominates structured data competitions because it optimizes a regularized objective function that balances fit and complexity
  • Offers built-in regularization through parameters like learning rate, max depth, and L1/L2 penalties to prevent overfitting
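
A small sketch assuming the xgboost package is installed (its scikit-learn wrapper is used here, on synthetic data); the hyperparameters shown are the standard regularization knobs mentioned above:

```python
from sklearn.datasets import make_classification
from xgboost import XGBClassifier  # assumes the xgboost package is available

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Trees are added sequentially; learning_rate shrinks each tree's correction,
# max_depth limits tree complexity, reg_alpha/reg_lambda are L1/L2 penalties
model = XGBClassifier(n_estimators=300, learning_rate=0.05, max_depth=3,
                      reg_alpha=0.1, reg_lambda=1.0)
model.fit(X, y)
print((model.predict(X) == y).mean())  # training accuracy, just to confirm the fit ran
```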

Compare: Random Forests vs. Gradient Boosting—both are ensemble tree methods, but random forests build trees independently (parallel) while gradient boosting builds them sequentially (each correcting prior errors). Gradient boosting typically achieves higher accuracy but requires more careful tuning and is more prone to overfitting.


Distance and Probability-Based Methods

These algorithms classify data points based on their proximity to other points or their likelihood under probabilistic assumptions. They rely on measuring similarity or computing conditional probabilities.

K-Nearest Neighbors (KNN)

  • Classifies by majority vote of K nearest neighbors—a non-parametric method that makes no assumptions about data distribution
  • Computationally expensive at prediction time because it must calculate distances to all training points for each new observation
  • Performance depends heavily on K and distance metric—small K captures noise, large K oversmooths; Euclidean distance assumes equal feature importance
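
A quick sketch (synthetic data assumed) illustrating how the choice of K changes cross-validated accuracy, with features standardized so Euclidean distance treats them comparably:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=5, random_state=0)

# Standardizing first keeps one large-scale feature from dominating the distance metric
for k in (1, 5, 25):
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    print(k, cross_val_score(knn, X, y, cv=5).mean())  # small K is noisy, large K oversmooths
```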

Naive Bayes

  • Applies Bayes' theorem with an independence assumption—calculates $P(\text{class} \mid \text{features}) \propto P(\text{class}) \times \prod_i P(\text{feature}_i \mid \text{class})$
  • Excels at text classification tasks like spam detection because high-dimensional sparse data suits the independence assumption
  • Fast and data-efficient since it only needs to estimate univariate probability distributions, not complex joint distributions
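
A toy text-classification sketch (the four example messages are invented for illustration) showing the word-count plus per-class probability pipeline:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny invented corpus: 1 = spam, 0 = not spam
texts = ["win a free prize now", "limited offer click now",
         "meeting at noon tomorrow", "see you at the meeting"]
labels = [1, 1, 0, 0]

# CountVectorizer builds sparse word counts; MultinomialNB estimates P(word | class) independently
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(texts, labels)
print(clf.predict(["claim your free prize"]))
```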

Compare: KNN vs. Naive Bayes—both are simple, interpretable classifiers, but KNN is instance-based (stores all training data) while Naive Bayes is model-based (stores probability estimates). KNN struggles with high dimensions; Naive Bayes handles them well but assumes feature independence.


Hyperplane and Neural Methods

These algorithms find complex decision boundaries in high-dimensional spaces. They're powerful but often sacrifice interpretability for predictive accuracy.

Support Vector Machines (SVM)

  • Finds the optimal separating hyperplane that maximizes the margin between classes, focusing on the "support vectors" closest to the boundary
  • Uses kernel functions for non-linear problems—the kernel trick maps data to higher dimensions where linear separation becomes possible
  • Sensitive to hyperparameter choices including kernel type (linear, RBF, polynomial) and regularization parameter $C$
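
A compact sketch (synthetic two-moons data assumed) that tunes the two sensitive hyperparameters, kernel and $C$, with a small grid search:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Two interleaved half-moons: not linearly separable, so the RBF kernel should win
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

grid = GridSearchCV(SVC(), {"kernel": ["linear", "rbf"], "C": [0.1, 1, 10]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```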

Neural Networks

  • Layers of interconnected neurons learn hierarchical representations through weighted combinations and non-linear activation functions
  • Requires substantial data and compute—deep networks have millions of parameters and need large datasets to generalize well
  • Architecture choices drive performance—number of layers, neurons per layer, activation functions, and regularization techniques all significantly impact results
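
A minimal sketch using scikit-learn's MLPClassifier (synthetic data assumed); the architecture choices from the bullets above appear directly as hyperparameters:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Two hidden layers (32 and 16 neurons), ReLU activations, L2 regularization via alpha
net = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(32, 16), activation="relu",
                                  alpha=1e-3, max_iter=1000, random_state=0))
net.fit(X_tr, y_tr)
print(net.score(X_te, y_te))
```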

Compare: SVM vs. Neural Networks—both can model non-linear relationships, but SVMs work well with smaller datasets and clearer margins while neural networks excel with massive datasets and complex patterns. SVMs are more interpretable; neural networks are more flexible but act as "black boxes."


Time Series Methods

These algorithms specifically handle temporally ordered data where past values inform future predictions. They decompose patterns into trend, seasonality, and residual components.

Time Series Analysis (ARIMA, SARIMA)

  • ARIMA models combine autoregression, differencing, and moving averages—the parameters $(p, d, q)$ control lags, integration order, and error terms respectively
  • SARIMA extends ARIMA for seasonal patterns by adding seasonal components $(P, D, Q, s)$ where $s$ is the seasonal period
  • Requires stationary data—if trends or changing variance exist, differencing or transformations (log, Box-Cox) must be applied first
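
A short statsmodels sketch, assuming a synthetic monthly series with trend and yearly seasonality, showing where $(p, d, q)$ and $(P, D, Q, s)$ enter; the differencing orders $d$ and $D$ handle the trend and seasonal nonstationarity:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Synthetic monthly series: linear trend + yearly seasonality + noise (assumed for illustration)
rng = np.random.default_rng(0)
idx = pd.date_range("2015-01-01", periods=96, freq="MS")
y = pd.Series(0.5 * np.arange(96) + 10 * np.sin(2 * np.pi * np.arange(96) / 12)
              + rng.normal(scale=2, size=96), index=idx)

# order = (p, d, q); seasonal_order = (P, D, Q, s) with s = 12 for monthly data
model = SARIMAX(y, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12)).fit(disp=False)
print(model.forecast(steps=12))  # forecasts for the next 12 months
```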

Compare: ARIMA vs. Machine Learning approaches—ARIMA explicitly models temporal structure and works well with univariate series, while ML methods (random forests, neural networks) can incorporate multiple features but may miss temporal dependencies without feature engineering. For pure time series forecasting, ARIMA often provides a strong baseline.


Quick Reference Table

Concept | Best Examples
Continuous outcome prediction | Linear Regression, Neural Networks, KNN (regression mode)
Binary/multiclass classification | Logistic Regression, SVM, Naive Bayes, KNN
Ensemble methods (variance reduction) | Random Forests, Gradient Boosting (XGBoost)
High interpretability | Linear Regression, Logistic Regression, Decision Trees
Text and high-dimensional data | Naive Bayes, SVM with kernels
Time-dependent forecasting | ARIMA, SARIMA
Non-linear pattern capture | Neural Networks, SVM (kernel), Decision Trees
Small dataset performance | Naive Bayes, Logistic Regression, SVM

Self-Check Questions

  1. Which two algorithms are both ensemble methods, and what distinguishes how they combine individual models (parallel vs. sequential)?

  2. If you need to predict whether a customer will churn (yes/no) and explain to stakeholders exactly which factors drove the prediction, which algorithm would you choose and why?

  3. Compare and contrast KNN and Naive Bayes: What assumptions does each make, and in what data scenarios would you prefer one over the other?

  4. A dataset shows strong weekly seasonality in sales data. Which algorithm family is specifically designed for this, and what parameter would you adjust to capture the seasonal pattern?

  5. An FRQ asks you to recommend an algorithm for a dataset with 50 features, 500 observations, and a binary outcome. Defend your choice by explaining why high-complexity models like neural networks might underperform here.