📊 Advanced Quantitative Methods

Key Data Mining Algorithms

Why This Matters

Data mining algorithms form the analytical backbone of modern business intelligence, and you're being tested on more than just definitions. Exam questions will probe your understanding of when to apply each algorithm, what assumptions they require, and how they compare in handling different data scenarios. Whether you're predicting customer churn, segmenting markets, or uncovering hidden patterns in transaction data, the algorithm you choose—and why you choose it—demonstrates true quantitative fluency.

The algorithms in this guide cluster around core concepts: supervised vs. unsupervised learning, classification vs. regression, ensemble methods vs. single models, and dimensionality reduction vs. pattern discovery. Don't just memorize what each algorithm does—know what problem type it solves, what assumptions it makes, and when it outperforms alternatives. That's what separates a strong FRQ response from a mediocre one.


Supervised Learning: Classification Algorithms

These algorithms learn from labeled data to predict categorical outcomes. The key mechanism is using known outcomes to train a model that generalizes to new, unseen cases.

Logistic Regression

  • Models probability of binary outcomes—uses the logistic (sigmoid) function to constrain predictions between 0 and 1, making it ideal for yes/no business decisions
  • Interpretable coefficients show the effect of each predictor on log-odds, allowing you to explain why a prediction was made to stakeholders
  • Assumes linearity in log-odds—if the relationship between predictors and outcome is highly non-linear, this algorithm will underperform
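
A minimal sketch of the idea in scikit-learn, assuming a synthetic binary-outcome dataset (the data and default settings below are illustrative, not a prescription):

```python
# Illustrative sketch: logistic regression on synthetic binary-outcome data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression()            # sigmoid link keeps outputs between 0 and 1
model.fit(X_train, y_train)

print(model.coef_)                      # each coefficient is an effect on the log-odds
print(model.predict_proba(X_test[:3]))  # predicted probabilities for new cases
```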

Decision Trees

  • Splits data recursively based on feature values—creates a tree-like structure where each branch represents a decision rule, making the model highly interpretable
  • No assumptions about data distribution—works with both numerical and categorical variables without requiring normalization
  • Prone to overfitting on complex datasets, especially when trees grow deep; requires pruning or ensemble methods to generalize well
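
A minimal sketch, assuming scikit-learn and the built-in iris data; capping max_depth stands in for pruning here:

```python
# Illustrative sketch: a shallow decision tree whose rules can be printed and read.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0)  # limiting depth curbs overfitting
tree.fit(X, y)

print(export_text(tree))  # the learned decision rules, one branch per line
```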

Support Vector Machines (SVM)

  • Finds the optimal hyperplane that maximizes the margin between classes—the margin is the distance between the decision boundary and the nearest data points
  • Kernel functions (RBF, polynomial) enable classification of non-linearly separable data by projecting into higher dimensions
  • Sensitive to parameter tuning—the choice of kernel and regularization parameter C significantly impacts performance
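
A minimal RBF-kernel sketch, assuming scikit-learn; the C and gamma values are placeholders you would normally tune:

```python
# Illustrative sketch: RBF-kernel SVM on non-linearly separable toy data.
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

# Scaling matters because the margin is distance-based; C controls regularization strength.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X, y)
print(clf.score(X, y))  # accuracy on the toy training data
```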

Naive Bayes

  • Probabilistic classifier based on Bayes' theorem—calculates P(class | features) assuming all features are conditionally independent
  • Excels at text classification and high-dimensional sparse data (spam filtering, sentiment analysis) despite its simplifying assumptions
  • Fast training and prediction—requires minimal data to estimate parameters, making it ideal for rapid prototyping
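
A minimal text-classification sketch, assuming scikit-learn; the tiny document set and spam labels are made up for illustration:

```python
# Illustrative sketch: Naive Bayes over sparse word counts (spam vs. not spam).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["win a free prize now", "meeting agenda attached",
        "free offer click now", "quarterly report draft"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam (hypothetical labels)

vec = CountVectorizer()
X = vec.fit_transform(docs)         # high-dimensional, sparse feature matrix
nb = MultinomialNB().fit(X, labels)

print(nb.predict(vec.transform(["free prize meeting"])))  # classify a new message
```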

Compare: Logistic Regression vs. Naive Bayes—both produce probability estimates for classification, but logistic regression models feature relationships directly while Naive Bayes assumes feature independence. If an FRQ asks about interpretability, choose logistic regression; for text data with many features, Naive Bayes often wins.

K-Nearest Neighbors (KNN)

  • Instance-based learning—classifies new points by majority vote of the K nearest neighbors; no explicit model is built during training
  • Choice of K controls bias-variance tradeoff—small K leads to high variance (overfitting), large K leads to high bias (underfitting)
  • Computationally expensive at prediction time—must calculate distances to all training points, making it impractical for large datasets
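
A minimal sketch that makes the bias-variance tradeoff concrete by cross-validating several K values (assuming scikit-learn; the K grid is illustrative):

```python
# Illustrative sketch: compare small vs. large K with cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
for k in (1, 5, 15):  # small K -> high variance, large K -> high bias
    knn = KNeighborsClassifier(n_neighbors=k)
    print(k, round(cross_val_score(knn, X, y, cv=5).mean(), 3))
```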

Compare: SVM vs. KNN—both can handle non-linear boundaries, but SVM builds an explicit decision boundary while KNN stores all training data. SVM scales better to large datasets; KNN requires no training but expensive prediction.


Supervised Learning: Regression Algorithms

These algorithms predict continuous numerical outcomes rather than categories. The core mechanism is minimizing the difference between predicted and actual values.

Linear Regression

  • Models linear relationships between dependent and independent variables by minimizing the sum of squared errors, Σ(yᵢ − ŷᵢ)²
  • Coefficients indicate effect size and direction—a one-unit increase in x produces a β change in y, enabling clear business interpretation
  • Assumes homoscedasticity and normality of residuals—violations lead to unreliable standard errors and hypothesis tests
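
A minimal sketch on synthetic data with known coefficients, assuming scikit-learn; in practice you would also check the residual assumptions noted above:

```python
# Illustrative sketch: ordinary least squares and coefficient interpretation.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=200)  # true betas: 3.0, -1.5

ols = LinearRegression().fit(X, y)
print(ols.coef_)       # estimated betas: change in y per one-unit change in each x
print(ols.intercept_)
```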

Compare: Linear vs. Logistic Regression—linear regression predicts continuous values, logistic regression predicts probabilities for categorical outcomes. The key difference is the link function: identity for linear, logit for logistic. Know which to apply based on your dependent variable type.


Ensemble Methods: Combining Weak Learners

Ensemble methods aggregate multiple models to reduce variance, bias, or both. The underlying principle is that diverse models making independent errors will cancel out when combined.

Random Forests

  • Aggregates many decision trees trained on random subsets of data and features—bagging reduces variance while feature randomization reduces correlation between trees
  • Provides feature importance scores by measuring how much each variable decreases impurity across all trees, aiding variable selection
  • Robust to overfitting compared to single decision trees, though interpretability is sacrificed for predictive power
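
A minimal sketch showing the impurity-based importance scores, assuming scikit-learn and its built-in breast cancer dataset:

```python
# Illustrative sketch: random forest plus its feature importance scores.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
rf = RandomForestClassifier(n_estimators=200, random_state=0)  # bagged, feature-randomized trees
rf.fit(X, y)

# Importance = average impurity decrease attributable to each feature across the trees.
top5 = sorted(enumerate(rf.feature_importances_), key=lambda t: t[1], reverse=True)[:5]
print(top5)
```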

Gradient Boosting Machines (XGBoost)

  • Builds trees sequentially—each new tree corrects the errors (residuals) of the combined ensemble so far, focusing on hard-to-predict cases
  • State-of-the-art performance in structured data competitions; XGBoost adds regularization and efficient computation to standard gradient boosting
  • Requires careful hyperparameter tuning—learning rate, tree depth, and number of iterations must be balanced to prevent overfitting
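
A minimal sketch of sequential boosting using scikit-learn's GradientBoostingClassifier (the same idea XGBoost implements with added regularization and faster computation); the hyperparameter values are illustrative:

```python
# Illustrative sketch: gradient boosting, where each new tree fits the ensemble's residual errors.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# learning_rate, max_depth, and n_estimators are the knobs that must be balanced.
gbm = GradientBoostingClassifier(learning_rate=0.1, max_depth=3,
                                 n_estimators=200, random_state=0)
gbm.fit(X_train, y_train)
print(gbm.score(X_test, y_test))  # held-out accuracy
```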

Compare: Random Forests vs. Gradient Boosting—both use decision trees, but Random Forests train trees independently (parallel) while Gradient Boosting trains sequentially (iterative). Random Forests are more robust to hyperparameter choices; Gradient Boosting typically achieves higher accuracy with proper tuning.


Unsupervised Learning: Clustering Algorithms

Clustering algorithms find natural groupings in unlabeled data. The goal is to maximize similarity within clusters and dissimilarity between clusters.

K-Means Clustering

  • Partitions data into K clusters by iteratively assigning points to the nearest centroid and updating centroids to cluster means
  • Requires pre-specifying K—use the elbow method or silhouette scores to determine optimal cluster count
  • Assumes spherical, equally-sized clusters—struggles with irregular shapes or clusters of varying density
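
A minimal sketch that scans several K values with silhouette scores, assuming scikit-learn; the blob data and K range are illustrative:

```python
# Illustrative sketch: choose K by comparing silhouette scores.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))  # higher = better-separated clusters
```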

Hierarchical Clustering

  • Builds a dendrogram showing nested cluster relationships—can be agglomerative (bottom-up merging) or divisive (top-down splitting)
  • No need to pre-specify cluster count—cut the dendrogram at any level to obtain different numbers of clusters
  • Computationally intensive—requires O(n²) distance calculations, making it impractical for very large datasets
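
A minimal agglomerative sketch, assuming SciPy; cutting the linkage tree into three clusters is an arbitrary illustrative choice:

```python
# Illustrative sketch: bottom-up (agglomerative) clustering, then "cut" the dendrogram.
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)
Z = linkage(X, method="ward")                     # pairwise merging; O(n²) distances underneath
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree to obtain 3 clusters
print(labels)
```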

DBSCAN

  • Density-based clustering that groups points in high-density regions while marking low-density points as outliers (noise)
  • Discovers arbitrarily shaped clusters—unlike K-Means, can identify non-spherical patterns and naturally handles outliers
  • Sensitive to parameters ϵ (neighborhood radius) and MinPts—performance varies significantly with parameter choices
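
A minimal sketch on crescent-shaped toy data, assuming scikit-learn; the eps and min_samples values are illustrative and usually need tuning:

```python
# Illustrative sketch: density-based clustering that also flags noise points.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)  # eps = neighborhood radius, min_samples = MinPts
print(set(labels))  # cluster ids; -1 marks points treated as outliers
```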

Compare: K-Means vs. DBSCAN—K-Means requires specifying K and assumes spherical clusters; DBSCAN automatically determines cluster count and handles irregular shapes. Choose K-Means for well-separated, round clusters; choose DBSCAN when outliers are meaningful or cluster shapes are unknown.


Dimensionality Reduction and Pattern Discovery

These techniques simplify complex data or reveal hidden relationships. The goal is to reduce noise, enable visualization, or uncover structure that isn't immediately apparent.

Principal Component Analysis (PCA)

  • Transforms features into uncorrelated principal components ordered by variance explained—the first component captures the most variance, the second captures the most remaining variance orthogonal to the first, and so on
  • Enables visualization of high-dimensional data by projecting onto 2-3 principal components while preserving maximum information
  • Assumes linear relationships—non-linear patterns may require kernel PCA or other manifold learning techniques
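
A minimal sketch projecting a 30-feature dataset onto two components, assuming scikit-learn; standardizing first is the usual precaution because PCA is scale-sensitive:

```python
# Illustrative sketch: PCA for visualization, with variance explained per component.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)  # share of total variance captured by each component
```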

Association Rule Mining (Apriori)

  • Discovers co-occurrence patterns in transactional data using metrics: support (frequency), confidence (conditional probability), and lift (improvement over random)
  • Market basket analysis is the classic application—identifying which products are frequently purchased together
  • Parameter tuning is critical—low thresholds generate overwhelming numbers of trivial rules; high thresholds may miss meaningful patterns
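
A minimal by-hand sketch of the three metrics for a single made-up rule ({bread} → {butter}); a real analysis would run an Apriori implementation over many transactions:

```python
# Illustrative sketch: support, confidence, and lift computed directly from toy baskets.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk"},
    {"butter", "eggs"},
]
n = len(transactions)

support_bread = sum("bread" in t for t in transactions) / n
support_butter = sum("butter" in t for t in transactions) / n
support_both = sum({"bread", "butter"} <= t for t in transactions) / n

confidence = support_both / support_bread   # P(butter | bread)
lift = confidence / support_butter          # improvement over buying butter at random

print(support_both, confidence, lift)
```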

Compare: PCA vs. Association Rule Mining—both uncover hidden structure, but PCA reduces continuous feature dimensions while Association Rules find discrete item relationships. PCA is for numeric data preprocessing; Association Rules are for categorical transaction analysis.


Deep Learning: Neural Networks

Neural networks model complex, non-linear relationships through layered architectures. Information flows through interconnected nodes, with each layer learning increasingly abstract representations.

Neural Networks

  • Composed of layers of neurons—input layer receives features, hidden layers learn representations, output layer produces predictions
  • Universal approximators—with sufficient neurons and layers, can model virtually any function, including highly non-linear relationships
  • Resource-intensive—requires significant computational power, large training datasets, and careful hyperparameter tuning (learning rate, architecture, regularization)
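
A minimal feed-forward sketch, assuming scikit-learn's MLPClassifier; the two-hidden-layer architecture and regularization value are illustrative:

```python
# Illustrative sketch: a small multi-layer perceptron (input -> hidden layers -> output).
from sklearn.datasets import load_breast_cancer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

net = make_pipeline(
    StandardScaler(),  # scaling helps gradient-based training converge
    MLPClassifier(hidden_layer_sizes=(32, 16), alpha=1e-4, max_iter=1000, random_state=0),
)
net.fit(X, y)
print(net.score(X, y))  # training accuracy; real work needs a held-out set and tuning
```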

Time-Dependent Analysis

Time series methods account for temporal structure in data. The key challenge is that observations are not independent—past values influence future values.

Time Series Analysis

  • Captures trends, seasonality, and cycles in time-ordered data using methods like ARIMA, exponential smoothing, and seasonal decomposition
  • Stationarity is often required—many methods assume constant mean and variance over time; differencing or transformations may be needed
  • Forecasting applications include demand planning, financial prediction, and capacity planning based on historical patterns
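
A minimal forecasting sketch, assuming the statsmodels library; the synthetic drifting series and the (1, 1, 1) order are illustrative only:

```python
# Illustrative sketch: ARIMA on a non-stationary series; d=1 differences it once.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(loc=0.5, scale=1.0, size=120))  # random walk with drift

result = ARIMA(y, order=(1, 1, 1)).fit()
print(result.forecast(steps=6))  # point forecasts for the next 6 periods
```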

Compare: Neural Networks vs. Time Series (ARIMA)—neural networks (especially LSTMs) can capture complex non-linear temporal patterns but require more data; ARIMA provides interpretable parameters and works well with smaller datasets showing clear linear trends.


Quick Reference Table

Concept | Best Examples
Binary Classification | Logistic Regression, SVM, Naive Bayes
Multi-class Classification | Decision Trees, KNN, Neural Networks
Continuous Prediction | Linear Regression, Neural Networks, Gradient Boosting
Ensemble Methods | Random Forests, XGBoost
Distance-based Clustering | K-Means, Hierarchical Clustering
Density-based Clustering | DBSCAN
Dimensionality Reduction | PCA
Pattern Discovery | Association Rule Mining (Apriori)
Temporal Forecasting | Time Series Analysis (ARIMA)

Self-Check Questions

  1. Which two algorithms both use decision trees as base learners but differ in how they combine them? Explain the key difference in their training approach.

  2. You have a dataset with potential outliers and clusters of irregular shapes. Compare K-Means and DBSCAN—which would you choose and why?

  3. A business stakeholder asks you to explain why a customer was classified as high-risk. Which classification algorithms would allow you to provide interpretable explanations, and which would not?

  4. Compare and contrast PCA and feature selection using Random Forest importance scores. When would you prefer one approach over the other?

  5. You need to predict whether a customer will churn (yes/no) using 50 features, some of which may be irrelevant. Recommend an algorithm and justify your choice based on the algorithm's properties discussed in this guide.