Why This Matters
Machine learning algorithms aren't just tools you memorize—they're solutions to specific types of problems, and understanding when and why to use each one is what separates a capable ML engineer from someone who just runs code. You're being tested on your ability to match algorithms to problem types, understand their underlying mathematical principles, and recognize their trade-offs in real-world applications. The concepts here—supervised vs. unsupervised learning, parametric vs. non-parametric models, bias-variance trade-offs, and ensemble methods—form the foundation of every ML system you'll build or evaluate.
Don't just memorize algorithm names and definitions. Know what problem type each algorithm solves, what assumptions it makes about your data, and what failure modes to watch for. When you encounter a new dataset or problem, you should immediately be able to narrow down your algorithm choices based on these principles. That's the thinking that gets tested in interviews, system design questions, and real engineering work.
Regression Algorithms: Predicting Continuous Values
These algorithms predict numerical outcomes by learning relationships between input features and target variables. The core principle is fitting a function that minimizes the difference between predicted and actual values.
Linear Regression
- Models relationships using a linear equation y = β₀ + β₁x₁ + … + βₙxₙ—the simplest supervised learning algorithm and often your baseline model
- Assumes a linear relationship between features and target, plus independent, homoscedastic, normally distributed errors—violations of these assumptions can invalidate your confidence intervals and hypothesis tests
- Interpretable coefficients show exactly how much each feature contributes to the prediction, making it invaluable for understanding feature impact
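A minimal sketch of this idea, assuming scikit-learn and a small synthetic dataset purely for illustration, showing how the fitted coefficients recover the underlying feature contributions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data where the true relationship is y = 3*x1 - 2*x2 + noise (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)  # coefficients come out close to [3, -2]
```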
Logistic Regression
- Predicts probabilities for binary classification using the sigmoid function σ(z) = 1 / (1 + e^(−z))—despite the name, this is a classifier, not a regression algorithm
- Outputs probabilities between 0 and 1 that you convert to class labels using a threshold (typically 0.5, but adjustable based on precision-recall trade-offs)
- Highly interpretable and fast to train—often the first algorithm to try for binary classification before moving to more complex models
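A hedged sketch of the probability-plus-threshold workflow, assuming scikit-learn and synthetic data; the 0.3 threshold is just an arbitrary example of trading precision for recall:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem (illustrative only)
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]  # probability of the positive class

# Default 0.5 threshold vs. a lower threshold that favors recall
default_preds = (proba >= 0.5).astype(int)
high_recall_preds = (proba >= 0.3).astype(int)
```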
Compare: Linear Regression vs. Logistic Regression—both learn linear decision boundaries, but Linear Regression predicts continuous values while Logistic Regression predicts class probabilities. If asked to choose between them, the answer depends entirely on whether your target variable is continuous or categorical.
Tree-Based Methods: Learning Decision Rules
Tree-based algorithms split data recursively based on feature values, creating interpretable rule structures. They work by finding the feature splits that best separate your target classes or reduce prediction variance.
Decision Trees
- Splits data into branches based on feature thresholds—each internal node represents a decision rule, each leaf represents a prediction
- Highly interpretable and requires no feature scaling—you can literally trace the path from root to leaf to explain any prediction
- Prone to overfitting without pruning or depth limits—a deep tree memorizes training data rather than learning generalizable patterns
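A minimal sketch, assuming scikit-learn and the bundled iris dataset, of capping tree depth and then printing the learned rules so any prediction can be traced from root to leaf:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# A depth limit is a simple guard against memorizing the training data
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# The learned decision rules, readable as nested if/else thresholds
print(export_text(tree, feature_names=load_iris().feature_names))
```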
Random Forests
- Ensemble of decision trees trained on bootstrap samples—each tree sees a random subset of data and features, reducing correlation between trees
- Reduces overfitting through averaging (bagging)—individual trees may overfit, but their aggregate prediction is more stable
- Provides feature importance scores by measuring how much each feature reduces impurity across all trees—essential for feature selection
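A short sketch of training a forest and reading its impurity-based importances, assuming scikit-learn and synthetic data for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5, random_state=0)

# Each tree trains on a bootstrap sample and considers a random subset of features per split
forest = RandomForestClassifier(n_estimators=300, max_features="sqrt", random_state=0)
forest.fit(X, y)

# One impurity-based importance score per feature, useful for feature selection
print(forest.feature_importances_)
```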
Gradient Boosting Algorithms (XGBoost, LightGBM)
- Builds trees sequentially to correct previous errors—each new tree focuses on the residuals (mistakes) of the ensemble so far
- Dominates structured/tabular data competitions—XGBoost and LightGBM are often the winning algorithms on Kaggle for non-image, non-text data
- Requires careful regularization through learning rate, tree depth, and early stopping to prevent overfitting on training data
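A sketch of those regularization levers using scikit-learn's histogram-based booster as a stand-in for XGBoost or LightGBM; those libraries expose the same knobs (learning rate, tree depth, early stopping) under similar names, and the data here is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Small learning rate + shallow trees + early stopping: the usual anti-overfitting combination
booster = HistGradientBoostingClassifier(
    learning_rate=0.05,
    max_depth=4,
    max_iter=500,
    early_stopping=True,
    validation_fraction=0.1,
    random_state=0,
)
booster.fit(X_train, y_train)
print(booster.score(X_test, y_test))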
Compare: Random Forests vs. Gradient Boosting—both are tree ensembles, but Random Forests train trees independently (bagging) while Gradient Boosting trains them sequentially (boosting). Random Forests are more robust to hyperparameter choices; Gradient Boosting typically achieves higher accuracy but requires more tuning.
Distance-Based Methods: Learning from Similarity
These algorithms make predictions based on how similar new data points are to existing examples. The core assumption is that similar inputs should produce similar outputs.
K-Nearest Neighbors (KNN)
- Classifies based on majority vote of k nearest neighbors—a non-parametric, instance-based method that stores all training data
- No training phase, but expensive at prediction time—must compute distances to all training points for each new prediction, scaling poorly with dataset size
- Highly sensitive to feature scaling and choice of k—always normalize features and use cross-validation to select k
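A minimal sketch of the recommended workflow, assuming scikit-learn: scale features inside a pipeline, then pick k by cross-validation (the candidate values are arbitrary examples):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Scale features first, then search over k with 5-fold cross-validation
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
grid = GridSearchCV(pipe, {"kneighborsclassifier__n_neighbors": [3, 5, 11, 21]}, cv=5)
grid.fit(X, y)
print(grid.best_params_)
```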
Support Vector Machines (SVM)
- Finds the maximum-margin hyperplane separating classes—the decision boundary that maximizes distance to the nearest points (support vectors)
- Handles non-linear boundaries via kernel trick—transforms data into higher dimensions where linear separation becomes possible using kernels like RBF or polynomial
- Effective in high-dimensional spaces but scales poorly with large datasets—training complexity is roughly O(n²) to O(n³)
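A hedged sketch of an RBF-kernel SVM on data that is not linearly separable, assuming scikit-learn; the two-moons dataset is a toy example where the kernel trick finds the curved boundary:

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Non-linearly separable toy data: the RBF kernel handles the curved boundary
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X, y)
print(clf.score(X, y))
```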
Compare: KNN vs. SVM—both rely on distance calculations, but KNN uses distances at prediction time while SVM uses them during training to find support vectors. KNN is simpler but slower at inference; SVM is slower to train but faster to predict.
Probabilistic Methods: Learning from Distributions
These algorithms explicitly model probability distributions over your data. They apply Bayes' theorem and statistical assumptions to make predictions.
Naive Bayes
- Applies Bayes' theorem with an independence assumption: P(y|x) ∝ P(y) ∏ᵢ P(xᵢ|y)—assumes features are conditionally independent given the class
- Extremely fast and effective for text classification—works surprisingly well despite the "naive" independence assumption rarely holding in practice
- Handles high-dimensional sparse data well—the independence assumption actually helps prevent overfitting when you have many features
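A toy sketch of the text-classification use case, assuming scikit-learn; the four-document corpus and spam labels are invented purely to show the vectorizer-plus-Naive-Bayes pipeline:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny invented corpus (illustrative only); real text data would be far larger
docs = ["free money now", "meeting at noon", "win a free prize", "project status update"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

# Sparse TF-IDF features feed directly into multinomial Naive Bayes
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(docs, labels)
print(clf.predict(["free prize meeting"]))
```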
Unsupervised Learning: Finding Structure Without Labels
These algorithms discover patterns in data without target labels. They identify natural groupings or reduce dimensionality to reveal underlying structure.
K-Means Clustering
- Partitions data into k clusters by minimizing within-cluster variance—iteratively assigns points to nearest centroid, then updates centroids until convergence
- Requires specifying k in advance—use the elbow method or silhouette scores to select an appropriate number of clusters
- Sensitive to initialization and assumes spherical clusters—run multiple times with different random seeds; consider DBSCAN for non-spherical clusters
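A short sketch of choosing k with silhouette scores and multiple initializations, assuming scikit-learn and synthetic blob data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=1000, centers=4, random_state=0)

# Try several values of k; n_init reruns the algorithm from different random centroids
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, silhouette_score(X, km.labels_))
```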
Principal Component Analysis (PCA)
- Projects data onto orthogonal axes of maximum variance—finds linear combinations of features that capture the most information
- Reduces dimensionality while preserving variance—essential for visualization, noise reduction, and speeding up downstream algorithms
- Components are linear and may not capture complex structure—consider t-SNE or UMAP for non-linear dimensionality reduction in visualization tasks
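A minimal sketch of variance-preserving dimensionality reduction, assuming scikit-learn and its digits dataset; passing a fraction to n_components keeps just enough components to explain that share of the variance:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)  # 64 pixel features per image

# Keep enough components to explain ~95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X.shape[1], "->", X_reduced.shape[1])
print(pca.explained_variance_ratio_.sum())
```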
Compare: K-Means vs. PCA—both are unsupervised, but K-Means groups similar data points into clusters while PCA transforms features into a lower-dimensional representation. K-Means is for segmentation; PCA is for compression and preprocessing.
Deep Learning: Learning Hierarchical Representations
Neural networks learn complex, non-linear patterns through layers of transformations. Each layer extracts increasingly abstract features from the input data.
Neural Networks and Deep Learning
- Composed of layers of neurons with learnable weights—each neuron computes σ(w⋅x+b) where σ is an activation function
- Excel at unstructured data like images, text, and audio—convolutional networks for images, transformers for text, recurrent networks for sequences
- Require large datasets and significant compute—prone to overfitting on small datasets; use regularization, dropout, and data augmentation
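A deliberately small sketch of the layered structure using scikit-learn's MLP, just to show stacked non-linear transformations with early stopping; a real deep learning workload on images or text would use a framework such as PyTorch or TensorFlow instead:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two hidden layers of ReLU units; early stopping guards against overfitting
net = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(128, 64), activation="relu",
                  early_stopping=True, max_iter=500, random_state=0),
)
net.fit(X_train, y_train)
print(net.score(X_test, y_test))
```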
Ensemble Principles: Combining Model Strengths
Ensemble methods improve performance by aggregating multiple models. The key insight is that diverse models make different errors, which cancel out when combined.
Ensemble Methods
- Bagging trains models independently on bootstrap samples—reduces variance by averaging predictions (Random Forests are the canonical example)
- Boosting trains models sequentially on weighted errors—reduces bias by focusing on hard examples (Gradient Boosting, AdaBoost)
- Stacking combines models using a meta-learner—trains a second-level model on the predictions of base models to learn optimal combination weights
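A hedged sketch of stacking, assuming scikit-learn: two diverse base models feed their predictions into a logistic regression meta-learner, with cross-validation generating the meta-features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (HistGradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Base models make different errors; the meta-learner learns how to weight their predictions
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("gb", HistGradientBoostingClassifier(random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
print(cross_val_score(stack, X, y, cv=3).mean())
```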
Compare: Bagging vs. Boosting—bagging reduces variance by averaging independent models; boosting reduces bias by sequentially correcting errors. Use bagging when your base model overfits; use boosting when your base model underfits.
Quick Reference Table
| Scenario | Algorithms |
| --- | --- |
| Regression (continuous targets) | Linear Regression |
| Binary/Multi-class Classification | Logistic Regression, SVM, Naive Bayes, KNN |
| Tree-Based Methods | Decision Trees, Random Forests, XGBoost, LightGBM |
| Ensemble Learning | Random Forests (bagging), Gradient Boosting (boosting) |
| Unsupervised Clustering | K-Means |
| Dimensionality Reduction | PCA |
| Deep Learning | Neural Networks, CNNs, Transformers |
| High-Dimensional Sparse Data | Naive Bayes, SVM, Logistic Regression |
Self-Check Questions
- You have a tabular dataset with 50 features and 10,000 rows, predicting a binary outcome. Which two algorithms would you try first, and why?
- Compare and contrast Random Forests and Gradient Boosting: what type of error does each primarily reduce (bias or variance), and how does their training process differ?
- A colleague suggests using KNN on a dataset with 1 million training examples. What performance concern would you raise, and what alternative algorithm might you suggest?
- When would you choose Naive Bayes over Logistic Regression for a text classification problem? What assumption makes Naive Bayes particularly suited to high-dimensional sparse data?
- You're preprocessing data for a downstream classification algorithm and want to reduce 500 features to 50. Which algorithm would you use, what does it optimize for, and what information might you lose in the process?