Machine learning models form the backbone of modern data analysis, and you're being tested on more than just definitions—you need to understand when to apply each model, why certain models outperform others in specific situations, and how they handle different types of data. The AP exam will ask you to choose appropriate models for given scenarios, interpret their outputs, and recognize their limitations. Think of this as building a toolkit where each algorithm solves a particular type of problem.
The key concepts here revolve around supervised vs. unsupervised learning, classification vs. regression tasks, model complexity vs. interpretability tradeoffs, and ensemble methods. When you encounter a question about predicting categories versus continuous values, or about handling high-dimensional data, you need to immediately connect it to the right model family. Don't just memorize what each algorithm does—know what problem type each one solves and why it's suited for that task.
Regression models are the foundational approach: they establish relationships between input features and outputs by estimating parameters that minimize prediction error, which makes them interpretable and widely applicable as baseline approaches.
Compare: Linear Regression vs. Logistic Regression—both model relationships between features and outcomes, but linear regression predicts continuous values while logistic regression predicts probabilities for categories. If an FRQ gives you a yes/no outcome, logistic is your answer.
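Here's a minimal sketch of that contrast, assuming scikit-learn (the guide doesn't prescribe a library) and a small synthetic dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))

# Continuous target -> linear regression predicts a value.
y_cont = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)
lin = LinearRegression().fit(X, y_cont)
print("slopes:", lin.coef_)  # one interpretable coefficient per feature

# Yes/no target -> logistic regression predicts a probability.
y_bin = (X[:, 0] + X[:, 1] > 0).astype(int)
log = LogisticRegression().fit(X, y_bin)
print("P(class=1):", log.predict_proba(X[:2])[:, 1])
```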
Tree-based approaches work by recursively partitioning data based on feature values. Each split maximizes information gain or minimizes impurity, creating interpretable decision paths that mirror human reasoning.
Compare: Decision Trees vs. Random Forests—single trees are interpretable but overfit easily; random forests sacrifice some interpretability for significantly better accuracy and stability. When asked about the bias-variance tradeoff, this is your go-to example.
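A quick sketch of that bias-variance story, again assuming scikit-learn and synthetic data; the size of the accuracy gap will vary with the dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# A single tree fits the training data aggressively (low bias, high variance).
tree = DecisionTreeClassifier(random_state=0)
# A forest averages many decorrelated trees, trading interpretability for stability.
forest = RandomForestClassifier(n_estimators=100, random_state=0)

print("tree CV accuracy:  ", cross_val_score(tree, X, y, cv=5).mean())
print("forest CV accuracy:", cross_val_score(forest, X, y, cv=5).mean())
```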
Distance- and margin-based models classify data by measuring similarity or finding optimal separations in feature space. The geometry of your data determines which approach works best—KNN relies on local neighborhoods while SVM finds global decision boundaries.
Compare: KNN vs. SVM—both are classification workhorses, but KNN makes local decisions based on neighbors while SVM finds a global boundary. KNN is simple but slow at prediction; SVM is complex to tune but efficient once trained.
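The tradeoff shows up in where the work happens. A hedged sketch, assuming scikit-learn; the kernel and C value below are illustrative defaults, not recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)  # "training" mostly stores the data
svm = SVC(kernel="rbf", C=1.0).fit(X, y)             # the optimization happens here

# At prediction time the costs flip: KNN must search its stored neighbors,
# while the SVM just evaluates the boundary it already learned.
print("KNN:", knn.predict(X[:3]), "SVM:", svm.predict(X[:3]))
```

Note that both models are distance-sensitive, so in practice you'd usually standardize features before fitting either one.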
Neural networks model intricate, non-linear relationships through layers of interconnected nodes. Each layer transforms inputs through weighted connections and activation functions, enabling the network to learn hierarchical representations.
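A small sketch of that idea, assuming scikit-learn's MLPClassifier; the layer sizes are arbitrary illustrative choices:

```python
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

# Two interleaved half-moons: no straight line separates the classes.
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

# Each hidden layer applies weighted connections, then a ReLU activation.
mlp = MLPClassifier(hidden_layer_sizes=(16, 8), activation="relu",
                    max_iter=2000, random_state=0).fit(X, y)
print("training accuracy:", mlp.score(X, y))
```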
Probabilistic classifiers use statistical principles to assign class labels. Bayes' theorem provides the mathematical foundation, calculating the probability of each class given the observed features.
Compare: Naive Bayes vs. Logistic Regression—both handle classification, but Naive Bayes assumes feature independence and applies Bayes' theorem directly to compute class probabilities, while logistic regression learns feature weights without any independence assumption. Naive Bayes trains faster; logistic regression often achieves higher accuracy when features interact.
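A sketch of the head-to-head, assuming scikit-learn; GaussianNB is used here because the synthetic features are continuous (text tasks would typically use MultinomialNB on word counts):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Naive Bayes: P(y | x) is proportional to P(y) * product_i P(x_i | y),
# treating every feature as independent given the class.
nb = GaussianNB()
# Logistic regression: learns a weight per feature, no independence assumption.
lr = LogisticRegression(max_iter=1000)

print("NB accuracy:", cross_val_score(nb, X, y, cv=5).mean())
print("LR accuracy:", cross_val_score(lr, X, y, cv=5).mean())
```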
Unlike supervised methods, unsupervised algorithms discover patterns in data without predefined outcomes. The goal is to reveal hidden structure—whether grouping similar items or reducing data complexity.
Compare: K-Means vs. PCA—both are unsupervised, but K-Means groups similar data points while PCA reduces feature dimensions. K-Means answers "what clusters exist?" while PCA answers "which features matter most?"
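A sketch contrasting those two questions, assuming scikit-learn and synthetic blob data; k=3 and 2 components are illustrative choices:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

X, _ = make_blobs(n_samples=300, n_features=4, centers=3, random_state=0)

# K-Means: "what clusters exist?" You must choose k up front.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("cluster sizes:", [int((labels == k).sum()) for k in range(3)])

# PCA: "which directions carry the most variance?" Keep the top 2.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print("variance explained:", pca.explained_variance_ratio_)
```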
| Concept | Best Examples |
|---|---|
| Predicting continuous values | Linear Regression |
| Binary/multiclass classification | Logistic Regression, Naive Bayes, SVM, KNN |
| High interpretability | Decision Trees, Linear Regression |
| Ensemble methods (reduce overfitting) | Random Forests |
| High-dimensional data | SVM, PCA |
| Text classification | Naive Bayes |
| Complex pattern recognition | Neural Networks |
| Unsupervised clustering | K-Means |
| Dimensionality reduction | PCA |
1. Which two models are both used for classification but differ in whether they build an explicit model during training? What are the tradeoffs between them?
2. You're given a dataset with 50 features and only 30 samples. Which model would likely perform best, and why might you apply PCA first?
3. Compare and contrast Decision Trees and Random Forests in terms of interpretability, overfitting risk, and when you'd choose one over the other.
4. A company wants to segment customers into groups based on purchasing behavior, but they don't have predefined categories. Which algorithm should they use, and what key parameter must they specify?
5. An FRQ asks you to recommend a model for predicting whether an email is spam. Identify two appropriate models and explain why each would work, noting any preprocessing considerations.