Fiveable

👁️Computer Vision and Image Processing Unit 5 Review

5.4 Decision trees and random forests

Written by the Fiveable Content Team • Last updated August 2025

Decision trees and random forests are classification tools that break down complex visual data into a series of simple decision rules. In computer vision, they're used for tasks like image classification and object detection because they're interpretable, efficient, and handle high-dimensional feature data well.

Random forests take the decision tree concept further by combining many trees into an ensemble. This reduces overfitting and captures more intricate patterns in image features than any single tree could manage on its own.

Decision tree fundamentals

A decision tree works by splitting data through a series of yes/no questions about feature values. Each question narrows down the possibilities until you reach a final prediction. Think of it as a flowchart: you start at the top, answer questions about your data at each step, and follow the appropriate branch until you land on a classification.

Structure of decision trees

  • Root node represents the entire dataset and asks the first splitting question
  • Internal nodes correspond to feature-based decisions that split the data further
  • Leaf nodes contain the final predictions or class labels
  • Branches connect nodes, representing the possible outcomes of each decision
  • Tree depth (the number of levels from root to the deepest leaf) controls how complex and granular the decision-making process gets
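The flowchart analogy can be made concrete in a few lines of Python. This is a hand-written sketch with hypothetical feature names and thresholds, not a trained model:

```python
# Minimal sketch of decision-tree prediction: start at the root, answer one
# threshold question per internal node, and stop at a leaf.
# The tree structure, feature names, and thresholds are hypothetical.

tree = {
    "feature": "mean_intensity", "threshold": 0.5,   # root node
    "left": {"label": "dark_scene"},                 # leaf node
    "right": {                                       # internal node
        "feature": "edge_density", "threshold": 0.3,
        "left": {"label": "smooth_texture"},         # leaf node
        "right": {"label": "busy_texture"},          # leaf node
    },
}

def predict(node, sample):
    """Follow branches until a leaf node (one with a 'label') is reached."""
    while "label" not in node:
        branch = "left" if sample[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]

print(predict(tree, {"mean_intensity": 0.8, "edge_density": 0.1}))  # smooth_texture
```

Real libraries store trees in flat arrays rather than nested dicts, but the traversal logic is the same.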

Splitting criteria

At each internal node, the tree needs to decide which feature and threshold to split on. Several criteria can guide that choice:

  • Gini impurity measures the probability of misclassifying a randomly chosen element from the node. Lower is better.
  • Information gain quantifies how much entropy (disorder) decreases after a split. Higher gain means a more informative split.
  • Gain ratio normalizes information gain to avoid bias toward features with many possible values.
  • Chi-square test evaluates whether the relationship between a feature and the target variable is statistically significant.
  • Mean decrease impurity ranks feature importance by measuring the total reduction in node impurity a feature provides across the tree.

Pruning techniques

Without pruning, a decision tree will keep splitting until every leaf is perfectly pure, which almost always means overfitting. Pruning cuts back the tree to improve generalization.

  • Pre-pruning stops tree growth early by setting constraints during training, such as limiting maximum depth, requiring a minimum number of samples per leaf, or setting a minimum impurity decrease for a split to occur.
  • Post-pruning grows the full tree first, then removes branches that don't meaningfully improve performance.
    • Cost complexity pruning balances tree size against classification error using a complexity parameter.
    • Reduced error pruning replaces subtrees with leaf nodes whenever doing so doesn't decrease accuracy on a validation set.
  • Minimal cost-complexity pruning searches for the subtree that achieves the lowest cost-complexity measure, giving you the best trade-off between simplicity and accuracy.

Random forest overview

A random forest is an ensemble of many decision trees, each trained slightly differently. The forest aggregates their individual predictions (by majority vote for classification, or averaging for regression) to produce a final output that's more accurate and stable than any single tree.

Ensemble learning principles

The core idea is "wisdom of the crowd": many imperfect models, when combined, often outperform a single highly tuned model. For this to work, the individual trees need to be diverse, meaning they make different errors. Random forests achieve diversity through two mechanisms: bootstrapping the training data and randomly restricting which features each split can consider.

  • Bootstrapping creates a different training subset for each tree by sampling with replacement
  • Aggregation combines predictions from all trees into one final output
  • Parallel processing is straightforward since each tree trains independently of the others

Bagging vs boosting

These are two distinct ensemble strategies, and it's worth understanding how they differ:

  • Bagging (Bootstrap Aggregating) builds trees independently and in parallel. Each tree trains on a random subset of the data sampled with replacement. This primarily reduces variance and helps prevent overfitting. Random forests use bagging.
  • Boosting builds trees sequentially, where each new tree focuses on correcting the mistakes of the previous ones. Misclassified samples receive higher weights in subsequent iterations. This primarily reduces bias. Gradient boosting machines (GBMs) use this approach.

The key distinction: bagging keeps all samples equally weighted, while boosting adjusts weights to focus on hard-to-classify examples.
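The bagging half of this comparison can be illustrated without any library. The "model" below is deliberately trivial (it just predicts the majority class of its bootstrap sample); it stands in for a real decision tree:

```python
import random
from collections import Counter

# Toy illustration of bagging: "train" each model on a bootstrap sample
# (drawn with replacement), then aggregate predictions by majority vote.
# The toy "model" is a stand-in for a real decision tree.
random.seed(0)
labels = ["cat"] * 8 + ["dog"] * 2

def bootstrap_model(data):
    sample = [random.choice(data) for _ in data]   # sample with replacement
    return Counter(sample).most_common(1)[0][0]    # this model's prediction

votes = [bootstrap_model(labels) for _ in range(25)]
prediction = Counter(votes).most_common(1)[0][0]   # majority vote across models
print(prediction)
```

Boosting would instead build the models one after another, reweighting or refitting on the samples the previous models got wrong.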

Training decision trees

Training a decision tree for computer vision means selecting the right image features, encoding them properly, and handling any gaps in the data. The goal is to build a tree that captures meaningful visual patterns without memorizing noise.

Feature selection

  • Filter methods rank features using statistical measures like correlation or chi-square tests before training begins
  • Wrapper methods use search algorithms (e.g., recursive feature elimination) to find the best feature subset by evaluating model performance
  • Embedded methods perform feature selection during training itself, such as L1 regularization, which drives unimportant feature weights to zero
  • PCA (Principal Component Analysis) reduces dimensionality by transforming features into a smaller set of uncorrelated components
  • Mutual information measures how much knowing a feature's value tells you about the target variable

Handling categorical variables

Image features are often numerical (pixel intensities, edge magnitudes), but some features like color labels or texture categories are categorical. Common encoding strategies:

  • One-hot encoding creates a separate binary column for each category
  • Label encoding assigns a unique integer to each category (be cautious, since trees may interpret the ordering as meaningful)
  • Binary encoding represents categories as binary codes, reducing dimensionality compared to one-hot
  • Target encoding replaces each category with the mean target value for that category
  • Frequency encoding replaces categories with how often they appear in the dataset
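One-hot encoding is the simplest of these to show directly. A minimal sketch with hypothetical texture categories (libraries like pandas or scikit-learn provide this as `get_dummies` / `OneHotEncoder`):

```python
# Minimal one-hot encoding sketch: each category becomes its own 0/1 column.
# The category names here are hypothetical.
categories = ["smooth", "rough", "striped"]

def one_hot(value, categories):
    return [1 if value == c else 0 for c in categories]

encoded = [one_hot(v, categories) for v in ["rough", "smooth", "rough"]]
print(encoded)  # [[0, 1, 0], [1, 0, 0], [0, 1, 0]]
```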

Dealing with missing data

  • Deletion removes samples or features with missing values (simple but can lose useful data)
  • Mean/median/mode imputation fills missing values with the central tendency of that feature
  • KNN imputation estimates missing values using the values from the most similar samples
  • Multiple imputation generates several plausible filled-in datasets and averages results across them
  • Predictive models train a separate model to estimate missing values based on other features
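Mean imputation, the most common baseline, is a one-liner once you separate out the observed values. A sketch for a single feature column, using `None` to mark missing entries:

```python
# Sketch of mean imputation for one feature column (None marks missing values).
values = [0.2, None, 0.4, 0.6, None]

observed = [v for v in values if v is not None]
mean = sum(observed) / len(observed)            # mean of the observed values
imputed = [mean if v is None else v for v in values]
print(imputed)
```

scikit-learn's `SimpleImputer` applies the same idea column-by-column across a whole feature matrix.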

Random forest construction

Building a random forest involves choosing how many trees to grow, how to sample the data, and how much randomness to inject into each tree's construction.

Number of trees

More trees generally improve performance, but with diminishing returns. Going from 10 to 100 trees makes a big difference; going from 500 to 1000 usually doesn't. Typical ranges fall between 100 and 1000 trees. Use cross-validation to find the sweet spot for your dataset, keeping in mind the trade-off between accuracy and computational cost.



Bootstrap sampling

Each tree in the forest trains on a bootstrap sample: a dataset created by sampling with replacement from the original training set. This means some samples appear multiple times while others are left out entirely.

  • On average, about 63.2% of unique samples end up in each bootstrap sample
  • The remaining ~36.8% are called out-of-bag (OOB) samples and can serve as a built-in validation set
  • This process ensures each tree sees a slightly different version of the data, which drives the diversity that makes the ensemble effective
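The 63.2% figure is the limit of 1 - (1 - 1/n)^n, which approaches 1 - 1/e ≈ 0.632 as n grows. A quick empirical check:

```python
import random

# Empirical check of the ~63.2% figure: a bootstrap sample of size n drawn
# with replacement contains, on average, about 1 - 1/e of the unique originals.
random.seed(42)
n, trials = 1000, 200
fractions = []
for _ in range(trials):
    sample = [random.randrange(n) for _ in range(n)]   # one bootstrap sample
    fractions.append(len(set(sample)) / n)             # fraction of unique items

avg = sum(fractions) / trials
print(round(avg, 3))  # close to 0.632
```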

Feature randomness

At each split, the tree doesn't consider all available features. Instead, it randomly selects a subset and picks the best split from that subset only.

  • For classification tasks, the typical subset size is \sqrt{p}, where p is the total number of features
  • For regression tasks, the typical subset size is p/3
  • This forces trees to use different features in their splits, reducing correlation between trees and preventing any single dominant feature from appearing in every tree

Advantages and limitations

Decision trees vs random forests

| Aspect | Decision Trees | Random Forests |
| --- | --- | --- |
| Interpretability | High (easy to visualize and explain) | Low (hundreds of trees are hard to inspect) |
| Overfitting risk | High (especially deep trees) | Low (ensemble averaging smooths out noise) |
| Training speed | Fast | Slower (scales with number of trees) |
| High-dimensional data | Can struggle | Handles it well |
| Generalization | Weaker | Stronger |

Overfitting prevention

Random forests have several built-in defenses against overfitting:

  • Ensemble averaging smooths out the noise that individual trees pick up
  • Bagging ensures each tree trains on a different data subset, so they don't all memorize the same patterns
  • Feature randomness prevents any single feature from dominating across all trees
  • Pruning (applied to individual trees) controls complexity at the single-tree level
  • Cross-validation helps optimize hyperparameters to find the right balance

Computational complexity

  • Training time increases roughly linearly with the number of trees
  • Prediction time for a single sample scales with tree depth (logarithmic in the number of training samples for balanced trees), multiplied by the number of trees
  • Trees in a forest are independent, so training and prediction both parallelize well
  • Memory requirements grow with the number and depth of trees
  • Feature importance calculations add some overhead but are generally not the bottleneck

Hyperparameter tuning

Tuning hyperparameters is how you adapt decision trees and random forests to your specific dataset. The three most impactful parameters are tree depth, minimum samples per leaf, and the number of features considered at each split.

Tree depth

  • Controls the maximum number of levels from root to leaf
  • Deeper trees capture more complex patterns but risk overfitting to training noise
  • Shallower trees generalize better but may underfit if the data has complex structure
  • Use grid search, random search, or early stopping based on validation performance to find the right depth

Minimum samples per leaf

  • Sets the smallest number of training samples that can end up in a leaf node
  • Larger values (e.g., 10-50) force the tree to make broader generalizations, reducing overfitting
  • Smaller values (e.g., 1-5) let the tree learn fine-grained patterns, but those patterns may just be noise
  • Can be specified as a fixed count or a percentage of total samples

Number of features

  • Determines how many features are randomly selected as candidates at each split
  • Default is \sqrt{p} for classification and p/3 for regression, where p is the total feature count
  • Increasing this value gives each tree more information per split but reduces diversity across the forest
  • Decreasing it adds more randomness, which can help prevent overfitting but may hurt individual tree quality
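All three parameters above map directly onto scikit-learn arguments, so a small grid search ties them together. A sketch on synthetic data (the grid values are illustrative starting points, not recommendations):

```python
# Sketch of hyperparameter tuning for a random forest with a small grid search.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=12, random_state=0)

grid = {
    "max_depth": [3, None],          # tree depth (None = grow until pure)
    "min_samples_leaf": [1, 10],     # minimum samples per leaf
    "max_features": ["sqrt", 0.5],   # features considered at each split
}
search = GridSearchCV(RandomForestClassifier(n_estimators=50, random_state=0),
                      grid, cv=3).fit(X, y)
print(search.best_params_)
```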

Evaluation metrics

Accuracy and precision

  • Accuracy is the fraction of all predictions that are correct: \frac{\text{correct predictions}}{\text{total predictions}}
  • Precision is the fraction of positive predictions that are actually positive: \frac{TP}{TP + FP}
  • Recall (sensitivity) is the fraction of actual positives that are correctly identified: \frac{TP}{TP + FN}
  • F1-score is the harmonic mean of precision and recall, useful when you need to balance both: F1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
  • ROC-AUC measures classification performance across all decision thresholds, with 1.0 being perfect and 0.5 being random chance
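All four count-based metrics fall out of the confusion matrix. A worked example with hypothetical counts:

```python
# Computing the metrics above from raw confusion-matrix counts (hypothetical).
TP, FP, FN, TN = 40, 10, 20, 30

accuracy  = (TP + TN) / (TP + FP + FN + TN)         # 70 / 100 = 0.7
precision = TP / (TP + FP)                          # 40 / 50  = 0.8
recall    = TP / (TP + FN)                          # 40 / 60  ≈ 0.667
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean ≈ 0.727
print(accuracy, precision, round(recall, 3), round(f1, 3))
```

Note how precision and recall pull in different directions: this model misses a third of the actual positives (recall 0.667) even though 80% of its positive calls are correct.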

Gini impurity

Gini impurity tells you how "mixed" a node is. If a node contains only one class, its Gini impurity is 0 (perfectly pure). For binary classification, the maximum impurity is 0.5 (an even 50/50 split).

The formula is:

Gini = 1 - \sum_{i=1}^{c} p_i^2

where p_i is the proportion of samples belonging to class i, and c is the number of classes. For example, a node with 70% class A and 30% class B has Gini impurity of 1 - (0.7^2 + 0.3^2) = 1 - 0.58 = 0.42.
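The formula is short enough to implement and check directly:

```python
def gini(proportions):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1 - sum(p ** 2 for p in proportions)

print(gini([1.0]))                  # pure node -> 0.0
print(gini([0.5, 0.5]))             # binary maximum -> 0.5
print(round(gini([0.7, 0.3]), 2))   # the worked example -> 0.42
```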

Information gain

Information gain measures how much a split reduces uncertainty (entropy) in the data. The tree picks the split that maximizes information gain.

Entropy is defined as:

H = -\sum_{i=1}^{c} p_i \log_2(p_i)

Information gain is then:

IG = H(\text{parent}) - \sum_{j} \frac{N_j}{N} H(\text{child}_j)

where N_j is the number of samples in child node j and N is the total number of samples in the parent. Higher information gain means the split does a better job of separating classes.
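Both formulas translate directly into code. A perfect split of a 50/50 parent removes all uncertainty, giving a gain of exactly 1 bit:

```python
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    props = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in props)

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

parent = ["A", "A", "B", "B"]
print(information_gain(parent, [["A", "A"], ["B", "B"]]))  # 1.0
```

A useless split (each child still 50/50) would score 0, since the children's weighted entropy equals the parent's.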

Applications in computer vision

Image classification tasks

Decision trees and random forests can classify images by learning rules over extracted features (color histograms, texture descriptors, edge statistics, etc.):

  • Object/scene categorization into predefined classes
  • Texture classification for material recognition or surface analysis
  • Facial expression recognition for emotion detection systems
  • Medical image classification for tasks like tumor detection or disease diagnosis
  • Satellite image classification for land use and land cover mapping

Object detection

For object detection, tree-based models typically work on features extracted from candidate regions:

  • Locating and identifying multiple objects within a single image
  • Bounding box regression for precise object localization
  • Using feature importance to identify which visual cues matter most for detection
  • Hierarchical detection strategies that mirror the tree's own structure
  • Ensemble methods that improve detection accuracy by combining multiple models

Feature importance analysis

One of the biggest practical advantages of random forests is that they naturally rank features by importance:

  • You can identify which image features (color channels, texture measures, shape descriptors) contribute most to classification
  • This guides feature engineering: if edge features dominate, you might invest in better edge detection
  • Feature importance helps with dimensionality reduction by letting you drop uninformative features
  • It also aids in understanding what the model is actually looking at, which is valuable for debugging and building trust in the system
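In scikit-learn this ranking is one attribute away. A sketch on synthetic data, with hypothetical feature names standing in for real image descriptors:

```python
# Sketch of reading feature importances from a fitted random forest.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-ins for image features; the names are hypothetical.
names = ["edge_density", "mean_hue", "contrast", "saturation"]
X, y = make_classification(n_samples=300, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
ranked = sorted(zip(names, forest.feature_importances_),
                key=lambda t: t[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

The scores are mean-decrease-impurity importances and sum to 1, so they read directly as relative contributions.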

Visualization techniques

Tree structure representation

  • Node-link diagrams are the most common way to display a decision tree's hierarchical structure
  • Color-coding nodes by class probability or impurity makes it easy to spot where the tree is confident vs uncertain
  • Interactive visualizations let you expand and collapse subtrees to explore different levels
  • Pruned tree visualizations highlight only the most important decision paths
  • Sankey diagrams can represent the flow of samples through the tree, showing how data gets partitioned at each split

Feature importance plots

  • Bar charts ranking features by importance score are the standard visualization
  • Heat maps can show how feature importance varies across multiple trees in the forest
  • Scatter plots of feature importance vs. feature correlation help identify redundant features
  • Grouped bar charts comparing importance across different classes reveal which features matter for which categories

Decision boundaries

  • 2D scatter plots with overlaid decision boundaries work well when you're looking at two features at a time
  • Contour plots show predicted class probabilities across feature space
  • Partial dependence plots illustrate how changing one feature affects the model's prediction while averaging over all other features
  • These visualizations are especially useful for understanding how the model partitions feature space in image classification tasks

Advanced concepts

Extremely randomized trees

Extremely randomized trees (Extra-Trees) push the randomness further than standard random forests. Instead of searching for the optimal split threshold for each candidate feature, they pick split thresholds at random. This has several consequences:

  • Training is faster because there's no search for the best threshold
  • Variance is reduced even further compared to standard random forests
  • The trade-off is slightly higher bias, but in practice Extra-Trees often generalize well on computer vision tasks
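In scikit-learn, Extra-Trees are a drop-in replacement for random forests, sharing the same estimator API. A quick sketch on synthetic data:

```python
# Extra-Trees in scikit-learn: same API as RandomForestClassifier, but splits
# use randomly drawn thresholds instead of searching for the best one.
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
et = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(et.score(X, y), rf.score(X, y))  # training accuracy of each ensemble
```

Training-set accuracy alone doesn't show the variance reduction, of course; a fair comparison would use cross-validation on held-out data.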

Gradient boosting machines

Unlike random forests (which use bagging), gradient boosting builds trees sequentially:

  1. Train an initial tree on the data
  2. Compute the residual errors (the gap between predictions and true values)
  3. Train the next tree to predict those residuals
  4. Add the new tree's predictions to the running total
  5. Repeat until you reach the desired number of trees

Each tree corrects the mistakes of the ensemble so far. This typically produces stronger predictive models than random forests, but requires more careful tuning to avoid overfitting. Popular implementations include XGBoost, LightGBM, and CatBoost.
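The residual-fitting loop above can be hand-rolled in a few lines for squared-error regression. This sketch starts from a zero prediction rather than an initial tree, and uses depth-1 scikit-learn trees ("stumps") as the weak learners:

```python
# Hand-rolled sketch of the boosting loop: each stump fits the residuals of
# the ensemble so far (squared-error gradient boosting on a toy 1-D problem).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 1))
y = np.sin(4 * X[:, 0])                     # toy regression target

pred = np.zeros_like(y)                     # start from a zero prediction
lr = 0.1                                    # learning rate (shrinkage)
for _ in range(100):
    residuals = y - pred                    # step 2: current errors
    stump = DecisionTreeRegressor(max_depth=1).fit(X, residuals)  # step 3
    pred += lr * stump.predict(X)           # step 4: add scaled correction

print(round(float(np.mean((y - pred) ** 2)), 4))  # training MSE shrinks
```

Production libraries add regularization, subsampling, and second-order gradient information on top of this basic loop.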

Random forest vs deep learning

| Aspect | Random Forests | Deep Learning |
| --- | --- | --- |
| Dataset size | Work well on small to medium datasets | Need large datasets to shine |
| Feature engineering | Require hand-crafted features | Learn features automatically from raw pixels |
| Interpretability | Feature importance is straightforward | Often treated as a black box |
| Overfitting on small data | More resistant | More prone |
| Large-scale image recognition | Competitive but limited | State of the art |

Hybrid approaches exist that use deep learning to extract features and then feed those features into a random forest for the final classification, combining the representation power of neural networks with the robustness of ensemble methods.