Decision trees and random forests are supervised learning tools that break down complex visual data into a series of simple decision rules. In computer vision, they're used for tasks like image classification and object detection because they're interpretable, efficient, and handle high-dimensional feature data well.
Random forests take the decision tree concept further by combining many trees into an ensemble. This reduces overfitting and captures more intricate patterns in image features than any single tree could manage on its own.
Decision tree fundamentals
A decision tree works by splitting data through a series of yes/no questions about feature values. Each question narrows down the possibilities until you reach a final prediction. Think of it as a flowchart: you start at the top, answer questions about your data at each step, and follow the appropriate branch until you land on a classification.
Structure of decision trees
- Root node represents the entire dataset and asks the first splitting question
- Internal nodes correspond to feature-based decisions that split the data further
- Leaf nodes contain the final predictions or class labels
- Branches connect nodes, representing the possible outcomes of each decision
- Tree depth (the number of levels from root to the deepest leaf) controls how complex and granular the decision-making process gets
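The flowchart analogy can be sketched directly in code. Below is a hypothetical toy tree (nested dictionaries, not a trained model) and the traversal loop that follows branches from root to leaf:

```python
# Toy decision tree over a two-feature descriptor, stored as nested dicts.
# Internal nodes hold a feature index and threshold; leaves hold a class label.
# (Illustrative structure only, not a trained model.)
tree = {
    "feature": 0, "threshold": 0.5,   # root node: first splitting question
    "left": {"label": "background"},  # leaf node
    "right": {                        # internal node
        "feature": 1, "threshold": 0.3,
        "left": {"label": "texture"},
        "right": {"label": "object"},
    },
}

def classify(node, x):
    """Answer each node's question and follow the branch until a leaf."""
    while "label" not in node:
        branch = "left" if x[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]

print(classify(tree, [0.2, 0.9]))  # feature 0 <= 0.5, so the left leaf
print(classify(tree, [0.8, 0.1]))  # right branch, then feature 1 <= 0.3
```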
Splitting criteria
At each internal node, the tree needs to decide which feature and threshold to split on. Several criteria can guide that choice:
- Gini impurity measures the probability of misclassifying a randomly chosen element from the node. Lower is better.
- Information gain quantifies how much entropy (disorder) decreases after a split. Higher gain means a more informative split.
- Gain ratio normalizes information gain to avoid bias toward features with many possible values.
- Chi-square test evaluates whether the relationship between a feature and the target variable is statistically significant.
- Mean decrease impurity ranks feature importance by measuring the total reduction in node impurity a feature provides across the tree.
Pruning techniques
Without pruning, a decision tree will keep splitting until every leaf is perfectly pure, which almost always means overfitting. Pruning cuts back the tree to improve generalization.
- Pre-pruning stops tree growth early by setting constraints during training, such as limiting maximum depth, requiring a minimum number of samples per leaf, or setting a minimum impurity decrease for a split to occur.
- Post-pruning grows the full tree first, then removes branches that don't meaningfully improve performance.
- Cost complexity pruning balances tree size against classification error using a complexity parameter.
- Reduced error pruning replaces subtrees with leaf nodes whenever doing so doesn't decrease accuracy on a validation set.
- Minimal cost-complexity pruning searches for the subtree that achieves the lowest cost-complexity measure, giving you the best trade-off between simplicity and accuracy.
Random forest overview
A random forest is an ensemble of many decision trees, each trained slightly differently. The forest aggregates their individual predictions (by majority vote for classification, or averaging for regression) to produce a final output that's more accurate and stable than any single tree.
Ensemble learning principles
The core idea is "wisdom of the crowd": many imperfect models, when combined, often outperform a single highly tuned model. For this to work, the individual trees need to be diverse, meaning they make different errors. Random forests achieve diversity through two mechanisms: bootstrapping the training data and randomly restricting which features each split can consider.
- Bootstrapping creates a different training subset for each tree by sampling with replacement
- Aggregation combines predictions from all trees into one final output
- Parallel processing is straightforward since each tree trains independently of the others
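The two mechanisms above can be sketched in a few lines. Here each "tree" is a stand-in that just predicts the majority label of its own bootstrap sample; a real forest would fit a decision tree to each sample instead:

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) samples with replacement (one per-tree training set)."""
    return [rng.choice(data) for _ in data]

def majority_vote(predictions):
    """Combine per-tree class predictions into the ensemble's output."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(42)
labels = ["cat", "cat", "cat", "dog", "dog"]

# Stand-in "trees": each predicts the majority label of its bootstrap sample.
tree_predictions = [
    majority_vote(bootstrap_sample(labels, rng)) for _ in range(7)
]
print(majority_vote(tree_predictions))
```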
Bagging vs boosting
These are two distinct ensemble strategies, and it's worth understanding how they differ:
- Bagging (Bootstrap Aggregating) builds trees independently and in parallel. Each tree trains on a random subset of the data sampled with replacement. This primarily reduces variance and helps prevent overfitting. Random forests use bagging.
- Boosting builds trees sequentially, where each new tree focuses on correcting the mistakes of the previous ones. Misclassified samples receive higher weights in subsequent iterations. This primarily reduces bias. Gradient boosting machines (GBMs) use this approach.
The key distinction: bagging keeps all samples equally weighted, while boosting adjusts weights to focus on hard-to-classify examples.
Training decision trees
Training a decision tree for computer vision means selecting the right image features, encoding them properly, and handling any gaps in the data. The goal is to build a tree that captures meaningful visual patterns without memorizing noise.
Feature selection
- Filter methods rank features using statistical measures like correlation or chi-square tests before training begins
- Wrapper methods use search algorithms (e.g., recursive feature elimination) to find the best feature subset by evaluating model performance
- Embedded methods perform feature selection during training itself, such as L1 regularization, which drives unimportant feature weights to zero
- PCA (Principal Component Analysis) reduces dimensionality by transforming features into a smaller set of uncorrelated components
- Mutual information measures how much knowing a feature's value tells you about the target variable
Handling categorical variables
Image features are often numerical (pixel intensities, edge magnitudes), but some features like color labels or texture categories are categorical. Common encoding strategies:
- One-hot encoding creates a separate binary column for each category
- Label encoding assigns a unique integer to each category (be cautious, since trees may interpret the ordering as meaningful)
- Binary encoding represents categories as binary codes, reducing dimensionality compared to one-hot
- Target encoding replaces each category with the mean target value for that category
- Frequency encoding replaces categories with how often they appear in the dataset
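Two of these strategies are easy to sketch in plain Python (the color labels below are hypothetical examples):

```python
from collections import Counter

def one_hot(values):
    """One binary column per category, columns in sorted category order."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

def frequency_encode(values):
    """Replace each category with how often it appears in the dataset."""
    counts = Counter(values)
    return [counts[v] for v in values]

colors = ["red", "blue", "red", "green"]
print(one_hot(colors))           # columns ordered: blue, green, red
print(frequency_encode(colors))  # "red" appears twice, the others once
```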
Dealing with missing data
- Deletion removes samples or features with missing values (simple but can lose useful data)
- Mean/median/mode imputation fills missing values with the central tendency of that feature
- KNN imputation estimates missing values using the values from the most similar samples
- Multiple imputation generates several plausible filled-in datasets and averages results across them
- Predictive models train a separate model to estimate missing values based on other features
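A minimal sketch of mean/median imputation, assuming missing entries are represented as `None` (the feature values are made-up):

```python
from statistics import mean, median

def impute(column, strategy="mean"):
    """Fill None entries with the mean or median of the observed values."""
    observed = [v for v in column if v is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in column]

edge_strength = [0.2, None, 0.4, 0.6, None]
print(impute(edge_strength))            # gaps filled with the mean (~0.4)
print(impute(edge_strength, "median"))  # gaps filled with the median
```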
Random forest construction
Building a random forest involves choosing how many trees to grow, how to sample the data, and how much randomness to inject into each tree's construction.
Number of trees
More trees generally improve performance, but with diminishing returns. Going from 10 to 100 trees makes a big difference; going from 500 to 1000 usually doesn't. Typical ranges fall between 100 and 1000 trees. Use cross-validation to find the sweet spot for your dataset, keeping in mind the trade-off between accuracy and computational cost.

Bootstrap sampling
Each tree in the forest trains on a bootstrap sample: a dataset created by sampling with replacement from the original training set. This means some samples appear multiple times while others are left out entirely.
- On average, about 63.2% of unique samples end up in each bootstrap sample
- The remaining ~36.8% are called out-of-bag (OOB) samples and can serve as a built-in validation set
- This process ensures each tree sees a slightly different version of the data, which drives the diversity that makes the ensemble effective
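The 63.2%/36.8% figures come from the limit (1 − 1/n)ⁿ → 1/e ≈ 0.368 for the probability that a given sample is never drawn; a quick simulation (illustrative only) reproduces it:

```python
import random

def oob_fraction(n, rng):
    """Fraction of n original samples that one bootstrap sample misses."""
    drawn = {rng.randrange(n) for _ in range(n)}  # indices actually drawn
    return 1 - len(drawn) / n

rng = random.Random(0)
n = 10_000
avg = sum(oob_fraction(n, rng) for _ in range(20)) / 20
print(f"average OOB fraction: {avg:.3f}")  # close to 1/e ≈ 0.368
```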
Feature randomness
At each split, the tree doesn't consider all available features. Instead, it randomly selects a subset and picks the best split from that subset only.
- For classification tasks, the typical subset size is √p, where p is the total number of features
- For regression tasks, the typical subset size is p/3
- This forces trees to use different features in their splits, reducing correlation between trees and preventing any single dominant feature from appearing in every tree
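A sketch of how the candidate subset might be drawn at each split, using the √p and p/3 heuristics (a simplified illustration, not a specific library's implementation):

```python
import math
import random

def candidate_features(p, task, rng):
    """Randomly choose the feature subset considered at one split."""
    # sqrt(p) for classification, p/3 for regression; always at least 1.
    k = max(1, round(math.sqrt(p)) if task == "classification" else p // 3)
    return rng.sample(range(p), k)

rng = random.Random(0)
print(candidate_features(16, "classification", rng))  # 4 of 16 features
print(candidate_features(16, "regression", rng))      # 5 of 16 features
```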
Advantages and limitations
Decision trees vs random forests
| Aspect | Decision Trees | Random Forests |
|---|---|---|
| Interpretability | High (easy to visualize and explain) | Low (hundreds of trees are hard to inspect) |
| Overfitting risk | High (especially deep trees) | Low (ensemble averaging smooths out noise) |
| Training speed | Fast | Slower (scales with number of trees) |
| High-dimensional data | Can struggle | Handles it well |
| Generalization | Weaker | Stronger |
Overfitting prevention
Random forests have several built-in defenses against overfitting:
- Ensemble averaging smooths out the noise that individual trees pick up
- Bagging ensures each tree trains on a different data subset, so they don't all memorize the same patterns
- Feature randomness prevents any single feature from dominating across all trees
- Pruning (applied to individual trees) controls complexity at the single-tree level
- Cross-validation helps optimize hyperparameters to find the right balance
Computational complexity
- Training time increases roughly linearly with the number of trees
- Prediction time for a single sample scales with tree depth (logarithmic in the number of training samples for balanced trees), multiplied by the number of trees
- Trees in a forest are independent, so training and prediction both parallelize well
- Memory requirements grow with the number and depth of trees
- Feature importance calculations add some overhead but are generally not the bottleneck
Hyperparameter tuning
Tuning hyperparameters is how you adapt decision trees and random forests to your specific dataset. The three most impactful parameters are tree depth, minimum samples per leaf, and the number of features considered at each split.
Tree depth
- Controls the maximum number of levels from root to leaf
- Deeper trees capture more complex patterns but risk overfitting to training noise
- Shallower trees generalize better but may underfit if the data has complex structure
- Use grid search, random search, or early stopping based on validation performance to find the right depth
Minimum samples per leaf
- Sets the smallest number of training samples that can end up in a leaf node
- Larger values (e.g., 10-50) force the tree to make broader generalizations, reducing overfitting
- Smaller values (e.g., 1-5) let the tree learn fine-grained patterns, but those patterns may just be noise
- Can be specified as a fixed count or a percentage of total samples
Number of features
- Determines how many features are randomly selected as candidates at each split
- Default is √p for classification and p/3 for regression, where p is the total feature count
- Increasing this value gives each tree more information per split but reduces diversity across the forest
- Decreasing it adds more randomness, which can help prevent overfitting but may hurt individual tree quality
Evaluation metrics
Accuracy and precision
- Accuracy is the fraction of all predictions that are correct: (TP + TN) / (TP + TN + FP + FN)
- Precision is the fraction of positive predictions that are actually positive: TP / (TP + FP)
- Recall (sensitivity) is the fraction of actual positives that are correctly identified: TP / (TP + FN)
- F1-score is the harmonic mean of precision and recall, useful when you need to balance both: 2 × (precision × recall) / (precision + recall)
- ROC-AUC measures classification performance across all decision thresholds, with 1.0 being perfect and 0.5 being random chance
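These definitions translate directly into code; the confusion counts below are hypothetical:

```python
def metrics(tp, tn, fp, fn):
    """Compute the standard classification metrics from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical confusion counts: 80 true positives, 90 true negatives,
# 10 false positives, 20 false negatives.
acc, prec, rec, f1 = metrics(tp=80, tn=90, fp=10, fn=20)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
```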

Gini impurity
Gini impurity tells you how "mixed" a node is. If a node contains only one class, its Gini impurity is 0 (perfectly pure). For binary classification, the maximum impurity is 0.5 (an even 50/50 split).
The formula is Gini = 1 − Σ pᵢ², where pᵢ is the proportion of samples belonging to class i and the sum runs over all C classes. For example, a node with 70% class A and 30% class B has Gini impurity of 1 − (0.7² + 0.3²) = 0.42.
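The formula and the worked example translate directly:

```python
def gini(proportions):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1 - sum(p * p for p in proportions)

print(gini([0.7, 0.3]))  # 1 - (0.49 + 0.09) ≈ 0.42
print(gini([1.0]))       # pure node: impurity 0
print(gini([0.5, 0.5]))  # maximum for binary classification: 0.5
```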
Information gain
Information gain measures how much a split reduces uncertainty (entropy) in the data. The tree picks the split that maximizes information gain.
Entropy is defined as H = −Σ pᵢ log₂ pᵢ, where pᵢ is the proportion of samples in class i. Information gain is then:
IG = H(parent) − Σⱼ (nⱼ / n) H(childⱼ)
where nⱼ is the number of samples in child node j and n is the total samples in the parent. Higher information gain means the split does a better job of separating classes.
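Both definitions can be checked with a few lines of code (the parent and child label lists are made-up examples):

```python
from math import log2

def entropy(labels):
    """Shannon entropy of a class-label list, in bits."""
    n = len(labels)
    props = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in props)

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

parent = ["A"] * 5 + ["B"] * 5                   # 50/50 split: entropy 1 bit
imperfect = [["A"] * 4 + ["B"], ["A"] + ["B"] * 4]
perfect = [["A"] * 5, ["B"] * 5]
print(information_gain(parent, imperfect))       # partial separation
print(information_gain(parent, perfect))         # full bit of gain
```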
Applications in computer vision
Image classification tasks
Decision trees and random forests can classify images by learning rules over extracted features (color histograms, texture descriptors, edge statistics, etc.):
- Object/scene categorization into predefined classes
- Texture classification for material recognition or surface analysis
- Facial expression recognition for emotion detection systems
- Medical image classification for tasks like tumor detection or disease diagnosis
- Satellite image classification for land use and land cover mapping
Object detection
For object detection, tree-based models typically work on features extracted from candidate regions:
- Locating and identifying multiple objects within a single image
- Bounding box regression for precise object localization
- Using feature importance to identify which visual cues matter most for detection
- Hierarchical detection strategies that mirror the tree's own structure
- Ensemble methods that improve detection accuracy by combining multiple models
Feature importance analysis
One of the biggest practical advantages of random forests is that they naturally rank features by importance:
- You can identify which image features (color channels, texture measures, shape descriptors) contribute most to classification
- This guides feature engineering: if edge features dominate, you might invest in better edge detection
- Feature importance helps with dimensionality reduction by letting you drop uninformative features
- It also aids in understanding what the model is actually looking at, which is valuable for debugging and building trust in the system
Visualization techniques
Tree structure representation
- Node-link diagrams are the most common way to display a decision tree's hierarchical structure
- Color-coding nodes by class probability or impurity makes it easy to spot where the tree is confident vs uncertain
- Interactive visualizations let you expand and collapse subtrees to explore different levels
- Pruned tree visualizations highlight only the most important decision paths
- Sankey diagrams can represent the flow of samples through the tree, showing how data gets partitioned at each split
Feature importance plots
- Bar charts ranking features by importance score are the standard visualization
- Heat maps can show how feature importance varies across multiple trees in the forest
- Scatter plots of feature importance vs. feature correlation help identify redundant features
- Grouped bar charts comparing importance across different classes reveal which features matter for which categories
Decision boundaries
- 2D scatter plots with overlaid decision boundaries work well when you're looking at two features at a time
- Contour plots show predicted class probabilities across feature space
- Partial dependence plots illustrate how changing one feature affects the model's prediction while averaging over all other features
- These visualizations are especially useful for understanding how the model partitions feature space in image classification tasks
Advanced concepts
Extremely randomized trees
Extremely randomized trees (Extra-Trees) push the randomness further than standard random forests. Instead of searching for the optimal split threshold for each candidate feature, they pick split thresholds at random. This has two effects:
- Training is faster because there's no search for the best threshold
- Variance is reduced even further compared to standard random forests
- The trade-off is slightly higher bias, but in practice Extra-Trees often generalize well on computer vision tasks
Gradient boosting machines
Unlike random forests (which use bagging), gradient boosting builds trees sequentially:
- Train an initial tree on the data
- Compute the residual errors (the gap between predictions and true values)
- Train the next tree to predict those residuals
- Add the new tree's predictions to the running total
- Repeat until you reach the desired number of trees
Each tree corrects the mistakes of the ensemble so far. This typically produces stronger predictive models than random forests, but requires more careful tuning to avoid overfitting. Popular implementations include XGBoost, LightGBM, and CatBoost.
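The steps above can be sketched for a toy 1-D regression problem, using depth-1 "stumps" as the weak learners (a simplified illustration of the residual-fitting loop, not how XGBoost or LightGBM are implemented):

```python
def fit_stump(xs, residuals):
    """Find the threshold split on x minimizing squared error when each
    side predicts its mean residual; return the stump as a function."""
    best = None
    for t in xs:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x, t=t, lm=lm, rm=rm: lm if x <= t else rm

def boost(xs, ys, n_trees=50, lr=0.1):
    """Sequentially fit stumps to the residuals of the running prediction."""
    preds = [0.0] * len(xs)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]  # remaining errors
        stump = fit_stump(xs, residuals)                # next corrective tree
        stumps.append(stump)
        preds = [p + lr * stump(x) for p, x in zip(preds, xs)]
    return lambda x: sum(lr * s(x) for s in stumps)

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.0, 0.1, 0.2, 1.8, 2.0, 2.1]   # a step-like target
model = boost(xs, ys)
print([round(model(x), 2) for x in xs])
```

Each added stump nudges the ensemble toward the remaining errors, so after enough rounds the predictions closely track the step in the data.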
Random forest vs deep learning
| Aspect | Random Forests | Deep Learning |
|---|---|---|
| Dataset size | Work well on small to medium datasets | Need large datasets to shine |
| Feature engineering | Require hand-crafted features | Learn features automatically from raw pixels |
| Interpretability | Feature importance is straightforward | Often treated as a black box |
| Overfitting on small data | More resistant | More prone |
| Large-scale image recognition | Competitive but limited | State of the art |
Hybrid approaches exist that use deep learning to extract features and then feed those features into a random forest for the final classification, combining the representation power of neural networks with the robustness of ensemble methods.