Decision trees and random forests are supervised learning tools that break down complex visual data into a series of simple decision rules. In computer vision, they're used for tasks like image classification and object detection because they're interpretable, efficient, and handle high-dimensional feature data well.
Random forests take the decision tree concept further by combining many trees into an ensemble. This reduces overfitting and captures more intricate patterns in image features than any single tree could manage on its own.
Decision tree fundamentals
A decision tree works by splitting data through a series of yes/no questions about feature values. Each question narrows down the possibilities until you reach a final prediction. Think of it as a flowchart: you start at the top, answer questions about your data at each step, and follow the appropriate branch until you land on a classification.
Structure of decision trees
- Root node represents the entire dataset and asks the first splitting question
- Internal nodes correspond to feature-based decisions that split the data further
- Leaf nodes contain the final predictions or class labels
- Branches connect nodes, representing the possible outcomes of each decision
- Tree depth (the number of levels from root to the deepest leaf) controls how complex and granular the decision-making process gets
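The flowchart analogy can be sketched directly in code. Below is a hypothetical toy tree (nested dictionaries, not a trained model) and the traversal loop that follows branches from root to leaf:

```python
# Toy decision tree over a two-feature descriptor, stored as nested dicts.
# Internal nodes hold a feature index and threshold; leaves hold a class label.
# (Illustrative structure only, not a trained model.)
tree = {
    "feature": 0, "threshold": 0.5,   # root node: first splitting question
    "left": {"label": "background"},  # leaf node
    "right": {                        # internal node
        "feature": 1, "threshold": 0.3,
        "left": {"label": "texture"},
        "right": {"label": "object"},
    },
}

def classify(node, x):
    """Answer each node's question and follow the branch until a leaf."""
    while "label" not in node:
        branch = "left" if x[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]

print(classify(tree, [0.2, 0.9]))  # feature 0 <= 0.5, so the left leaf
print(classify(tree, [0.8, 0.1]))  # right branch, then feature 1 <= 0.3
```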
Splitting criteria
At each internal node, the tree needs to decide which feature and threshold to split on. Several criteria can guide that choice:
- Gini impurity measures the probability of misclassifying a randomly chosen element from the node. Lower is better.
- Information gain quantifies how much entropy (disorder) decreases after a split. Higher gain means a more informative split.
- Gain ratio normalizes information gain to avoid bias toward features with many possible values.
- Chi-square test evaluates whether the relationship between a feature and the target variable is statistically significant.
- Mean decrease impurity ranks feature importance by measuring the total reduction in node impurity a feature provides across the tree.
Pruning techniques
Without pruning, a decision tree will keep splitting until every leaf is perfectly pure, which almost always means overfitting. Pruning cuts back the tree to improve generalization.
- Pre-pruning stops tree growth early by setting constraints during training, such as limiting maximum depth, requiring a minimum number of samples per leaf, or setting a minimum impurity decrease for a split to occur.
- Post-pruning grows the full tree first, then removes branches that don't meaningfully improve performance.
- Cost complexity pruning balances tree size against classification error using a complexity parameter.
- Reduced error pruning replaces subtrees with leaf nodes whenever doing so doesn't decrease accuracy on a validation set.
- Minimal cost-complexity pruning searches for the subtree that achieves the lowest cost-complexity measure, giving you the best trade-off between simplicity and accuracy.
Random forest overview
A random forest is an ensemble of many decision trees, each trained slightly differently. The forest aggregates their individual predictions (by majority vote for classification, or averaging for regression) to produce a final output that's more accurate and stable than any single tree.
Ensemble learning principles
The core idea is "wisdom of the crowd": many imperfect models, when combined, often outperform a single highly tuned model. For this to work, the individual trees need to be diverse, meaning they make different errors. Random forests achieve diversity through two mechanisms: bootstrapping the training data and randomly restricting which features each split can consider.
- Bootstrapping creates a different training subset for each tree by sampling with replacement
- Aggregation combines predictions from all trees into one final output
- Parallel processing is straightforward since each tree trains independently of the others
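The two mechanisms above can be sketched in a few lines. Here each "tree" is a stand-in that just predicts the majority label of its own bootstrap sample; a real forest would fit a decision tree to each sample instead:

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) samples with replacement (one per-tree training set)."""
    return [rng.choice(data) for _ in data]

def majority_vote(predictions):
    """Combine per-tree class predictions into the ensemble's output."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(42)
labels = ["cat", "cat", "cat", "dog", "dog"]

# Stand-in "trees": each predicts the majority label of its bootstrap sample.
tree_predictions = [
    majority_vote(bootstrap_sample(labels, rng)) for _ in range(7)
]
print(majority_vote(tree_predictions))
```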
Bagging vs boosting
These are two distinct ensemble strategies, and it's worth understanding how they differ:
- Bagging (Bootstrap Aggregating) builds trees independently and in parallel. Each tree trains on a random subset of the data sampled with replacement. This primarily reduces variance and helps prevent overfitting. Random forests use bagging.
- Boosting builds trees sequentially, where each new tree focuses on correcting the mistakes of the previous ones. Misclassified samples receive higher weights in subsequent iterations. This primarily reduces bias. Gradient boosting machines (GBMs) use this approach.
The key distinction: bagging keeps all samples equally weighted, while boosting adjusts weights to focus on hard-to-classify examples.
Training decision trees
Training a decision tree for computer vision means selecting the right image features, encoding them properly, and handling any gaps in the data. The goal is to build a tree that captures meaningful visual patterns without memorizing noise.
Feature selection
- Filter methods rank features using statistical measures like correlation or chi-square tests before training begins
- Wrapper methods use search algorithms (e.g., recursive feature elimination) to find the best feature subset by evaluating model performance
- Embedded methods perform feature selection during training itself, such as L1 regularization, which drives unimportant feature weights to zero
- PCA (Principal Component Analysis) reduces dimensionality by transforming features into a smaller set of uncorrelated components
- Mutual information measures how much knowing a feature's value tells you about the target variable
Handling categorical variables
Image features are often numerical (pixel intensities, edge magnitudes), but some features like color labels or texture categories are categorical. Common encoding strategies:
- One-hot encoding creates a separate binary column for each category
- Label encoding assigns a unique integer to each category (be cautious, since trees may interpret the ordering as meaningful)
- Binary encoding represents categories as binary codes, reducing dimensionality compared to one-hot
- Target encoding replaces each category with the mean target value for that category
- Frequency encoding replaces categories with how often they appear in the dataset
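Two of these strategies are easy to sketch in plain Python (the color labels below are hypothetical examples):

```python
from collections import Counter

def one_hot(values):
    """One binary column per category, columns in sorted category order."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

def frequency_encode(values):
    """Replace each category with how often it appears in the dataset."""
    counts = Counter(values)
    return [counts[v] for v in values]

colors = ["red", "blue", "red", "green"]
print(one_hot(colors))           # columns ordered: blue, green, red
print(frequency_encode(colors))  # "red" appears twice, the others once
```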
Dealing with missing data
- Deletion removes samples or features with missing values (simple but can lose useful data)
- Mean/median/mode imputation fills missing values with the central tendency of that feature
- KNN imputation estimates missing values using the values from the most similar samples
- Multiple imputation generates several plausible filled-in datasets and averages results across them
- Predictive models train a separate model to estimate missing values based on other features
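A minimal sketch of mean/median imputation, assuming missing entries are represented as `None` (the feature values are made-up):

```python
from statistics import mean, median

def impute(column, strategy="mean"):
    """Fill None entries with the mean or median of the observed values."""
    observed = [v for v in column if v is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in column]

edge_strength = [0.2, None, 0.4, 0.6, None]
print(impute(edge_strength))            # gaps filled with the mean (~0.4)
print(impute(edge_strength, "median"))  # gaps filled with the median
```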
Random forest construction
Building a random forest involves choosing how many trees to grow, how to sample the data, and how much randomness to inject into each tree's construction.
Number of trees
More trees generally improve performance, but with diminishing returns. Going from 10 to 100 trees makes a big difference; going from 500 to 1000 usually doesn't. Typical ranges fall between 100 and 1000 trees. Use cross-validation to find the sweet spot for your dataset, keeping in mind the trade-off between accuracy and computational cost.

Bootstrap sampling
Each tree in the forest trains on a bootstrap sample: a dataset created by sampling with replacement from the original training set. This means some samples appear multiple times while others are left out entirely.
- On average, about 63.2% of unique samples end up in each bootstrap sample
- The remaining ~36.8% are called out-of-bag (OOB) samples and can serve as a built-in validation set
- This process ensures each tree sees a slightly different version of the data, which drives the diversity that makes the ensemble effective
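The 63.2%/36.8% figures come from the limit (1 − 1/n)ⁿ → 1/e ≈ 0.368 for the probability that a given sample is never drawn; a quick simulation (illustrative only) reproduces it:

```python
import random

def oob_fraction(n, rng):
    """Fraction of n original samples that one bootstrap sample misses."""
    drawn = {rng.randrange(n) for _ in range(n)}  # indices actually drawn
    return 1 - len(drawn) / n

rng = random.Random(0)
n = 10_000
avg = sum(oob_fraction(n, rng) for _ in range(20)) / 20
print(f"average OOB fraction: {avg:.3f}")  # close to 1/e ≈ 0.368
```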
Feature randomness
At each split, the tree doesn't consider all available features. Instead, it randomly selects a subset and picks the best split from that subset only.
- For classification tasks, the typical subset size is √p, where p is the total number of features
- For regression tasks, the typical subset size is p/3
- This forces trees to use different features in their splits, reducing correlation between trees and preventing any single dominant feature from appearing in every tree
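A sketch of how the candidate subset might be drawn at each split, using the √p and p/3 heuristics (a simplified illustration, not a specific library's implementation):

```python
import math
import random

def candidate_features(p, task, rng):
    """Randomly choose the feature subset considered at one split."""
    # sqrt(p) for classification, p/3 for regression; always at least 1.
    k = max(1, round(math.sqrt(p)) if task == "classification" else p // 3)
    return rng.sample(range(p), k)

rng = random.Random(0)
print(candidate_features(16, "classification", rng))  # 4 of 16 features
print(candidate_features(16, "regression", rng))      # 5 of 16 features
```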
Advantages and limitations
Decision trees vs random forests
| Aspect | Decision Trees | Random Forests |
|---|---|---|
| Interpretability | High (easy to visualize and explain) | Low (hundreds of trees are hard to inspect) |
| Overfitting risk | High (especially deep trees) | Low (ensemble averaging smooths out noise) |
| Training speed | Fast | Slower (scales with number of trees) |
| High-dimensional data | Can struggle | Handles it well |
| Generalization | Weaker | Stronger |
Overfitting prevention
Random forests have several built-in defenses against overfitting:
- Ensemble averaging smooths out the noise that individual trees pick up
- Bagging ensures each tree trains on a different data subset, so they don't all memorize the same patterns
- Feature randomness prevents any single feature from dominating across all trees
- Pruning (applied to individual trees) controls complexity at the single-tree level
- Cross-validation helps optimize hyperparameters to find the right balance
Computational complexity
- Training time increases roughly linearly with the number of trees
- Prediction time for a single sample scales with tree depth (logarithmic in the number of training samples for balanced trees), multiplied by the number of trees
- Trees in a forest are independent, so training and prediction both parallelize well
- Memory requirements grow with the number and depth of trees
- Feature importance calculations add some overhead but are generally not the bottleneck
Hyperparameter tuning
Tuning hyperparameters is how you adapt decision trees and random forests to your specific dataset. The three most impactful parameters are tree depth, minimum samples per leaf, and the number of features considered at each split.
Tree depth
- Controls the maximum number of levels from root to leaf
- Deeper trees capture more complex patterns but risk overfitting to training noise
- Shallower trees generalize better but may underfit if the data has complex structure
- Use grid search, random search, or early stopping based on validation performance to find the right depth
Minimum samples per leaf
- Sets the smallest number of training samples that can end up in a leaf node
- Larger values (e.g., 10-50) force the tree to make broader generalizations, reducing overfitting
- Smaller values (e.g., 1-5) let the tree learn fine-grained patterns, but those patterns may just be noise
- Can be specified as a fixed count or a percentage of total samples
Number of features
- Determines how many features are randomly selected as candidates at each split
- Default is √p for classification and p/3 for regression, where p is the total feature count
- Increasing this value gives each tree more information per split but reduces diversity across the forest
- Decreasing it adds more randomness, which can help prevent overfitting but may hurt individual tree quality
Evaluation metrics
Accuracy and precision
- Accuracy is the fraction of all predictions that are correct: (TP + TN) / (TP + TN + FP + FN)
- Precision is the fraction of positive predictions that are actually positive: TP / (TP + FP)
- Recall (sensitivity) is the fraction of actual positives that are correctly identified: TP / (TP + FN)
- F1-score is the harmonic mean of precision and recall, useful when you need to balance both: 2 × (precision × recall) / (precision + recall)
- ROC-AUC measures classification performance across all decision thresholds, with 1.0 being perfect and 0.5 being random chance
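These definitions translate directly into code; the confusion counts below are hypothetical:

```python
def metrics(tp, tn, fp, fn):
    """Compute the standard classification metrics from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical confusion counts: 80 true positives, 90 true negatives,
# 10 false positives, 20 false negatives.
acc, prec, rec, f1 = metrics(tp=80, tn=90, fp=10, fn=20)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
```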

Gini impurity
Gini impurity tells you how "mixed" a node is. If a node contains only one class, its Gini impurity is 0 (perfectly pure). For binary classification, the maximum impurity is 0.5 (an even 50/50 split).
The formula is Gini = 1 − Σ pᵢ², where pᵢ is the proportion of samples belonging to class i and the sum runs over all C classes. For example, a node with 70% class A and 30% class B has Gini impurity of 1 − (0.7² + 0.3²) = 0.42.
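The formula and the worked example translate directly:

```python
def gini(proportions):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1 - sum(p * p for p in proportions)

print(gini([0.7, 0.3]))  # 1 - (0.49 + 0.09) ≈ 0.42
print(gini([1.0]))       # pure node: impurity 0
print(gini([0.5, 0.5]))  # maximum for binary classification: 0.5
```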
Information gain
Information gain measures how much a split reduces uncertainty (entropy) in the data. The tree picks the split that maximizes information gain.
Entropy is defined as H = −Σ pᵢ log₂ pᵢ, where pᵢ is the proportion of samples in class i. Information gain is then:
IG = H(parent) − Σⱼ (nⱼ / n) H(childⱼ)
where nⱼ is the number of samples in child node j and n is the total samples in the parent. Higher information gain means the split does a better job of separating classes.
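Both definitions can be checked with a few lines of code (the parent and child label lists are made-up examples):

```python
from math import log2

def entropy(labels):
    """Shannon entropy of a class-label list, in bits."""
    n = len(labels)
    props = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in props)

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

parent = ["A"] * 5 + ["B"] * 5                   # 50/50 split: entropy 1 bit
imperfect = [["A"] * 4 + ["B"], ["A"] + ["B"] * 4]
perfect = [["A"] * 5, ["B"] * 5]
print(information_gain(parent, imperfect))       # partial separation
print(information_gain(parent, perfect))         # full bit of gain
```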
Applications in computer vision
Image classification tasks
Decision trees and random forests can classify images by learning rules over extracted features (color histograms, texture descriptors, edge statistics, etc.):
- Object/scene categorization into predefined classes
- Texture classification for material recognition or surface analysis
- Facial expression recognition for emotion detection systems
- Medical image classification for tasks like tumor detection or disease diagnosis
- Satellite image classification for land use and land cover mapping
Object detection
For object detection, tree-based models typically work on features extracted from candidate regions:
- Locating and identifying multiple objects within a single image
- Bounding box regression for precise object localization
- Using feature importance to identify which visual cues matter most for detection
- Hierarchical detection strategies that mirror the tree's own structure
- Ensemble methods that improve detection accuracy by combining multiple models
Feature importance analysis
One of the biggest practical advantages of random forests is that they naturally rank features by importance:
- You can identify which image features (color channels, texture measures, shape descriptors) contribute most to classification
- This guides feature engineering: if edge features dominate, you might invest in better edge detection
- Feature importance helps with dimensionality reduction by letting you drop uninformative features
- It also aids in understanding what the model is actually looking at, which is valuable for debugging and building trust in the system
Visualization techniques
Tree structure representation
- Node-link diagrams are the most common way to display a decision tree's hierarchical structure
- Color-coding nodes by class probability or impurity makes it easy to spot where the tree is confident vs uncertain
- Interactive visualizations let you expand and collapse subtrees to explore different levels
- Pruned tree visualizations highlight only the most important decision paths
- Sankey diagrams can represent the flow of samples through the tree, showing how data gets partitioned at each split
Feature importance plots
- Bar charts ranking features by importance score are the standard visualization
- Heat maps can show how feature importance varies across multiple trees in the forest
- Scatter plots of feature importance vs. feature correlation help identify redundant features
- Grouped bar charts comparing importance across different classes reveal which features matter for which categories
Decision boundaries
- 2D scatter plots with overlaid decision boundaries work well when you're looking at two features at a time
- Contour plots show predicted class probabilities across feature space
- Partial dependence plots illustrate how changing one feature affects the model's prediction while averaging over all other features
- These visualizations are especially useful for understanding how the model partitions feature space in image classification tasks
Advanced concepts
Extremely randomized trees
Extremely randomized trees (Extra-Trees) push the randomness further than standard random forests. Instead of searching for the optimal split threshold for each candidate feature, they pick split thresholds at random. This has two effects:
- Training is faster because there's no search for the best threshold
- Variance is reduced even further compared to standard random forests
- The trade-off is slightly higher bias, but in practice Extra-Trees often generalize well on computer vision tasks
Gradient boosting machines
Unlike random forests (which use bagging), gradient boosting builds trees sequentially:
- Train an initial tree on the data
- Compute the residual errors (the gap between predictions and true values)
- Train the next tree to predict those residuals
- Add the new tree's predictions to the running total
- Repeat until you reach the desired number of trees
Each tree corrects the mistakes of the ensemble so far. This typically produces stronger predictive models than random forests, but requires more careful tuning to avoid overfitting. Popular implementations include XGBoost, LightGBM, and CatBoost.
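The steps above can be sketched for a toy 1-D regression problem, using depth-1 "stumps" as the weak learners (a simplified illustration of the residual-fitting loop, not how XGBoost or LightGBM are implemented):

```python
def fit_stump(xs, residuals):
    """Find the threshold split on x minimizing squared error when each
    side predicts its mean residual; return the stump as a function."""
    best = None
    for t in xs:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x, t=t, lm=lm, rm=rm: lm if x <= t else rm

def boost(xs, ys, n_trees=50, lr=0.1):
    """Sequentially fit stumps to the residuals of the running prediction."""
    preds = [0.0] * len(xs)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]  # remaining errors
        stump = fit_stump(xs, residuals)                # next corrective tree
        stumps.append(stump)
        preds = [p + lr * stump(x) for p, x in zip(preds, xs)]
    return lambda x: sum(lr * s(x) for s in stumps)

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.0, 0.1, 0.2, 1.8, 2.0, 2.1]   # a step-like target
model = boost(xs, ys)
print([round(model(x), 2) for x in xs])
```

Each added stump nudges the ensemble toward the remaining errors, so after enough rounds the predictions closely track the step in the data.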
Random forest vs deep learning
| Aspect | Random Forests | Deep Learning |
|---|---|---|
| Dataset size | Work well on small to medium datasets | Need large datasets to shine |
| Feature engineering | Require hand-crafted features | Learn features automatically from raw pixels |
| Interpretability | Feature importance is straightforward | Often treated as a black box |
| Overfitting on small data | More resistant | More prone |
| Large-scale image recognition | Competitive but limited | State of the art |
Hybrid approaches exist that use deep learning to extract features and then feed those features into a random forest for the final classification, combining the representation power of neural networks with the robustness of ensemble methods.