9.6 You Only Look Once (YOLO) algorithm

Written by the Fiveable Content Team • Last updated August 2025

YOLO (You Only Look Once) revolutionizes object detection by processing an entire image in a single forward pass. Its grid-based architecture splits the image into cells and predicts bounding boxes and class probabilities for every cell simultaneously, balancing speed and accuracy. That balance makes YOLO well suited to real-time applications such as autonomous driving and video surveillance, despite known limitations with small or densely packed objects.

Overview of YOLO algorithm

  • YOLO (You Only Look Once) revolutionizes object detection in computer vision by processing entire images in a single forward pass
  • Divides images into a grid system, predicting bounding boxes and class probabilities simultaneously
  • Enables real-time object detection, crucial for applications in autonomous vehicles, robotics, and video surveillance

Architecture of YOLO

Grid-based approach

  • Splits the input image into an S×S grid, typically 13×13 or 19×19
  • Each grid cell responsible for detecting objects centered within it
  • Allows parallel processing of multiple regions, significantly speeding up detection
  • Enables detection of multiple objects in different parts of the image simultaneously
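The cell assignment above can be sketched in a few lines of plain Python (an illustrative helper, not from any YOLO codebase):

```python
def grid_cell_for_center(cx, cy, img_w, img_h, S=13):
    """Map an object's center (in pixels) to the S x S grid cell
    responsible for detecting it."""
    col = min(int(cx / img_w * S), S - 1)  # clamp centers on the right edge
    row = min(int(cy / img_h * S), S - 1)  # clamp centers on the bottom edge
    return row, col

# In a 416 x 416 image with S = 13, each cell spans 32 pixels, so an
# object centered at (208, 104) lands in cell (row 3, col 6).
```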

Bounding box prediction

  • Predicts B bounding boxes per grid cell, each with 5 components: x, y, w, h, and confidence
  • (x,y) represents the center coordinates relative to the grid cell
  • (w,h) denotes width and height relative to the entire image
  • Confidence score reflects the likelihood of an object's presence and accuracy of the box
  • Uses anchor boxes to improve prediction of objects with varying aspect ratios
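Decoding one such prediction into image coordinates can be sketched as follows (a YOLOv1-style decode, assuming x and y are already normalized offsets in [0, 1]):

```python
def decode_box(pred, row, col, S=13):
    """Decode one (x, y, w, h, conf) prediction into image-relative
    corner coordinates in [0, 1].

    x, y: offsets of the box center within grid cell (row, col)
    w, h: width and height relative to the whole image
    """
    x, y, w, h, conf = pred
    cx = (col + x) / S          # absolute center as fraction of image width
    cy = (row + y) / S          # absolute center as fraction of image height
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2), conf
```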

Class probability estimation

  • Predicts C class probabilities for each grid cell
  • Class probabilities conditioned on the grid cell containing an object
  • Combines bounding box confidence and class probabilities to yield final detection scores
  • Utilizes non-max suppression to eliminate redundant detections
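The suppression step can be sketched as a greedy loop over detections already scored by confidence × class probability (a minimal illustration, not the implementation from any particular YOLO release):

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def non_max_suppression(detections, iou_thresh=0.5):
    """detections: list of (box, score). Greedily keep the highest-scoring
    box, then drop any remaining box overlapping it above iou_thresh."""
    dets = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    while dets:
        best = dets.pop(0)
        kept.append(best)
        dets = [d for d in dets if iou(best[0], d[0]) < iou_thresh]
    return kept
```

In practice this runs per class, so boxes of different classes never suppress each other.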

YOLO vs traditional methods

Speed comparison

  • YOLO processes images at 45-155 FPS, significantly faster than R-CNN or Fast R-CNN
  • Achieves real-time detection on standard GPUs, enabling video processing at 30 FPS
  • Eliminates separate region proposal and feature resampling stages, reducing computational overhead
  • Unified network architecture allows end-to-end optimization, further improving speed

Accuracy trade-offs

  • Generally lower mean Average Precision (mAP) compared to two-stage detectors like Faster R-CNN
  • Struggles with small objects and objects in dense groups due to spatial constraints
  • Excels in detecting objects with strong contextual information
  • Balances speed and accuracy, making it suitable for real-time applications where some accuracy can be sacrificed for speed

YOLO versions evolution

YOLOv1 to YOLOv5

  • YOLOv1 introduced the core concept of single-stage detection
  • YOLOv2 (YOLO9000) added batch normalization and anchor boxes
  • YOLOv3 incorporated feature pyramid networks and multi-scale predictions
  • YOLOv4 introduced new data augmentation techniques and architectural changes
  • YOLOv5 optimized for production environments with improved scalability

Key improvements

  • Increased depth and width of network architecture for better feature extraction
  • Introduced anchor-free detection in later versions for improved flexibility
  • Enhanced data augmentation techniques like mosaic augmentation
  • Implemented focal loss to address class imbalance issues
  • Integrated attention mechanisms for better feature selection
[Figure: grid-based detection applied to aerial imagery, from "A Vehicle Detection Method for Aerial Image Based on YOLO"]

YOLO implementation

Network structure

  • Utilizes a convolutional neural network (CNN) backbone for feature extraction
  • Employs the Darknet architecture, with variations across different YOLO versions
  • Includes detection layers that output bounding box coordinates and class probabilities
  • Incorporates skip connections to preserve fine-grained features for small object detection

Loss function

  • Combines localization loss, confidence loss, and classification loss
  • Localization loss: $\lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 + (\sqrt{w_i} - \sqrt{\hat{w}_i})^2 + (\sqrt{h_i} - \sqrt{\hat{h}_i})^2 \right]$
  • Confidence loss: $\sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} (C_i - \hat{C}_i)^2 + \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} (C_i - \hat{C}_i)^2$
  • Classification loss: $\sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in classes} \left( p_i(c) - \hat{p}_i(c) \right)^2$
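The localization term can be computed in plain Python for a single responsible predictor (an illustrative sketch; YOLOv1 takes square roots of w and h so that an error of a few pixels costs more on a small box than on a large one):

```python
import math

def localization_loss(pred, target, lambda_coord=5.0):
    """YOLOv1-style localization loss for one responsible predictor.
    pred / target: (x, y, w, h) with w, h normalized to the image."""
    x, y, w, h = pred
    tx, ty, tw, th = target
    return lambda_coord * (
        (x - tx) ** 2 + (y - ty) ** 2
        + (math.sqrt(w) - math.sqrt(tw)) ** 2
        + (math.sqrt(h) - math.sqrt(th)) ** 2
    )
```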

Training process

  • Requires large datasets with annotated bounding boxes and class labels
  • Utilizes data augmentation techniques to increase dataset diversity
  • Employs transfer learning from pre-trained models on large datasets (ImageNet)
  • Uses multi-scale training to improve detection across various object sizes
  • Implements learning rate scheduling and early stopping for optimal convergence

Real-time object detection

YOLO in video streams

  • Processes video frames in real-time, typically achieving 30+ FPS on modern GPUs
  • Maintains temporal consistency through frame-to-frame tracking algorithms
  • Implements frame skipping techniques to balance between speed and accuracy
  • Utilizes GPU memory efficiently to handle continuous stream of input frames
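The frame-skipping idea can be sketched as a loop that alternates full detection with cheap tracking; `detect` and `track` here are hypothetical stand-ins for a real detector and tracker:

```python
def process_stream(frames, detect, track, detect_every=3):
    """Run the expensive detector only on every Nth frame; on the
    frames in between, update the previous detections with a
    lightweight tracker instead of a full forward pass."""
    results, last = [], None
    for i, frame in enumerate(frames):
        if i % detect_every == 0 or last is None:
            last = detect(frame)       # full YOLO forward pass
        else:
            last = track(frame, last)  # cheap update from previous boxes
        results.append(last)
    return results
```

Raising `detect_every` trades detection freshness for throughput, which is the usual knob when a stream outpaces the GPU.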

Applications in surveillance

  • Enables real-time monitoring of large areas for security purposes
  • Detects and tracks multiple objects simultaneously in crowded scenes
  • Integrates with alarm systems for immediate threat detection and response
  • Facilitates automated crowd counting and behavior analysis in public spaces

YOLO limitations

Small object detection

  • Struggles with objects occupying less than 4% of the image area
  • Grid-based approach limits the number of detections per cell
  • Coarse features in deeper layers reduce sensitivity to small objects
  • Potential solutions include using higher resolution input or feature pyramid networks

Dense object scenes

  • Performance degrades in images with many objects close together
  • Grid cells can only predict a fixed number of objects, leading to missed detections
  • Difficulty in separating overlapping bounding boxes accurately
  • Anchor box predictions may not align well with densely packed objects of varying sizes
[Figure: grid-based detection for unmanned surface vehicles, from Frontiers, "Lightweight object detection algorithm based on YOLOv5 for unmanned surface vehicles"]

YOLO optimizations

Anchor boxes

  • Predefined bounding box shapes to improve detection of objects with specific aspect ratios
  • Typically 3-5 anchor boxes per grid cell, learned from training data
  • Improves localization accuracy for objects with consistent shapes
  • Allows network to specialize in detecting objects of different scales and aspect ratios

Feature pyramid networks

  • Integrates features from multiple scales to improve detection across object sizes
  • Combines high-resolution, semantically weak features with low-resolution, semantically strong features
  • Enables better small object detection without significant computational overhead
  • Improves the network's ability to handle scale variations in input images

Transfer learning with YOLO

Pre-trained models

  • Utilize models trained on large datasets (COCO, Pascal VOC) as starting points
  • Provide strong feature extractors capable of recognizing general object characteristics
  • Reduce training time and data requirements for new tasks
  • Available for various YOLO versions and architectures (Darknet, PyTorch implementations)

Fine-tuning for custom datasets

  • Adapts pre-trained models to specific domains or object classes
  • Involves retraining the final layers or entire network with a lower learning rate
  • Requires careful balancing of new and old knowledge to prevent catastrophic forgetting
  • Often utilizes techniques like gradual unfreezing and discriminative fine-tuning
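Gradual unfreezing can be expressed as a simple schedule over layer groups; the group names and cadence below are illustrative, not tied to any specific YOLO implementation:

```python
def unfreeze_schedule(layer_groups, epoch, every=2):
    """Gradual unfreezing: the head (last group) trains from epoch 0,
    and one deeper group becomes trainable every `every` epochs.
    Returns {group_name: trainable} with groups ordered backbone-first."""
    n = len(layer_groups)
    unfrozen = min(n, 1 + epoch // every)  # head first, then work backwards
    return {g: (i >= n - unfrozen) for i, g in enumerate(layer_groups)}
```

In a framework like PyTorch, the resulting flags would be applied by setting `requires_grad` on each group's parameters before building the optimizer for that phase.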

YOLO in embedded systems

Mobile and edge devices

  • Optimized versions of YOLO (Tiny YOLO) designed for resource-constrained environments
  • Utilizes model compression techniques like pruning and quantization
  • Implements efficient inference engines (TensorRT, OpenVINO) for hardware acceleration
  • Enables on-device object detection without relying on cloud processing

Resource constraints

  • Balances model size, inference speed, and accuracy for embedded deployment
  • Adapts to limited memory and computational power of edge devices
  • Utilizes low-precision arithmetic (INT8, FP16) to reduce memory and computation requirements
  • Implements power-efficient inference techniques for battery-operated devices

Future of YOLO

Recent developments

  • Exploration of transformer-based architectures for object detection
  • Integration of self-supervised learning techniques to reduce annotation requirements
  • Development of more efficient backbones specifically designed for object detection
  • Investigation of neural architecture search for automated YOLO optimization

Integration with other techniques

  • Combining YOLO with instance segmentation for more detailed object analysis
  • Incorporating 3D object detection capabilities for applications in autonomous driving
  • Exploring multi-modal fusion (RGB + depth, thermal imaging) for robust detection
  • Integrating YOLO with tracking algorithms for long-term object persistence in videos