Fiveable

🧐Deep Learning Systems Unit 4 Review

4.3 Stochastic gradient descent and mini-batch training

Written by the Fiveable Content Team • Last updated August 2025

Gradient descent variations and optimization techniques are crucial for efficient deep learning. From stochastic gradient descent to mini-batch training, these methods balance speed and accuracy, enabling faster convergence and better generalization on large datasets.

Batch size effects and stabilization techniques further refine the training process. By understanding these approaches, we can optimize neural network performance, overcome local minima, and adapt to different problem domains and hardware configurations.

Gradient Descent Variations and Optimization Techniques

Motivation for stochastic gradient descent

  • Batch gradient descent is computationally expensive on large datasets, which slows training
  • SGD updates parameters after each training example, approximating the true gradient
  • Faster iterations and convergence; the noise in the updates can help escape local minima
  • Better generalization: the noise in stochastic updates introduces a regularization effect
  • Reduced memory requirements make SGD suitable for online learning (streaming data)
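The per-example update above can be sketched as follows; this is a minimal illustration on least-squares linear regression, where the data and variable names are assumptions for the example, not from the study guide:

```python
import numpy as np

# Synthetic least-squares problem (illustrative data, not from the guide)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_w = np.arange(1.0, 6.0)
y = X @ true_w + 0.01 * rng.normal(size=1000)

w = np.zeros(5)
lr = 0.01
for epoch in range(5):
    # Visit examples in a fresh random order each epoch
    for i in rng.permutation(len(X)):
        # Gradient of the single-example loss 0.5 * (x·w - y)^2
        grad = (X[i] @ w - y[i]) * X[i]
        w -= lr * grad          # one parameter update per training example
```

Note that each update uses only one example, so memory stays constant no matter how large the dataset is, which is why the same loop works for streaming data.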

Implementation of mini-batch training

  • Mini-batch training uses a subset of the training data (commonly 32, 64, 128, or 256 samples) for each update
  • Procedure: shuffle the training data, divide it into mini-batches, compute the gradient on each batch, and update the parameters
  • Balances computational efficiency against estimation accuracy, reducing gradient variance relative to single-example SGD
  • Exploits GPU and multi-core CPU architectures, improving training speed
  • Enables parallelization on modern hardware (TPUs, distributed systems)
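The shuffle-split-update procedure above can be sketched as one training epoch; the function and variable names here are illustrative assumptions:

```python
import numpy as np

def minibatch_epoch(X, y, w, lr=0.1, batch_size=32, rng=None):
    """One epoch of mini-batch SGD on least squares (illustrative sketch)."""
    rng = rng or np.random.default_rng()
    idx = rng.permutation(len(X))               # 1. shuffle the training data
    for start in range(0, len(X), batch_size):  # 2. divide into mini-batches
        b = idx[start:start + batch_size]
        err = X[b] @ w - y[b]
        grad = X[b].T @ err / len(b)            # 3. average gradient over the batch
        w = w - lr * grad                       # 4. update the parameters
    return w
```

The averaged batch gradient is a single matrix-vector product, which is exactly the shape of computation GPUs and multi-core CPUs accelerate well, while the averaging over `batch_size` samples reduces the variance of each update compared to single-example SGD.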

Effects of batch size

  • Small batch sizes: faster initial progress, higher gradient variance, potential for better generalization
  • Large batch sizes: more stable gradient estimates, but slower convergence and a risk of poorer generalization
  • Batch size interacts with the learning rate; larger batches may require proportionally larger rates ($lr \propto \text{batch\_size}$)
  • The generalization gap (train-test performance difference) is often smaller with smaller batch sizes
  • Adaptive techniques gradually increase the batch size during training (batch-size warm-up)
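The proportional scaling rule above (often called the linear scaling rule) can be written as a one-line helper; the base learning rate and base batch size here are assumed values for illustration:

```python
# Linear learning-rate scaling heuristic: lr grows in proportion to batch size.
# base_lr and base_batch are illustrative assumptions, not prescribed values.
BASE_LR = 0.1
BASE_BATCH = 32

def scaled_lr(batch_size, base_lr=BASE_LR, base_batch=BASE_BATCH):
    """Scale the learning rate proportionally to the batch size."""
    return base_lr * batch_size / base_batch
```

For example, moving from a batch size of 32 to 256 would scale a base rate of 0.1 up to 0.8 under this heuristic; in practice the rule is often combined with a warm-up period at the start of training.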

Techniques for stabilizing SGD

  • Momentum accumulates past gradients and helps overcome local minima ($v_t = \gamma v_{t-1} + \eta \nabla J(\theta)$)
  • Nesterov accelerated gradient: a look-ahead modification of momentum that is more responsive to gradient changes
  • Gradient noise: adding Gaussian noise to gradients helps escape sharp minima and can improve generalization
  • Learning rate schedules (step decay, exponential decay, cosine annealing)
  • Adaptive methods (AdaGrad, RMSprop, Adam) adjust the learning rate per parameter