4.4 Challenges in training deep networks: vanishing and exploding gradients

2 min read • July 25, 2024

Deep neural networks face significant challenges in training due to gradient-related issues. Vanishing and exploding gradients can impede learning, especially in early layers, making it difficult to capture long-range dependencies and achieve stable convergence.

Various factors influence gradient propagation, including network depth, activation functions, and weight initialization. To address these challenges, researchers have developed techniques like gradient clipping, residual connections, and normalization methods to improve training stability and performance.

Gradient Challenges in Deep Networks

Challenges in deep network training

  • Vanishing gradients occur when gradients become extremely small during backpropagation, impeding learning in early layers (sigmoid, tanh)
  • Exploding gradients happen when gradients become extremely large, causing unstable training and numerical overflow (RNNs, deep networks)
  • Both problems make it difficult to learn long-range dependencies and lead to slower convergence or outright failure to converge (language models, time series); a numeric sketch follows this list
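A quick back-of-the-envelope sketch (plain NumPy; the layer count and per-layer gains are chosen purely for illustration) shows how repeated multiplication during backpropagation drives gradients toward zero or toward overflow:

```python
import numpy as np

# The sigmoid derivative is at most 0.25, so backpropagating through many
# saturating layers multiplies the gradient by a factor <= 0.25 per layer.
n_layers = 30
vanishing = 0.25 ** n_layers          # upper bound on the surviving gradient
print(f"after {n_layers} sigmoid layers: {vanishing:.3e}")   # ~8.7e-19

# Conversely, if each layer's Jacobian has norm around 1.5 (weights slightly
# too large), the gradient grows geometrically and eventually overflows.
exploding = 1.5 ** n_layers
print(f"with per-layer gain 1.5:       {exploding:.3e}")      # ~1.9e+05
```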

Factors affecting gradient propagation

  • Network depth amplifies gradient issues, as each layer can diminish or amplify gradients (ResNet, VGGNet)
  • Activation functions influence gradient flow: sigmoid and tanh saturate for large inputs, while ReLU mitigates vanishing gradients but can suffer from the "dying ReLU" problem
  • Weight initialization impacts gradient propagation, with Xavier/Glorot initialization designed for sigmoid/tanh and He initialization tailored for ReLU activations (see the sketch after this list)
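As a hedged illustration (PyTorch; the layer sizes are arbitrary), the standard initializers can be applied per layer according to the activation that follows:

```python
import torch
import torch.nn as nn

# Hypothetical layers; sizes are arbitrary for illustration.
tanh_layer = nn.Linear(256, 256)
relu_layer = nn.Linear(256, 256)

# Xavier/Glorot: keeps activation variance roughly constant for tanh/sigmoid.
nn.init.xavier_uniform_(tanh_layer.weight)
nn.init.zeros_(tanh_layer.bias)

# He (Kaiming): accounts for ReLU zeroing out half of its inputs.
nn.init.kaiming_uniform_(relu_layer.weight, nonlinearity='relu')
nn.init.zeros_(relu_layer.bias)
```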

Mitigation Techniques

Techniques for exploding gradients

  • Gradient clipping sets a threshold for gradient magnitudes, scaling down those that exceed it (RNNs, LSTMs); see the sketch after this list
  • Gradient normalization rescales the entire gradient vector, maintaining its direction while adjusting its magnitude
  • Layer-wise adaptive rate scaling applies different learning rates to each layer, balancing gradient magnitudes
  • Gradient noise addition injects small amounts of noise into the gradients, helping escape sharp minima
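A minimal sketch of clipping by global norm using PyTorch's built-in utility (the model, data, and threshold of 1.0 are placeholders, not a recommended configuration):

```python
import torch
import torch.nn as nn

model = nn.LSTM(input_size=32, hidden_size=64)        # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(20, 8, 32)                             # (seq_len, batch, features)
target = torch.randn(20, 8, 64)

output, _ = model(x)
loss = nn.functional.mse_loss(output, target)

optimizer.zero_grad()
loss.backward()
# Rescale the whole gradient vector if its global L2 norm exceeds 1.0;
# the direction is preserved, only the magnitude is capped.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```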

Solutions for vanishing gradients

  • Residual connections allow gradients to flow directly through the network, maintaining their strength in very deep architectures (ResNet, DenseNet); see the sketch after this list
  • Batch normalization normalizes the inputs to each layer, reducing internal covariate shift and enabling higher learning rates
  • Layer normalization is similar to batch normalization but normalizes across features, making it useful for recurrent neural networks
  • Highway networks use gating mechanisms to control information flow, allowing unimpeded gradient propagation
  • Dense connections connect each layer to every other layer in feed-forward fashion, strengthening feature propagation
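A minimal residual-block sketch (PyTorch; the channel count and layer layout are assumptions, not the exact ResNet recipe), showing the identity shortcut that lets gradients bypass the transformation:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Simplified residual block: output = relu(F(x) + x)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # The skip connection gives gradients a direct path back to x.
        return torch.relu(out + x)

block = ResidualBlock(channels=16)
y = block(torch.randn(4, 16, 32, 32))   # same shape in and out
```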

Key Terms to Review (22)

Activation Functions: Activation functions are mathematical functions that determine the output of a neural network node based on its input. They introduce non-linearity into the model, allowing it to learn complex patterns in data. By transforming the input signals, activation functions help in making decisions about whether to activate a neuron, significantly impacting the overall performance and capabilities of deep learning systems.
Backpropagation: Backpropagation is an algorithm used for training artificial neural networks by calculating the gradient of the loss function with respect to each weight through the chain rule. This method allows the network to adjust its weights in the opposite direction of the gradient to minimize the loss, making it a crucial component in optimizing neural networks.
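As a tiny illustration (PyTorch autograd; the scalar function and step size are arbitrary), the gradient of a composed function is computed via the chain rule and then used to update the weight in the direction that reduces the loss:

```python
import torch

w = torch.tensor(2.0, requires_grad=True)
x = torch.tensor(3.0)

loss = (torch.sigmoid(w * x) - 1.0) ** 2   # toy loss
loss.backward()                            # chain rule computes dloss/dw

with torch.no_grad():
    w -= 0.1 * w.grad                      # one gradient-descent step
```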
Batch Normalization: Batch normalization is a technique used to improve the training of deep neural networks by normalizing the inputs of each layer, which helps stabilize learning and accelerate convergence. By reducing internal covariate shift, it allows networks to learn more effectively, making them less sensitive to the scale of weights and biases, thus addressing some challenges faced in training deep architectures.
Dense connections: Dense connections refer to a specific architectural pattern in deep learning networks where each layer is connected to every other layer in a feedforward manner. This design promotes feature reuse, enhances gradient flow during backpropagation, and mitigates issues like vanishing and exploding gradients, making it easier to train deeper networks effectively.
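A condensed dense-block sketch (PyTorch; the growth rate and depth are illustrative, not the DenseNet paper's settings), where each layer receives the concatenation of all previous feature maps:

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Each layer sees the concatenation of every earlier layer's output."""
    def __init__(self, in_channels: int, growth_rate: int = 8, n_layers: int = 3):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Conv2d(in_channels + i * growth_rate, growth_rate,
                      kernel_size=3, padding=1)
            for i in range(n_layers)
        ])

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = torch.relu(layer(torch.cat(features, dim=1)))
            features.append(out)              # feature reuse across layers
        return torch.cat(features, dim=1)

block = DenseBlock(in_channels=16)
y = block(torch.randn(2, 16, 8, 8))           # output has 16 + 3*8 channels
```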
Dying ReLU: Dying ReLU refers to a phenomenon where neurons in a neural network, specifically those using the ReLU (Rectified Linear Unit) activation function, become inactive and stop learning. This often happens when the inputs to these neurons are consistently negative, leading to zero outputs and gradients, which effectively makes them useless. Understanding Dying ReLU is crucial as it relates to the properties of common activation functions and highlights challenges in training deep networks, especially concerning gradient behavior.
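A small sketch of the effect (PyTorch; the bias shift is deliberately contrived to force negative pre-activations): once every input to a ReLU unit is negative, both its output and its gradient are zero, so the unit stops updating. Leaky ReLU is one common workaround.

```python
import torch
import torch.nn as nn

x = torch.randn(128, 10)
layer = nn.Linear(10, 1)
with torch.no_grad():
    layer.bias.fill_(-100.0)            # contrived: all pre-activations negative

out = torch.relu(layer(x)).sum()
out.backward()
print(layer.weight.grad.abs().sum())    # tensor(0.) -- the unit is "dead"

# Leaky ReLU keeps a small negative-slope gradient, so the unit can recover.
leaky_out = nn.functional.leaky_relu(layer(x), negative_slope=0.01)
```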
Exploding Gradients: Exploding gradients refer to a phenomenon in deep learning where the gradients of the loss function become excessively large during training, leading to numerical instability and making it difficult for the model to converge. This issue often arises in deep networks, particularly recurrent neural networks (RNNs), as they involve backpropagation through many layers, causing the gradients to accumulate and potentially blow up. Understanding exploding gradients is crucial for effectively training complex models and mitigating their adverse effects.
Gradient clipping: Gradient clipping is a technique used to prevent the exploding gradient problem in neural networks by limiting the size of the gradients during training. This method helps to stabilize the learning process, particularly in deep networks and recurrent neural networks, where large gradients can lead to instability and ineffective training. By constraining gradients to a specific threshold, gradient clipping ensures more consistent updates and improves convergence rates.
Gradient Noise Addition: Gradient noise addition is a technique used to improve the training of deep learning models by introducing random noise into the gradient updates during optimization. This method helps prevent issues such as overfitting and aids in escaping local minima, which are common challenges faced when training deep networks, particularly in the context of vanishing and exploding gradients. By adding noise, the model's robustness is enhanced, allowing for more effective exploration of the loss landscape.
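A minimal sketch (PyTorch; the model and the noise scale of 1e-3 are placeholders) of injecting Gaussian noise into the gradients between the backward pass and the optimizer step:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                    # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = nn.functional.mse_loss(model(x), y)

optimizer.zero_grad()
loss.backward()
with torch.no_grad():
    for p in model.parameters():
        if p.grad is not None:
            # Perturb each gradient with small Gaussian noise.
            p.grad += 1e-3 * torch.randn_like(p.grad)
optimizer.step()
```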
Gradient normalization: Gradient normalization is a technique used in training deep learning models to ensure that the gradients computed during backpropagation maintain a stable range. This is crucial because, during training, gradients can either diminish (vanishing gradients) or grow excessively (exploding gradients), which can hinder the learning process and lead to suboptimal model performance. By normalizing gradients, the learning process becomes more stable and allows for more effective weight updates.
Gradient Propagation: Gradient propagation refers to the process of calculating and passing gradients (or derivatives) backward through a neural network during the training phase, primarily to update the network's weights and biases. This is essential for optimizing the performance of deep learning models, as it allows the network to learn from errors by adjusting parameters based on the calculated gradients. However, this process faces significant challenges, especially when dealing with deep networks, leading to issues like vanishing and exploding gradients that can hinder effective training.
He initialization: He initialization is a method used to set the initial weights of neural network layers, particularly effective for networks using ReLU activation functions. This technique helps mitigate problems like vanishing and exploding gradients by scaling the weights based on the number of input neurons. Proper weight initialization is crucial in training deep networks, as it influences convergence speed and overall model performance.
Highway Networks: Highway networks are a specialized architecture in deep learning designed to tackle the problems of vanishing and exploding gradients that commonly occur in very deep neural networks. They introduce gated shortcut ('carry') connections that let part of the input bypass the layer's transformation, allowing gradients to flow more easily during training and enabling the effective training of deeper models without the typical issues associated with traditional architectures.
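A compact highway-layer sketch (PyTorch; the dimension and gate-bias initialization are illustrative choices), where a transform gate T decides how much of the transformed signal versus the raw input passes through:

```python
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    """y = T(x) * H(x) + (1 - T(x)) * x, with a learned gate T."""
    def __init__(self, dim: int):
        super().__init__()
        self.transform = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)
        # Bias the gate toward "carry" early in training so gradients pass through.
        nn.init.constant_(self.gate.bias, -1.0)

    def forward(self, x):
        h = torch.relu(self.transform(x))
        t = torch.sigmoid(self.gate(x))
        return t * h + (1.0 - t) * x

layer = HighwayLayer(dim=64)
y = layer(torch.randn(8, 64))
```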
Layer Normalization: Layer normalization is a technique used to normalize the inputs across the features for each data point in a neural network, aiming to stabilize and speed up the training process. Unlike batch normalization, which normalizes across a mini-batch, layer normalization works independently on each training example, making it particularly useful in recurrent neural networks and transformer architectures. This technique helps address issues like vanishing and exploding gradients, enhances the training of LSTMs, and improves the overall performance of models that rely on attention mechanisms.
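A brief usage sketch (PyTorch; the shapes are arbitrary) contrasting the two: batch normalization computes statistics across the batch dimension, while layer normalization computes them per example across the features, so it remains well defined even at batch size 1:

```python
import torch
import torch.nn as nn

x = torch.randn(4, 32)            # (batch, features)

batch_norm = nn.BatchNorm1d(32)   # statistics taken over the batch dimension
layer_norm = nn.LayerNorm(32)     # statistics taken per example, over features

bn_out = batch_norm(x)
ln_out = layer_norm(x)

# Layer norm also works for a single example, which batch norm (in training
# mode) cannot normalize meaningfully.
single = layer_norm(torch.randn(1, 32))
```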
Layer-wise Adaptive Rate Scaling: Layer-wise adaptive rate scaling is a technique used in training deep neural networks that adjusts the learning rates for each layer individually, based on the characteristics of the parameters in those layers. This method addresses the challenges of training deep networks, especially regarding vanishing and exploding gradients, by enabling more nuanced updates to model weights, which can significantly enhance convergence and performance during training.
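PyTorch has no built-in LARS optimizer; as a simplified stand-in for the per-layer idea (not the LARS algorithm itself, and the layer choices and learning rates below are purely illustrative), parameter groups can each be given their own learning rate:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10))

# Earlier layers often receive smaller gradients, so give them a larger rate.
optimizer = torch.optim.SGD([
    {"params": model[0].parameters(), "lr": 0.1},
    {"params": model[2].parameters(), "lr": 0.01},
])
```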
Long-range dependencies: Long-range dependencies refer to the connections between elements in a sequence that are far apart from each other, which can significantly affect the understanding or prediction of that sequence. In various deep learning contexts, capturing these dependencies is crucial for tasks involving sequential data, such as language modeling and time series forecasting, where understanding context from distant elements is necessary. Properly handling long-range dependencies allows models to maintain relevant information over longer sequences, improving performance and accuracy in various applications.
Network Depth: Network depth refers to the number of layers in a neural network, specifically the layers that process input and extract features. A deeper network can learn more complex representations but often faces challenges, such as vanishing and exploding gradients during training. This depth is crucial in determining the network's capacity to capture intricate patterns, especially in architectures designed for tasks like image recognition and natural language processing.
Normalization Methods: Normalization methods are techniques used to scale and transform input data into a consistent range or format, enhancing the performance and stability of deep learning models. By addressing variations in data distribution, these methods help mitigate issues like vanishing and exploding gradients, which can occur when training deep networks. Proper normalization allows models to learn effectively by ensuring that inputs are on a similar scale, improving convergence during the training process.
Residual Connections: Residual connections are a neural network design feature that allows gradients to flow more easily through deep networks by providing shortcuts between layers. This design helps mitigate issues like vanishing and exploding gradients, making it easier to train very deep architectures. By enabling the model to learn residual mappings instead of direct mappings, residual connections improve learning efficiency and performance in complex tasks like language processing and image recognition.
Training stability: Training stability refers to the consistency and reliability of the training process in deep learning models, ensuring that the training dynamics lead to effective learning without significant fluctuations or instabilities. Achieving training stability is crucial as it affects convergence, the ability to minimize loss effectively, and the overall performance of deep neural networks. Factors such as vanishing and exploding gradients can challenge training stability, while adaptive learning rate methods aim to enhance it by adjusting the learning rate during training.
Vanishing gradients: Vanishing gradients refer to a problem in deep learning where the gradients of the loss function become exceedingly small as they are backpropagated through the layers of a neural network. This issue can hinder the training of deep networks, making it difficult for them to learn from data and effectively adjust their weights. It is particularly problematic in architectures with many layers, where information about errors diminishes rapidly, impacting the model's ability to learn complex patterns.
Weight Initialization: Weight initialization refers to the strategy of setting the initial values of the weights in a neural network before training begins. Proper weight initialization is crucial for effective learning, as it can influence the convergence speed and final performance of the model. A good initialization helps in preventing issues like vanishing and exploding gradients, which can severely hinder the training process in deep networks.
Xavier/Glorot Initialization: Xavier or Glorot initialization is a technique used to set the initial weights of neural networks, aiming to maintain a balanced variance of activations throughout the layers. This method helps mitigate issues like vanishing and exploding gradients, which can significantly hinder the training process in deep networks. By scaling the weights according to the number of input and output units, it ensures that the gradients during backpropagation do not diminish to zero or blow up to infinity, thus facilitating effective learning.