Neural network hyperparameters are the control knobs you set before training begins, and they fundamentally determine whether your model learns effectively or fails spectacularly. You're being tested on understanding how these choices affect gradient flow, model capacity, and generalization, not just memorizing default values. The interplay between hyperparameters like learning rate, architecture depth, and regularization strength reveals your grasp of core optimization theory and the bias-variance tradeoff.
Think of hyperparameter tuning as balancing competing forces: you need enough model complexity to capture patterns (capacity) without memorizing noise (overfitting), and you need optimization steps that are bold enough to make progress but careful enough not to overshoot. Don't just memorize what each hyperparameter does; know why certain combinations work together and what symptoms indicate when a hyperparameter is misconfigured.
Optimization Dynamics
These hyperparameters control how the network navigates the loss landscape during training. The core mechanism is gradient descent: iteratively adjusting weights to minimize error, where the path taken depends critically on step size and momentum.
Learning Rate
Controls step size during gradient descent: too large causes oscillation or divergence; too small causes painfully slow convergence or getting stuck in poor local minima
Adaptive methods like Adam and RMSprop automatically adjust the rate per-parameter, combining benefits of momentum with per-dimension scaling
Most sensitive hyperparameter in practice; often the first thing to tune, typically starting around 10^-3 and adjusting by factors of 10
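The step-size tradeoff above can be seen on the simplest possible loss. A minimal sketch, where the quadratic loss and the specific rates are illustrative choices, not recommendations:

```python
# Hypothetical sketch: gradient descent on f(w) = w^2, whose gradient is 2w.
# A moderate learning rate converges; an overly large one diverges.

def descend(lr, steps=50, w0=1.0):
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w  # gradient descent step: w <- w - lr * grad
    return w

w_good = descend(lr=0.1)   # each step multiplies w by (1 - 2*lr) = 0.8: converges
w_bad = descend(lr=1.1)    # each step multiplies w by -1.2: oscillates and diverges

print(abs(w_good), abs(w_bad))
```

The divergence condition here is |1 - 2·lr| > 1; on real losses the threshold depends on local curvature, which is why the learning rate is so problem-sensitive.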
Momentum
Accumulates velocity from previous gradients: the update rule becomes v_t = γ·v_{t-1} + η·∇L (with weights then updated as w ← w − v_t), where γ is the momentum coefficient and η is the learning rate
Smooths noisy gradients and accelerates convergence through flat regions of the loss surface, acting like a ball rolling downhill with inertia
Typical values range from 0.5 to 0.9; higher momentum means more influence from past updates, risking overshooting in sharp valleys
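The velocity update above can be sketched directly on a toy quadratic loss; the function name and parameter values are illustrative:

```python
# Hypothetical sketch of the momentum update v_t = gamma*v_{t-1} + eta*grad,
# followed by w <- w - v_t, applied to f(w) = w^2 (gradient 2w).

def momentum_descend(gamma, eta=0.05, steps=100, w0=1.0):
    w, v = w0, 0.0
    for _ in range(steps):
        grad = 2 * w                # gradient of w^2
        v = gamma * v + eta * grad  # accumulate velocity from past gradients
        w -= v
    return w

w_plain = momentum_descend(gamma=0.0)  # gamma = 0 reduces to vanilla gradient descent
w_mom = momentum_descend(gamma=0.9)    # gamma = 0.9 keeps 90% of the old velocity
```

With γ = 0.9 the iterate spirals toward the minimum rather than moving straight to it, which is the "overshooting in sharp valleys" risk noted above.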
Batch Size
Number of samples per gradient update: affects both the quality of gradient estimates and computational efficiency
Smaller batches (32-128) yield noisier gradient estimates, and that noise can be beneficial for escaping poor local minima; larger batches give more stable estimates but are memory-intensive
Interacts with learning rate: larger batches often require proportionally larger learning rates to maintain training dynamics
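The gradient-noise effect can be checked numerically: a mini-batch gradient is a sample mean, so its spread shrinks roughly as 1/√batch_size. A sketch using synthetic per-example "gradients" (all names and values are illustrative):

```python
# Hypothetical sketch: compare the spread of mini-batch gradient estimates
# for small vs. large batches, using synthetic per-example gradient values.
import random

random.seed(0)
# Stand-in per-example gradients: mean 1.0, large per-example spread.
population = [random.gauss(1.0, 5.0) for _ in range(100_000)]

def batch_mean_spread(batch_size, trials=500):
    """Standard deviation of the mini-batch mean across many random batches."""
    means = []
    for _ in range(trials):
        batch = random.sample(population, batch_size)
        means.append(sum(batch) / batch_size)
    avg = sum(means) / trials
    return (sum((m - avg) ** 2 for m in means) / trials) ** 0.5

small_noise = batch_mean_spread(32)   # noisier estimate
large_noise = batch_mean_spread(512)  # much more stable estimate
```

Since the spread scales as 1/√batch_size, going from 32 to 512 should cut the noise by about a factor of 4, which is one motivation for scaling the learning rate up with the batch size.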
Compare: Learning Rate vs. Momentum. Both affect how quickly you traverse the loss landscape, but learning rate controls step magnitude while momentum controls directional persistence. If an FRQ asks about oscillation during training, consider whether the issue is step size (learning rate) or accumulated velocity (momentum).
Architecture Capacity
These hyperparameters determine the expressiveness of your network: how complex a function it can represent. The key tradeoff is the bias-variance dilemma: more capacity reduces underfitting but increases overfitting risk.
Number of Hidden Layers
Depth enables hierarchical feature learning: each layer can build increasingly abstract representations from the previous layer's outputs
Deeper networks capture more complex patterns but face challenges like vanishing gradients and require more careful initialization
Problem-dependent choice; simple tasks may need only 1-2 layers while image recognition often requires dozens
Number of Neurons in Each Layer
Width determines representational capacity per layer: more neurons can capture more features but increase parameter count as O(n_in × n_out)
Too few neurons cause underfitting (model can't represent the target function); too many cause overfitting and computational overhead
Common heuristic: start with layer sizes between input and output dimensions, then tune based on validation performance
Activation Functions
Introduce non-linearity: without them, stacked layers collapse to a single linear transformation regardless of depth
ReLU (f(x)=max(0,x)) dominates modern architectures due to computational efficiency and reduced vanishing gradient issues; sigmoid and tanh still used in output layers or recurrent networks
Choice affects gradient flow; ReLU can cause "dead neurons" (zero gradient for negative inputs), while sigmoid saturates at extremes
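The gradient-flow behavior described above is easy to verify by hand; a minimal sketch of ReLU and sigmoid alongside their derivatives:

```python
# Hypothetical sketch of common activations and their gradients, illustrating
# ReLU's zero gradient for negative inputs (the "dead neuron" region) and
# sigmoid's saturation at extreme inputs.
import math

def relu(x):
    return max(0.0, x)

def relu_grad(x):
    return 1.0 if x > 0 else 0.0   # no gradient flows for negative inputs

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)           # maximum of 0.25 at x = 0

print(relu_grad(-2.0))   # 0.0: a neuron stuck here receives no learning signal
print(sigmoid_grad(10.0))  # near zero: saturated sigmoid barely passes gradient
```

Note that sigmoid's gradient never exceeds 0.25, so stacking many sigmoid layers multiplies small factors together, which is the classic vanishing-gradient mechanism.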
Compare: Hidden Layers vs. Neurons per Layer. Both increase model capacity, but depth enables hierarchical abstraction while width enables parallel feature detection. Deep narrow networks learn compositional features; shallow wide networks learn many independent features.
Regularization and Generalization
These hyperparameters combat overfitting by constraining model complexity or introducing noise during training. The underlying principle is that simpler models generalize better when training data is limited or noisy.
Regularization Techniques (L1, L2)
Add penalty terms to the loss function: L2 adds λ·Σ wᵢ², penalizing large weights; L1 adds λ·Σ |wᵢ|, encouraging sparsity
L1 produces sparse models (many weights become exactly zero, useful for feature selection); L2 produces small but non-zero weights (weight decay)
Regularization strength λ must be tuned; too high causes underfitting, too low provides insufficient regularization
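The penalty terms can be sketched in a few lines; the weights, λ value, and data loss below are illustrative placeholders:

```python
# Hypothetical sketch: adding L1 / L2 penalty terms to a loss value.
# The lambda (regularization strength) and weights are illustrative, not tuned.

def l2_penalty(weights, lam):
    return lam * sum(w ** 2 for w in weights)   # penalizes large weights

def l1_penalty(weights, lam):
    return lam * sum(abs(w) for w in weights)   # encourages exact zeros

weights = [0.5, -1.5, 0.0, 2.0]
data_loss = 0.8  # stand-in for the unregularized training loss

loss_l2 = data_loss + l2_penalty(weights, lam=0.01)
loss_l1 = data_loss + l1_penalty(weights, lam=0.01)
```

Because the L1 penalty's gradient is a constant ±λ rather than shrinking with the weight, it can push small weights exactly to zero, which is why L1 yields sparse models.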
Dropout Rate
Randomly zeros neurons during training: each forward pass uses a different "thinned" network, preventing co-adaptation between neurons
Acts as ensemble averaging: at test time, the full network approximates averaging predictions from many sub-networks
Typical rates of 0.2-0.5; higher dropout for larger networks or more overfitting-prone tasks; usually not applied to output layers
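One common formulation is "inverted" dropout, which scales surviving activations by 1/(1−p) during training so that no rescaling is needed at test time. A minimal sketch; the function name and values are illustrative:

```python
# Hypothetical sketch of inverted dropout: zero each activation with
# probability p during training and scale survivors by 1/(1-p), so the
# expected activation matches test time (where dropout is disabled).
import random

def dropout(activations, p, training=True):
    if not training:
        return list(activations)  # full network at test time
    keep = 1.0 - p
    return [a / keep if random.random() < keep else 0.0 for a in activations]

random.seed(0)
out = dropout([1.0] * 1000, p=0.5)
zeroed = sum(1 for a in out if a == 0.0)  # roughly p * len(activations)
```

With p = 0.5, surviving activations become 2.0, so the expected value of each output stays 1.0, matching what the test-time network produces.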
Compare: L2 Regularization vs. Dropout. Both reduce overfitting, but L2 shrinks weights continuously while dropout forces redundancy by randomly removing neurons. L2 is deterministic and interpretable; dropout adds stochasticity and works especially well in large networks.
Training Duration and Initialization
These hyperparameters affect when training starts and stops effectively. Proper initialization ensures healthy gradient flow from the first step, while epoch count determines total exposure to training data.
Number of Epochs
One epoch = one complete pass through the training data; total training iterations equals epochs × (dataset size / batch size)
Too few epochs cause underfitting (model hasn't converged); too many cause overfitting (model memorizes training noise)
Early stopping monitors validation loss and halts training when performance degrades, effectively making epochs a learned hyperparameter
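The early-stopping rule can be sketched with a patience counter; the validation-loss curve below is fabricated to mimic a model that starts overfitting:

```python
# Hypothetical sketch of early stopping: halt when validation loss fails to
# improve for `patience` consecutive epochs. Loss values are illustrative.

def early_stop_epoch(val_losses, patience=2):
    best, bad_streak = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, bad_streak = loss, 0   # improvement: reset the counter
        else:
            bad_streak += 1              # no improvement this epoch
            if bad_streak >= patience:
                return epoch             # stop; best weights were saved earlier
    return len(val_losses) - 1

# Validation loss improves, bottoms out at epoch 3, then rises (overfitting).
losses = [1.0, 0.7, 0.5, 0.45, 0.47, 0.50, 0.55]
stop_at = early_stop_epoch(losses, patience=2)
```

In practice the weights from the best epoch (here, epoch 3) are restored, which is what makes the effective number of epochs a learned quantity.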
Weight Initialization Method
Sets the starting point in weight space: poor initialization causes vanishing gradients (weights → 0) or exploding gradients (weights → ∞)
Xavier/Glorot initialization scales weights by √(1/n_in), designed for sigmoid/tanh; He initialization scales by √(2/n_in), designed for ReLU
Breaks symmetry between neurons; identical initialization would cause all neurons in a layer to learn identical features
Compare: Xavier vs. He Initialization. Both prevent gradient pathology by preserving activation variance across layers, but Xavier assumes the activation is roughly linear around zero (appropriate for tanh/sigmoid) while He accounts for ReLU zeroing half the outputs. Using Xavier with ReLU often causes vanishing gradients in deep networks.
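The two scales can be sketched side by side; `init_layer` and the layer sizes are illustrative names, not a library API:

```python
# Hypothetical sketch of Xavier vs. He initialization scales: Xavier draws
# weights with standard deviation sqrt(1/n_in) (for tanh/sigmoid), He with
# sqrt(2/n_in) (for ReLU, which zeroes half the outputs).
import math
import random

def xavier_std(n_in):
    return math.sqrt(1.0 / n_in)

def he_std(n_in):
    return math.sqrt(2.0 / n_in)

def init_layer(n_in, n_out, std, rng):
    # Drawing a different random value per weight breaks symmetry between
    # neurons; identical values would make every neuron learn the same feature.
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]

rng = random.Random(0)
w = init_layer(n_in=256, n_out=128, std=he_std(256), rng=rng)
```

The extra factor of 2 in He initialization compensates for ReLU discarding the negative half of each layer's pre-activations, keeping the variance of the signal roughly constant as it flows through the network.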
Quick Reference Table
Optimization speed: Learning rate, momentum, batch size
Model capacity: Number of hidden layers, neurons per layer
Non-linearity: Activation functions (ReLU, sigmoid, tanh)
Overfitting prevention: Dropout, L1/L2 regularization, early stopping (epochs)
Gradient health: Weight initialization, activation function choice
Bias-variance tradeoff: Layer/neuron count vs. regularization strength
Computational cost: Batch size, network width, number of epochs
Self-Check Questions
Which two hyperparameters both affect how quickly a network traverses the loss landscape, and how do their mechanisms differ?
If your network shows high training accuracy but poor validation accuracy, which category of hyperparameters should you adjust first, and what specific changes would you make?
Compare and contrast L1 and L2 regularization: when would you prefer one over the other, and what different effects do they have on the learned weights?
A deep ReLU network shows near-zero gradients in early layers during training. Which hyperparameter is likely misconfigured, and what initialization method would you use instead?
Explain the tradeoff involved in choosing batch size: how does a batch size of 32 differ from 512 in terms of gradient quality, convergence behavior, and computational requirements?