Neural network hyperparameters are the control knobs you set before training begins—and they fundamentally determine whether your model learns effectively or fails spectacularly. You're being tested on understanding how these choices affect gradient flow, model capacity, and generalization, not just memorizing default values. The interplay between hyperparameters like learning rate, architecture depth, and regularization strength reveals your grasp of core optimization theory and the bias-variance tradeoff.
Think of hyperparameter tuning as balancing competing forces: you need enough model complexity to capture patterns (capacity) without memorizing noise (overfitting), and you need optimization steps that are bold enough to make progress but careful enough not to overshoot. Don't just memorize what each hyperparameter does—know why certain combinations work together and what symptoms indicate when a hyperparameter is misconfigured.
Optimization hyperparameters (learning rate, momentum, batch size) control how the network navigates the loss landscape during training. The core mechanism is gradient descent: iteratively adjusting weights to minimize error, where the path taken depends critically on step size and momentum.
Compare: Learning Rate vs. Momentum—both affect how quickly you traverse the loss landscape, but learning rate controls step magnitude while momentum controls directional persistence. If an FRQ asks about oscillation during training, consider whether the issue is step size (learning rate) or accumulated velocity (momentum).
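The distinction is easy to see in the update rule itself. Below is a minimal sketch of gradient descent with momentum on a 1-D quadratic loss; the learning rate and momentum values are illustrative assumptions, not recommended defaults.

```python
# Minimal sketch: gradient descent with momentum on the 1-D quadratic
# loss L(w) = w**2 (lr and mu values here are assumptions for illustration).
def sgd_momentum(w0, lr=0.1, mu=0.9, steps=200):
    w, v = w0, 0.0
    for _ in range(steps):
        grad = 2 * w            # dL/dw for L(w) = w**2
        v = mu * v - lr * grad  # momentum: velocity persists across steps
        w = w + v               # learning rate: scales each step's magnitude
    return w

print(sgd_momentum(5.0))  # converges toward the minimum at w = 0
```

Raising `lr` enlarges every step (risking overshoot), while raising `mu` makes the velocity term dominate, which is what produces oscillation even when individual steps are small.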
Capacity hyperparameters (number of hidden layers, neurons per layer) determine the expressiveness of your network, that is, how complex a function it can represent. The key tradeoff is the bias-variance dilemma: more capacity reduces underfitting but increases overfitting risk.
Compare: Hidden Layers vs. Neurons per Layer—both increase model capacity, but depth enables hierarchical abstraction while width enables parallel feature detection. Deep narrow networks learn compositional features; shallow wide networks learn many independent features.
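Depth and width also trade off differently in parameter count. A quick sketch, counting weights and biases in fully connected layers (the layer sizes are arbitrary assumptions for comparison):

```python
# Hedged sketch: parameter counts for a deep-narrow vs. a shallow-wide MLP
# with the same input (64) and output (10) dimensions; sizes are assumptions.
def param_count(layer_sizes):
    # weights (n_in * n_out) plus biases (n_out) per fully connected layer
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

deep_narrow = param_count([64, 32, 32, 32, 32, 10])   # 4 hidden layers of 32
shallow_wide = param_count([64, 128, 10])             # 1 hidden layer of 128

print(deep_narrow, shallow_wide)  # 5578 vs. 9610
```

Here the deep narrow network uses fewer parameters yet composes four levels of features, while the shallow wide one spends more parameters detecting many features in parallel at a single level.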
Regularization hyperparameters (dropout rate, L1/L2 penalty strength) combat overfitting by constraining model complexity or introducing noise during training. The underlying principle is that simpler models generalize better when training data is limited or noisy.
Compare: L2 Regularization vs. Dropout—both reduce overfitting, but L2 shrinks weights continuously while dropout forces redundancy by randomly removing neurons. L2 is deterministic and interpretable; dropout adds stochasticity and works especially well in large networks.
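The deterministic-vs-stochastic contrast shows up directly in code. A minimal sketch of both mechanisms (the penalty strength `lam` and drop probability `p` are assumed example values):

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_penalty(weights, lam=1e-2):
    # L2 adds lam * ||w||^2 to the loss: a deterministic, continuous
    # pull of every weight toward zero
    return lam * np.sum(weights ** 2)

def dropout(activations, p=0.5, training=True):
    # Inverted dropout: zero each activation with probability p during
    # training, rescaling survivors so the expected value is unchanged
    if not training:
        return activations
    mask = rng.random(activations.shape) >= p
    return activations * mask / (1.0 - p)

w = np.array([0.5, -1.0, 2.0])
print(l2_penalty(w))        # 0.01 * (0.25 + 1 + 4) = 0.0525
print(dropout(np.ones(8)))  # roughly half zeroed, survivors scaled to 2.0
```

Note that `l2_penalty` gives the same answer every call, while `dropout` produces a different mask each time, which is exactly the redundancy-forcing stochasticity described above.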
Initialization and training-duration hyperparameters (weight initialization scheme, number of epochs) determine how effectively training starts and when it should stop. Proper initialization ensures healthy gradient flow from the first step, while epoch count determines total exposure to the training data.
Compare: Xavier vs. He Initialization—both prevent gradient pathology, but Xavier assumes linear activation variance (appropriate for tanh/sigmoid) while He accounts for ReLU zeroing half the outputs. Using Xavier with ReLU often causes vanishing gradients in deep networks.
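The two schemes differ only in their variance-scaling rule. A sketch of the Gaussian variants of each (the layer size 512 is an assumed example):

```python
import numpy as np

rng = np.random.default_rng(42)

def xavier_init(n_in, n_out):
    # Var(w) = 2 / (n_in + n_out): balances forward and backward variance
    # under roughly linear activations (tanh/sigmoid near zero)
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_in, n_out))

def he_init(n_in, n_out):
    # Var(w) = 2 / n_in: the extra factor of 2 compensates for ReLU
    # zeroing half the activations on average
    std = np.sqrt(2.0 / n_in)
    return rng.normal(0.0, std, size=(n_in, n_out))

W = he_init(512, 512)
print(W.std())  # close to sqrt(2/512) ≈ 0.0625
```

With ReLU, the Xavier variance is roughly half of what He prescribes, so activation magnitudes shrink by a constant factor per layer; compounded over many layers, that is the vanishing-gradient pathology the comparison above warns about.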
| Concept | Best Examples |
|---|---|
| Optimization speed | Learning rate, Momentum, Batch size |
| Model capacity | Number of hidden layers, Neurons per layer |
| Non-linearity | Activation functions (ReLU, sigmoid, tanh) |
| Overfitting prevention | Dropout, L1/L2 regularization, Early stopping (epochs) |
| Gradient health | Weight initialization, Activation function choice |
| Bias-variance tradeoff | Layer/neuron count vs. regularization strength |
| Computational cost | Batch size, Network width, Number of epochs |
1. Which two hyperparameters both affect how quickly a network traverses the loss landscape, and how do their mechanisms differ?
2. If your network shows high training accuracy but poor validation accuracy, which category of hyperparameters should you adjust first, and what specific changes would you make?
3. Compare and contrast L1 and L2 regularization: when would you prefer one over the other, and what different effects do they have on the learned weights?
4. A deep ReLU network shows near-zero gradients in early layers during training. Which hyperparameter is likely misconfigured, and what initialization method would you use instead?
5. Explain the tradeoff involved in choosing batch size: how does a batch size of 32 differ from 512 in terms of gradient quality, convergence behavior, and computational requirements?