Activation functions are the gatekeepers of your neural network—they determine whether and how information flows from one layer to the next. Without them, even a 100-layer network would collapse into a single linear transformation, incapable of learning anything more complex than a straight line. You're being tested on understanding why different activations exist, when to use each one, and what problems they solve (or create). The core concepts here—vanishing gradients, computational efficiency, output ranges, and gradient flow—show up repeatedly in architecture design questions.
Don't just memorize that ReLU outputs zero for negative inputs. Know why that matters for training dynamics, how it compares to sigmoid's gradient behavior, and when you'd choose a variant like Leaky ReLU instead. The best exam answers connect activation choice to the specific problem: classification vs. regression, shallow vs. deep networks, training stability vs. computational cost.
Sigmoid and tanh were the original workhorses of neural networks. Both "saturate," meaning their gradients approach zero for extreme inputs, which creates the infamous vanishing gradient problem in deep networks.
Compare: Sigmoid vs. tanh—both saturate and cause vanishing gradients, but tanh's zero-centered output typically leads to faster convergence. If an FRQ asks about improving a sigmoid-based network without changing architecture, switching to tanh is a valid first step.
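For concreteness, here is a minimal NumPy sketch of both gradients (the sample points are arbitrary). Notice how both collapse toward zero away from the origin, while tanh's peak gradient of 1.0 beats sigmoid's 0.25:

```python
# A minimal, framework-agnostic NumPy sketch of sigmoid and tanh gradients.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)            # peaks at 0.25 (x = 0), -> 0 for large |x|

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2    # peaks at 1.0 (x = 0), -> 0 for large |x|

xs = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(sigmoid_grad(xs))   # ~[4.5e-05, 0.105, 0.25, 0.105, 4.5e-05]
print(tanh_grad(xs))      # ~[8.2e-09, 0.071, 1.0,  0.071, 8.2e-09]
```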
The Rectified Linear Unit (ReLU) revolutionized deep learning by providing a constant gradient for positive inputs. The gradient is either 0 or 1, so there is no saturation and no vanishing. But this simplicity introduces its own failure mode: a neuron whose inputs are always negative receives zero gradient and can stop updating entirely, the "dying ReLU" problem.
Compare: ReLU vs. Leaky ReLU vs. ELU—all solve vanishing gradients, but they handle negative inputs differently. ReLU kills them (risking dead neurons), Leaky ReLU passes a small signal, and ELU provides smooth negative outputs. For very deep networks where dying neurons are a concern, ELU often performs best despite the computational cost.
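A minimal NumPy sketch of the three variants, assuming common default slopes (alpha = 0.01 for Leaky ReLU, alpha = 1.0 for ELU), shows how each treats negative inputs:

```python
# A minimal NumPy sketch of how ReLU, Leaky ReLU, and ELU treat negative
# inputs; the alpha values below are common illustrative defaults.
import numpy as np

def relu(x):
    return np.maximum(0.0, x)                              # negatives become exactly 0

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)                   # small linear leak

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))   # smooth, bounded below by -alpha

xs = np.array([-3.0, -0.5, 0.0, 2.0])
print(relu(xs))         # [ 0.     0.     0.     2.  ]
print(leaky_relu(xs))   # [-0.03  -0.005  0.     2.  ]
print(elu(xs))          # [-0.95  -0.393  0.     2.  ]
```

Because Leaky ReLU and ELU keep a nonzero gradient on the negative side, neurons pushed into negative territory can still recover, which is exactly the dying-neuron fix the comparison above describes.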
Recent research has produced activations that combine the benefits of multiple approaches. These functions often learn or adapt their behavior, trading simplicity for performance.
Compare: ReLU vs. Swish—ReLU is faster and simpler, Swish often achieves better accuracy on complex tasks. When computational resources aren't constrained and you're optimizing for performance, Swish is worth testing.
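As a sketch, Swish is just x times sigmoid(beta * x). The code below assumes beta = 1 (this special case is also known as SiLU), though some variants treat beta as learnable:

```python
# A minimal NumPy sketch of Swish, x * sigmoid(beta * x), with beta = 1 assumed.
import numpy as np

def swish(x, beta=1.0):
    return x / (1.0 + np.exp(-beta * x))    # equivalent to x * sigmoid(beta * x)

xs = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
print(swish(xs))   # ~[-0.072, -0.269, 0.0, 0.731, 3.928]
# Unlike ReLU, Swish is smooth everywhere and passes a small negative
# signal for modestly negative inputs instead of zeroing them out.
```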
Output activations such as softmax, sigmoid, and linear aren't about hidden-layer dynamics; they're about matching your output layer to the problem type.
Compare: Softmax vs. Sigmoid for classification—sigmoid works for binary or multi-label problems (independent probabilities), softmax works for multi-class problems (mutually exclusive categories). Choosing wrong here is a common exam mistake.
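A minimal NumPy sketch with made-up logits shows the difference: softmax produces probabilities that sum to 1 (one winning class), while sigmoid scores each class independently:

```python
# A minimal NumPy sketch contrasting output activations on the same logits
# (the logit values are made up for illustration).
import numpy as np

def softmax(logits):
    z = logits - logits.max()    # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()           # sums to 1: mutually exclusive classes

def sigmoid(logits):
    return 1.0 / (1.0 + np.exp(-logits))   # each score independent: multi-label

logits = np.array([2.0, 1.0, 0.1])
print(softmax(logits))   # ~[0.659, 0.242, 0.099]  -> sums to 1.0
print(sigmoid(logits))   # ~[0.881, 0.731, 0.525]  -> does not sum to 1.0
```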
| Concept | Best Examples |
|---|---|
| Vanishing gradient problem | Sigmoid, tanh (both saturate) |
| Solving vanishing gradients | ReLU, Leaky ReLU, ELU |
| Dying neuron problem | ReLU (causes it), Leaky ReLU/PReLU/ELU (solve it) |
| Zero-centered output | tanh, ELU |
| Computational efficiency | ReLU, Leaky ReLU (fastest) |
| Learnable parameters | PReLU (learns slope; see the sketch after this table) |
| Multi-class classification output | Softmax |
| Regression output | Linear |
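To make the learnable-parameter row concrete, here is a minimal NumPy sketch of PReLU with a single shared alpha (an assumption for simplicity; implementations often learn one alpha per channel):

```python
# A minimal NumPy sketch of PReLU with one shared, learnable alpha.
import numpy as np

def prelu(x, alpha):
    return np.where(x > 0, x, alpha * x)

def prelu_grad_alpha(x):
    # Derivative of PReLU with respect to alpha: x on the negative side,
    # 0 elsewhere. This is what lets gradient descent tune the leak.
    return np.where(x > 0, 0.0, x)

xs = np.array([-2.0, -0.5, 1.0])
print(prelu(xs, alpha=0.25))    # [-0.5   -0.125  1.  ]
print(prelu_grad_alpha(xs))     # [-2.    -0.5    0.  ]
```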
1. Both sigmoid and tanh suffer from vanishing gradients—what property of tanh makes it generally preferable for hidden layers despite this shared weakness?
2. A colleague reports that 40% of neurons in their deep ReLU network have stopped updating entirely. Which two activation functions would you recommend as replacements, and what mechanism do they use to solve this problem?
3. Compare and contrast ELU and Leaky ReLU: how does each handle negative inputs, and what trade-off does ELU make for its smoother gradient behavior?
4. You're designing a network to classify images into exactly one of 10 categories. Which output activation should you use, and why would sigmoid be inappropriate here?
5. If an FRQ asks you to explain why a 50-layer network with sigmoid activations fails to train while the same architecture with ReLU succeeds, what two key concepts should your answer address?