Activation Functions
What This Is
Activation functions are the element-wise nonlinearities a neural network inserts between linear layers. Without them a stack of linear layers collapses algebraically into a single linear layer — the activation is what gives depth its representational power.
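The collapse can be checked by hand on a tiny case. The sketch below uses plain Python lists and made-up integer weights (`W1`, `W2`, and `x` are illustrative values, not taken from any real model) to show that two stacked linear maps equal one merged matrix, and that inserting a ReLU breaks that equality.

```python
# Two "layers" with no activation between them, written as plain matrices.
W1 = [[1, 2], [3, 4], [5, 6]]   # maps 2 inputs -> 3 hidden units
W2 = [[1, 0, -1], [2, 1, 0]]    # maps 3 hidden units -> 2 outputs

def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m), both given as lists of rows."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def relu(m):
    """Element-wise max(0, v) on a matrix given as a list of rows."""
    return [[max(0, v) for v in row] for row in m]

x = [[-3], [2]]  # one input, as a column vector

stacked = matmul(W2, matmul(W1, x))          # apply the layers one after another
collapsed = matmul(matmul(W2, W1), x)        # apply the single merged matrix W2 @ W1
with_relu = matmul(W2, relu(matmul(W1, x)))  # same layers with a ReLU in between

print(stacked, collapsed, with_relu)
```

The first two results agree for every input, because matrix multiplication is associative. The third differs as soon as any hidden pre-activation is negative.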
The practical lesson is that the activation choice is almost always ReLU (and its relatives) for hidden layers and softmax (for classification) or sigmoid (for binary and gating) for outputs. The interesting part of this topic is not the function formulas — it is the gradient-flow consequences of each choice and the diagnostic signs of their failure modes.
When You Use It
- between every linear layer inside a neural network
- at the output, matched to the loss: linear + MSE, sigmoid + BCE, softmax + cross-entropy
- in gating structures: LSTM gates, attention softmax, gated feedforward
- at the end of a convolutional block before pooling
The Core Set
| Function | Formula | Range | Derivative | Notes |
|---|---|---|---|---|
| ReLU | max(0, x) | [0, ∞) | 1 if x > 0 else 0 | default for hidden layers |
| Leaky ReLU | max(αx, x), α ≈ 0.01 | (-∞, ∞) | 1 if x > 0 else α | fixes dead-ReLU |
| ELU | x if x > 0 else α(e^x - 1) | (-α, ∞) | 1 if x > 0 else α·e^x | smoother alternative to leaky ReLU |
| GELU | x · Φ(x), Φ = standard normal CDF | ≈ [-0.17, ∞) | Φ(x) + x·φ(x), smooth (φ = standard normal PDF) | standard in transformers |
| Sigmoid | 1 / (1 + e^-x) | (0, 1) | σ(x)(1 - σ(x)), max 0.25 | output activation for binary |
| Tanh | (e^x - e^-x) / (e^x + e^-x) | (-1, 1) | 1 - tanh²(x), max 1 | zero-centered, used in RNNs |
| Softmax | e^{x_i} / Σ_j e^{x_j} | (0, 1) each, sum = 1 | vector-valued Jacobian | output activation for multi-class |
| Swish / SiLU | x · σ(x) | ≈ [-0.28, ∞) | σ(x) + x·σ(x)(1 - σ(x)), smooth | common in modern vision backbones |
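The formulas in the table translate almost line for line into code. The following is a minimal scalar sketch using only the standard library; the function names and default `alpha` values are illustrative choices, and real frameworks provide vectorized versions.

```python
import math

def relu(x):
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Same as max(alpha * x, x) when 0 < alpha < 1.
    return x if x > 0 else alpha * x

def elu(x, alpha=1.0):
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def gelu(x):
    # x * Phi(x), with the standard normal CDF written via the error function.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def sigmoid(x):
    # Branch on the sign so exp() never receives a large positive argument.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def silu(x):
    return x * sigmoid(x)

def softmax(xs):
    # Subtract the max before exponentiating for numerical stability.
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]

# GELU and SiLU dip slightly below zero before rising; scan a grid to see how far.
grid = [i / 100 for i in range(-300, 1)]
gelu_min = min(gelu(v) for v in grid)
silu_min = min(silu(v) for v in grid)
print(f"GELU minimum on grid: {gelu_min:.3f}, SiLU minimum on grid: {silu_min:.3f}")
```

`math.tanh` already covers tanh, so it is not redefined here.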
ReLU Is The Default, And Why
ReLU has a derivative of exactly 1 on the positive side. That matters because in the chain-rule product of backpropagation, factors of 1 do not shrink. Stacking sigmoids (each with max derivative 0.25) across 20 layers bounds the activation part of the gradient by 0.25^20 ≈ 9 × 10^-13, which is a vanishing gradient. ReLU avoids that shrinkage on every path where the pre-activations are positive.
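The arithmetic behind that bound is short enough to check directly. This sketch multiplies only the activation derivatives and ignores the weight matrices, which is a simplifying assumption; the variable names are illustrative.

```python
sigmoid_max_grad = 0.25  # largest value of sigmoid'(x), reached at x = 0
relu_active_grad = 1.0   # ReLU derivative for any positive pre-activation
depth = 20

sigmoid_bound = sigmoid_max_grad ** depth
relu_factor = relu_active_grad ** depth

print(f"sigmoid bound over {depth} layers: {sigmoid_bound:.2e}")  # prints 9.09e-13
print(f"ReLU factor over {depth} layers: {relu_factor:.1f}")      # prints 1.0
```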
For the underlying chain-rule picture, see Backpropagation.
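Dead Neurons: The ReLU Failure Mode
If a ReLU unit's pre-activation becomes negative for every input in the training data, its output and its gradient are zero for all of them. No gradient means no update to the unit's incoming weights, so the unit stops learning and usually does not recover. The neuron is dead.
Diagnostic: count the fraction of units whose activation is zero for every example in a batch. A healthy ReLU unit is still zero on roughly half of its inputs, so count per unit, not per activation. Above ~30% dead units is usually a dead-neuron problem.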
Fixes:
- lower the learning rate — large updates often push units into the dead zone
- use Leaky ReLU, ELU, or GELU — these have nonzero gradient on the negative side
- use a proper initialization (see below) so pre-activations start balanced around zero
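One way to run the dead-unit diagnostic in PyTorch is sketched below. The helper name `dead_unit_fraction`, the layer sizes, and the trick of forcing two biases far negative to simulate dead units are all assumptions made for the demonstration.

```python
import torch
import torch.nn as nn

def dead_unit_fraction(activations: torch.Tensor) -> float:
    """Fraction of units (columns) whose activation is zero for every example (row)."""
    dead = (activations == 0).all(dim=0)
    return dead.float().mean().item()

torch.manual_seed(0)
layer = nn.Linear(16, 8)
with torch.no_grad():
    layer.bias[:2] = -1e3  # hypothetical: force two of the eight units to stay negative

x = torch.randn(256, 16)       # a stand-in batch of random inputs
acts = torch.relu(layer(x))

frac_dead = dead_unit_fraction(acts)           # units that never fire on this batch
frac_zero = (acts == 0).float().mean().item()  # all zero entries, dead or not

print(f"dead units: {frac_dead:.2f}, zero activations overall: {frac_zero:.2f}")
```

The second number is much larger than the first because healthy ReLU units are also zero on a sizeable share of inputs. Only the per-unit count isolates the units that never fire.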
Vanishing Gradients: The Sigmoid / Tanh Failure Mode
Sigmoid saturates to 0 or 1 far from the origin; its derivative approaches zero in both tails. When inputs to a sigmoid are large in magnitude, the gradient through that unit is effectively zero and backpropagation cannot update upstream weights.
Tanh has the same problem (saturation at ±1) but is zero-centered, which is why it was the pre-ReLU default for hidden layers. Still, modern networks almost never use sigmoid or tanh in hidden layers — the saturating gradient is too costly in deep stacks.
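To make the saturation concrete, the sketch below evaluates both derivatives at a few pre-activation values using only the standard library. The sample points are arbitrary.

```python
import math

def sigmoid(x):
    # Branch on the sign so exp() never receives a large positive argument.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2

# Derivatives peak at x = 0 and fall off quickly as |x| grows.
for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x={x:5.1f}  sigmoid'={sigmoid_grad(x):.2e}  tanh'={tanh_grad(x):.2e}")
```

Both derivatives are symmetric in x, so the same collapse happens for large negative inputs.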
When you see sigmoid or tanh:
- output layer of a binary classifier (with BCE loss)
- RNN/LSTM gates, where saturation is actually desirable for a gate
- the output of a generator with a bounded range
- legacy code — check if ReLU would help
Output Activation Must Match The Loss
| Task | Output Activation | Loss |
|---|---|---|
| regression | none (linear) | MSE / MAE |
| binary classification | sigmoid | binary cross-entropy |
| multi-class classification | softmax | cross-entropy |
| multi-label classification | sigmoid per label | binary cross-entropy per label |
PyTorch note: nn.CrossEntropyLoss expects raw logits, not softmax outputs. Applying softmax before CrossEntropyLoss is a very common bug. Similarly nn.BCEWithLogitsLoss expects logits, not sigmoid outputs — and it is numerically safer than BCELoss(sigmoid(x), y).
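Both points in the note can be checked numerically. The sketch below uses small hand-picked tensors (the values are illustrative) to show that `nn.CrossEntropyLoss` on logits equals `log_softmax` followed by `nll_loss`, and that the sigmoid-then-`BCELoss` route gives a distorted loss once the sigmoid saturates in float32.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Multi-class: CrossEntropyLoss takes raw logits and applies log_softmax itself.
logits = torch.tensor([[2.0, -1.0, 0.5], [0.1, 0.2, 3.0]])
targets = torch.tensor([0, 2])
ce = nn.CrossEntropyLoss()(logits, targets)
manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)

# Binary: a confidently wrong prediction with a large logit.
big_logit = torch.tensor([40.0])
label = torch.tensor([0.0])
safe = nn.BCEWithLogitsLoss()(big_logit, label)        # computed from the logit, about 40
naive = nn.BCELoss()(torch.sigmoid(big_logit), label)  # sigmoid rounds to 1.0 in float32 first

print(f"CE: {ce.item():.4f}  manual: {manual.item():.4f}")
print(f"BCEWithLogits: {safe.item():.2f}  BCELoss(sigmoid): {naive.item():.2f}")
```

In the naive path the true loss is lost as soon as the sigmoid output rounds to exactly 1. `BCELoss` clamps its log terms to keep the result finite, so it returns a value that no longer reflects the logit.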
Initialization Matches The Activation
Activation choice and weight initialization are coupled:
- ReLU-family → Kaiming (He) initialization: variance 2 / fan_in
- tanh / sigmoid → Xavier (Glorot) initialization: variance 1 / fan_in or 2 / (fan_in + fan_out)
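The default nn.Linear initialization in PyTorch is a generic uniform scheme with bound 1 / sqrt(fan_in), which is not the ReLU-gain Kaiming variant, so apply the matching scheme explicitly when it matters. For nn.Linear or custom layers, call nn.init.kaiming_normal_(weight, nonlinearity="relu") or the equivalent.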
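The target variances can be verified empirically. This sketch fills two weight tensors with `nn.init.kaiming_normal_` and `nn.init.xavier_normal_` and compares the sample variance with the formulas above. The layer sizes are arbitrary choices for the demonstration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
fan_in, fan_out = 512, 256  # arbitrary sizes; weight shape is (fan_out, fan_in)

w_relu = torch.empty(fan_out, fan_in)
nn.init.kaiming_normal_(w_relu, nonlinearity="relu")  # target variance 2 / fan_in

w_tanh = torch.empty(fan_out, fan_in)
nn.init.xavier_normal_(w_tanh)  # target variance 2 / (fan_in + fan_out)

kaiming_var = w_relu.var().item()
xavier_var = w_tanh.var().item()

print(f"Kaiming: sample {kaiming_var:.5f} vs target {2 / fan_in:.5f}")
print(f"Xavier:  sample {xavier_var:.5f} vs target {2 / (fan_in + fan_out):.5f}")
```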
See Batch Normalization and Initialization for the full story.
Minimal Example
```python
import torch.nn as nn

# Two hidden layers with ReLU; the last layer has no activation.
model = nn.Sequential(
    nn.Linear(64, 128),
    nn.ReLU(),
    nn.Linear(128, 128),
    nn.ReLU(),
    nn.Linear(128, 10),  # raw logits; no softmax here
)
loss_fn = nn.CrossEntropyLoss()  # expects logits
```
No activation on the final layer — CrossEntropyLoss applies log_softmax internally.
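When probabilities are needed at inference time, softmax can be applied outside the loss. A minimal sketch, assuming a smaller stand-in model and random inputs and labels (all hypothetical):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(
    nn.Linear(64, 128),
    nn.ReLU(),
    nn.Linear(128, 10),  # raw logits
)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(4, 64)          # stand-in batch of 4 examples
y = torch.tensor([1, 0, 9, 3])  # stand-in class labels

logits = model(x)
loss = loss_fn(logits, y)  # training: pass the logits straight to the loss

with torch.no_grad():
    probs = torch.softmax(model(x), dim=1)  # inference: convert logits to probabilities
    preds = probs.argmax(dim=1)             # same classes as argmax over the logits
```

Softmax preserves the ordering of the logits, so the predicted class is the same either way. The probabilities only matter when a calibrated score is reported.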
What To Inspect
- the fraction of ReLU units that are zero for every example in a representative batch; above 30% is a dead-neuron problem
- gradient magnitudes per layer — shrinking by depth points at saturating nonlinearities
- pre-activation distributions — if they drift far from zero, initialization or normalization is broken
- whether the final layer matches the loss (an explicit softmax followed by CrossEntropyLoss, which applies its own, is the classic double-softmax bug)
- what swapping ReLU for GELU or Swish does — usually small, sometimes meaningful
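The per-layer gradient check from the list above can be sketched as follows. The helper `layer_grad_norms`, the depth and width, the random input batch, and the pairing of Kaiming init with ReLU and Xavier init with sigmoid are all assumptions chosen for this demonstration.

```python
import torch
import torch.nn as nn

def layer_grad_norms(activation_cls, init_fn, depth=20, width=128, seed=0):
    """Build a deep MLP, run one backward pass, return the weight-gradient norm per layer."""
    torch.manual_seed(seed)
    layers = []
    for _ in range(depth):
        linear = nn.Linear(width, width)
        init_fn(linear.weight)       # initialization matched to the activation
        nn.init.zeros_(linear.bias)
        layers += [linear, activation_cls()]
    model = nn.Sequential(*layers)
    x = torch.randn(32, width)       # stand-in input batch
    model(x).sum().backward()        # any scalar works for inspecting gradient flow
    return [m.weight.grad.norm().item() for m in model if isinstance(m, nn.Linear)]

relu_norms = layer_grad_norms(
    nn.ReLU, lambda w: nn.init.kaiming_normal_(w, nonlinearity="relu"))
sigmoid_norms = layer_grad_norms(nn.Sigmoid, nn.init.xavier_normal_)

# Ratio of the first layer's gradient norm to the last layer's.
relu_ratio = relu_norms[0] / relu_norms[-1]
sigmoid_ratio = sigmoid_norms[0] / sigmoid_norms[-1]
print(f"first/last gradient norm  ReLU: {relu_ratio:.2e}  sigmoid: {sigmoid_ratio:.2e}")
```

A ratio many orders of magnitude below 1 means the early layers receive almost no learning signal.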
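Failure Pattern
Applying softmax before CrossEntropyLoss. The loss sees already-normalized probabilities in [0, 1] and applies its own log-softmax on top. Because those inputs can differ by at most 1, the second softmax is close to uniform: gradients are compressed and the loss cannot fall below a floor well above zero, so training is slow or stalls.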
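The cost of the double softmax shows up as a floor on the loss. The sketch below, written with the standard library only, compares the loss on very confident logits with the loss obtained when those logits are passed through softmax first. The helper names and the 10-class setup are illustrative.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_from_logits(logits, target):
    """Negative log-softmax of the target entry, via a stable log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]

num_classes = 10
target = 0
confident_logits = [20.0] + [0.0] * (num_classes - 1)  # a near-perfect prediction

correct_loss = cross_entropy_from_logits(confident_logits, target)
double_loss = cross_entropy_from_logits(softmax(confident_logits), target)

# Best case for the double softmax: inputs are (1, 0, ..., 0), so the loss is
# log(e + (C - 1)) - 1 = log(1 + (C - 1) / e).
floor = math.log(1 + (num_classes - 1) / math.e)

print(f"logits: {correct_loss:.2e}  softmax-then-loss: {double_loss:.3f}  floor: {floor:.3f}")
```

Even a perfect prediction leaves the doubled loss at about 1.46 for 10 classes, while the correct loss approaches zero.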
A second failure pattern is placing sigmoid or tanh in hidden layers of a deep network and blaming the optimizer when gradients vanish. The activation is the cause.
A third failure pattern is ignoring dead ReLUs. The network keeps training; the accuracy plateaus early because the share of capacity held by the dead units is silently turned off.
A fourth failure pattern is mismatched initialization — Kaiming init with a tanh activation, or Xavier init with ReLU — which produces pre-activation distributions that drift and saturate within a few batches.
Quick Checks
- Is the final layer activation matched to the loss?
- Are the hidden activations ReLU (or a ReLU relative)?
- Is the initialization matched to the activation family?
- What fraction of ReLU units are zero for every example in a batch? Anything over 30% is suspicious.
- Are gradient magnitudes roughly comparable across layers?
Practice
- Train the same MLP with sigmoid, tanh, ReLU, and GELU hidden activations. Plot loss curves and compare.
- Stack 20 layers with sigmoid and ReLU respectively and plot gradient magnitudes per layer.
- Count dead ReLUs after a few epochs with a too-large learning rate. Repeat with a smaller learning rate.
- Swap ReLU for Leaky ReLU in the dead-neuron scenario and observe the recovery.
- Deliberately apply softmax before CrossEntropyLoss and show that training fails to converge.
- Explain why GELU is common in transformers and why ReLU is still common in CNNs.
- Describe one reason tanh survives in LSTM cells while being unusual in feedforward nets.
- Explain why the output activation for multi-label classification is sigmoid, not softmax.
- State the relationship between activation choice and initialization scheme.
- Describe what happens to gradients when a sigmoid is saturated at 0.99.
Longer Connection
Activation functions sit next to:
- Backpropagation — the chain-rule picture that makes dead neurons and vanishing gradients concrete
- Batch Normalization and Initialization — the layers and init schemes that keep pre-activations in a healthy range
- Optimizers and Regularization — the update rules that consume the gradient through the activation
- Debugging Deep Learning — how to locate the broken layer when training misbehaves
The activation is the nonlinear knob that makes depth work. Picking it is easy; diagnosing when it misbehaves is where the skill lives.