Activation Functions¶

What This Is¶

Activation functions are the element-wise nonlinearities a neural network inserts between linear layers. Without them a stack of linear layers collapses algebraically into a single linear layer — the activation is what gives depth its representational power.
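The collapse can be checked numerically. The sketch below assumes PyTorch and uses two bias-free weight matrices with arbitrary illustrative sizes (8, 16, 4); with biases the same argument holds with an extra constant term.

```python
import torch

torch.manual_seed(0)

# Two linear maps with no activation in between (sizes are illustrative).
W1 = torch.randn(16, 8)   # first layer: 8 -> 16
W2 = torch.randn(4, 16)   # second layer: 16 -> 4
x = torch.randn(32, 8)    # a batch of 32 inputs

stacked = (x @ W1.T) @ W2.T     # two stacked linear layers
collapsed = x @ (W2 @ W1).T     # one equivalent linear layer

# Inserting a ReLU between the layers breaks the equivalence.
with_relu = torch.relu(x @ W1.T) @ W2.T

same_without_activation = torch.allclose(stacked, collapsed, atol=1e-3)
same_with_activation = torch.allclose(with_relu, collapsed, atol=1e-3)
print(same_without_activation, same_with_activation)
```

The first comparison holds up to floating-point error because `(x W1ᵀ) W2ᵀ = x (W2 W1)ᵀ`; the second does not, which is the extra expressiveness the nonlinearity buys.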

In practice the choice is almost always ReLU or one of its relatives for hidden layers, and at the output softmax for multi-class classification or sigmoid for binary outputs and gates. The interesting part of this topic is not the formulas; it is the gradient-flow consequences of each choice and the diagnostic signs of each failure mode.

When You Use It¶

between every linear layer inside a neural network
at the output, matched to the loss: linear + MSE, sigmoid + BCE, softmax + cross-entropy
in gating structures: LSTM gates, attention softmax, gated feedforward
at the end of a convolutional block before pooling

The Core Set¶

Function	Formula	Range	Derivative	Notes
ReLU	`max(0, x)`	`[0, ∞)`	`1` if `x > 0` else `0`	default for hidden layers
Leaky ReLU	`max(αx, x)`, `α ≈ 0.01`	`(-∞, ∞)`	`1` or `α`	fixes dead-ReLU
ELU	`x` if `x > 0` else `α(e^x - 1)`	`(-α, ∞)`	smooth	smoother alternative to leaky ReLU
GELU	`x · Φ(x)`	`≈ [-0.17, ∞)`	smooth	standard in transformers
Sigmoid	`1 / (1 + e^-x)`	`(0, 1)`	`σ(x)(1 - σ(x))`, max 0.25	output activation for binary
Tanh	`(e^x - e^-x) / (e^x + e^-x)`	`(-1, 1)`	`1 - tanh²(x)`, max 1	zero-centered, used in RNNs
Softmax	`e^{x_i} / Σ_j e^{x_j}`	`(0, 1)` each, sum = 1	vector-valued Jacobian	output activation for multi-class
Swish / SiLU	`x · σ(x)`	`≈ [-0.28, ∞)`	smooth	common in modern vision backbones
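For reference, the scalar formulas in the table can be written out with nothing but the standard library. This is a minimal sketch for checking values by hand, not a replacement for the library implementations; the defaults `alpha=0.01` and `alpha=1.0` and the grid used to probe the minima are assumptions of this example.

```python
import math

def relu(x):
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    return x if x > 0 else alpha * x

def elu(x, alpha=1.0):
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def gelu(x):
    # x * Phi(x), with Phi the standard normal CDF written via erf
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def sigmoid(x):
    # fine for moderate |x|; very negative x would overflow math.exp
    return 1.0 / (1.0 + math.exp(-x))

def silu(x):
    return x * sigmoid(x)

def softmax(xs):
    m = max(xs)                            # subtract the max for stability
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]

# GELU and SiLU dip slightly below zero before returning to it;
# probe the minimum on a grid over [-10, 10].
grid = [i / 1000 for i in range(-10000, 10001)]
gelu_min = min(gelu(x) for x in grid)
silu_min = min(silu(x) for x in grid)
print(round(gelu_min, 3), round(silu_min, 3))
```

Both smooth functions are bounded below (roughly -0.17 for GELU and -0.28 for SiLU), unlike Leaky ReLU, which is unbounded on the negative side.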

ReLU Is The Default — And Why¶

ReLU has a derivative of exactly 1 on the positive side. That matters because in the chain-rule product of backpropagation, factors of 1 do not shrink. Stacking sigmoids (each with max derivative 0.25) across 20 layers contributes an activation factor of at most 0.25^20 ≈ 9 × 10^-13, which is a vanishing gradient. ReLU contributes no such shrinkage along paths where the pre-activations are positive.
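The arithmetic is short enough to check directly. Note that this counts only the activation-derivative factors; the weight matrices contribute their own factors to the real gradient, so treat it as an upper bound on the activation's share, not a prediction of the full gradient.

```python
depth = 20

max_sigmoid_grad = 0.25                  # sigma'(0), the largest value sigma' takes
sigmoid_bound = max_sigmoid_grad ** depth   # 0.25**20 == 2**-40

relu_factor = 1.0 ** depth               # active ReLU units pass gradients unchanged

print(f"{sigmoid_bound:.2e}")            # prints 9.09e-13
print(relu_factor)                       # prints 1.0
```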

For the underlying chain-rule picture, see Backpropagation.

Dead Neurons — The ReLU Failure Mode¶

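If a ReLU unit's pre-activation becomes negative for every input in the training data, typically after a large update early in training, its output and its gradient are zero on every example. No gradient means no update to the unit's incoming weights, so the unit stops learning and usually never recovers. The neuron is dead.

Diagnostic: count the fraction of units that output zero for every example in a representative batch. A unit that is zero on only some examples is ordinary ReLU sparsity, not a dead neuron. As a rule of thumb, more than ~30% of units dead points to a dead-neuron problem.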
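One way to run this diagnostic, sketched under the assumption that you can get hold of a layer's post-ReLU activations as a `(batch, units)` tensor. The helper name `dead_fraction` is hypothetical, and the large negative bias is only there to manufacture dead units for the demonstration.

```python
import torch
import torch.nn as nn

def dead_fraction(post_relu: torch.Tensor) -> float:
    """Fraction of units that are zero for every example in the batch.

    post_relu has shape (batch, units).
    """
    dead = (post_relu == 0).all(dim=0)    # per unit: zero on all examples?
    return dead.float().mean().item()

torch.manual_seed(0)
layer = nn.Linear(64, 128)
with torch.no_grad():
    layer.bias[:32] = -1e3                # force 32 of 128 units dead (illustration only)

x = torch.randn(256, 64)
acts = torch.relu(layer(x))

dead = dead_fraction(acts)                    # units that never fire
sparsity = (acts == 0).float().mean().item()  # zeros overall, includes healthy sparsity
print(dead, sparsity)
```

The two numbers differ on purpose: `sparsity` counts every zero entry, which is high even in a healthy ReLU layer, while `dead` counts only units that never fire on the batch.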

Fixes:

lower the learning rate — large updates often push units into the dead zone
use Leaky ReLU, ELU, or GELU — these have nonzero gradient on the negative side
use a proper initialization (see below) so pre-activations start balanced around zero
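The second fix can be seen by asking autograd for the derivative at a negative input. This is a small sketch using the functional forms in `torch.nn.functional` with their default parameters; the helper `grad_at` and the probe point `-2.0` are choices made for this example.

```python
import torch
import torch.nn.functional as F

def grad_at(fn, value: float) -> float:
    """Derivative of a scalar activation fn at the given point, via autograd."""
    x = torch.tensor(value, requires_grad=True)
    fn(x).backward()
    return x.grad.item()

x0 = -2.0  # a pre-activation in ReLU's dead zone
grads = {
    "relu": grad_at(F.relu, x0),
    "leaky_relu": grad_at(F.leaky_relu, x0),   # default negative_slope=0.01
    "elu": grad_at(F.elu, x0),                 # default alpha=1.0
    "gelu": grad_at(F.gelu, x0),
}
for name, g in grads.items():
    print(f"{name:>10}: {g:+.4f}")
```

Only ReLU reports exactly zero; the others pass a small gradient back, which is what lets a unit climb out of the negative region.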

Vanishing Gradients — The Sigmoid / Tanh Failure Mode¶

Sigmoid saturates to 0 or 1 far from the origin; its derivative approaches zero in both tails. When inputs to a sigmoid are large in magnitude, the gradient through that unit is effectively zero and backpropagation cannot update upstream weights.
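To put numbers on the saturation, the derivatives can be evaluated at a few points with the standard library alone. The probe points are arbitrary choices for this sketch.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # sigma'(x) = sigma(x) * (1 - sigma(x))

def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2

for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x={x:5.1f}  sigmoid'={sigmoid_grad(x):.2e}  tanh'={tanh_grad(x):.2e}")
```

The derivative peaks at the origin (0.25 for sigmoid, 1 for tanh) and falls off roughly exponentially in `|x|`, so a unit only a few units away from zero already passes very little gradient.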

Tanh has the same problem (saturation at ±1) but is zero-centered, which is why it was the pre-ReLU default for hidden layers. Still, modern networks almost never use sigmoid or tanh in hidden layers — the saturating gradient is too costly in deep stacks.

When you see sigmoid or tanh:

output layer of a binary classifier (with BCE loss)
RNN/LSTM gates, where saturation is actually desirable for a gate
the output of a generator with a bounded range
legacy code — check if ReLU would help

Output Activation Must Match The Loss¶

Task	Output Activation	Loss
regression	none (linear)	MSE / MAE
binary classification	sigmoid	binary cross-entropy
multi-class classification	softmax	cross-entropy
multi-label classification	sigmoid per label	binary cross-entropy per label

PyTorch note: nn.CrossEntropyLoss expects raw logits, not softmax outputs. Applying softmax before CrossEntropyLoss is a very common bug. Similarly nn.BCEWithLogitsLoss expects logits, not sigmoid outputs — and it is numerically safer than BCELoss(sigmoid(x), y).
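The equivalences behind this note can be verified on random data. The sketch below uses made-up shapes (8 examples, 10 classes) and checks that each logits-based loss matches the explicit two-step computation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Multi-class: CrossEntropyLoss(logits) == NLL(log_softmax(logits))
logits = torch.randn(8, 10)
targets = torch.randint(0, 10, (8,))
ce = nn.CrossEntropyLoss()(logits, targets)
manual_ce = F.nll_loss(F.log_softmax(logits, dim=1), targets)

# Binary: BCEWithLogitsLoss(logits) == BCELoss(sigmoid(logits))
bin_logits = torch.randn(8)
bin_targets = torch.randint(0, 2, (8,)).float()
bce_from_logits = nn.BCEWithLogitsLoss()(bin_logits, bin_targets)
bce_from_probs = nn.BCELoss()(torch.sigmoid(bin_logits), bin_targets)

print(ce.item(), manual_ce.item())
print(bce_from_logits.item(), bce_from_probs.item())
```

On well-scaled inputs like these the pairs agree; the logits-based versions are preferred because they stay accurate when logits are large enough that an explicit sigmoid or softmax would round to exactly 0 or 1.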

Initialization Matches The Activation¶

Activation choice and weight initialization are coupled:

ReLU-family → Kaiming (He) initialization: variance 2 / fan_in
tanh / sigmoid → Xavier (Glorot) initialization: variance 1 / fan_in or 2 / (fan_in + fan_out)

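PyTorch's default for nn.Linear is neither of these: it draws weights uniformly from ±1 / sqrt(fan_in), a variance of 1 / (3 · fan_in). That is usually adequate for shallow networks, but for deep stacks or custom layers call nn.init.kaiming_normal_(weight, nonlinearity="relu") or the equivalent explicitly.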
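A sketch of explicit initialization with `torch.nn.init`, followed by a check that the empirical weight variance lands near the formulas listed above. The layer sizes are arbitrary, and zeroing the biases is a common convention assumed here, not something the text prescribes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
fan_in, fan_out = 512, 256

# Layer that will feed a ReLU: Kaiming (He) init, variance 2 / fan_in.
relu_layer = nn.Linear(fan_in, fan_out)
nn.init.kaiming_normal_(relu_layer.weight, nonlinearity="relu")
nn.init.zeros_(relu_layer.bias)

# Layer that will feed a tanh: Xavier (Glorot) init, variance 2 / (fan_in + fan_out).
tanh_layer = nn.Linear(fan_in, fan_out)
nn.init.xavier_uniform_(tanh_layer.weight)
nn.init.zeros_(tanh_layer.bias)

kaiming_var = relu_layer.weight.var().item()
xavier_var = tanh_layer.weight.var().item()
print(kaiming_var, 2 / fan_in)
print(xavier_var, 2 / (fan_in + fan_out))
```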

See Batch Normalization and Initialization for the full story.

Minimal Example¶

import torch.nn as nn

model = nn.Sequential(
    nn.Linear(64, 128),
    nn.ReLU(),
    nn.Linear(128, 128),
    nn.ReLU(),
    nn.Linear(128, 10),   # raw logits; no softmax here
)
loss_fn = nn.CrossEntropyLoss()   # expects logits

No activation on the final layer — CrossEntropyLoss applies log_softmax internally.
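As a usage sketch, the same model can be run for one forward/backward pass on random stand-in data (the batch of 32 and the random labels are placeholders, not a real dataset). When probabilities are needed at inference time, softmax is applied explicitly, outside the loss.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Sequential(
    nn.Linear(64, 128),
    nn.ReLU(),
    nn.Linear(128, 128),
    nn.ReLU(),
    nn.Linear(128, 10),          # raw logits
)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(32, 64)              # placeholder inputs
y = torch.randint(0, 10, (32,))      # placeholder class labels

logits = model(x)
loss = loss_fn(logits, y)            # logits go straight into the loss
loss.backward()

with torch.no_grad():
    probs = torch.softmax(model(x), dim=1)   # only for reporting / inference
```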

What To Inspect¶

the fraction of ReLU units that are zero on every example of a representative batch (above ~30% suggests a dead-neuron problem)
gradient magnitudes per layer — shrinking by depth points at saturating nonlinearities
pre-activation distributions — if they drift far from zero, initialization or normalization is broken
whether the final layer matches the loss (an explicit softmax followed by CrossEntropyLoss is the classic double-softmax bug)
what swapping ReLU for GELU or Swish does — usually small, sometimes meaningful
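The per-layer gradient check in the list above can be sketched as follows. The helper names, the 20-layer depth, the width of 64, and the random data are all assumptions of this example; each network is given the initialization matched to its activation so that the comparison isolates the nonlinearity.

```python
import torch
import torch.nn as nn

def make_mlp(activation, init_fn, depth=20, width=64):
    """Deep MLP with `depth` hidden layers and a scalar output."""
    layers = []
    for _ in range(depth):
        linear = nn.Linear(width, width)
        init_fn(linear.weight)
        nn.init.zeros_(linear.bias)
        layers += [linear, activation()]
    layers.append(nn.Linear(width, 1))
    return nn.Sequential(*layers)

def layer_grad_norms(model, x, y):
    """Gradient norm of each Linear layer's weight, ordered input to output."""
    model.zero_grad()
    nn.functional.mse_loss(model(x), y).backward()
    return [m.weight.grad.norm().item() for m in model if isinstance(m, nn.Linear)]

torch.manual_seed(0)
x, y = torch.randn(128, 64), torch.randn(128, 1)

sigmoid_net = make_mlp(nn.Sigmoid, nn.init.xavier_uniform_)
relu_net = make_mlp(nn.ReLU, lambda w: nn.init.kaiming_normal_(w, nonlinearity="relu"))

sigmoid_norms = layer_grad_norms(sigmoid_net, x, y)
relu_norms = layer_grad_norms(relu_net, x, y)

for i, (s, r) in enumerate(zip(sigmoid_norms, relu_norms)):
    print(f"layer {i:2d}  sigmoid={s:.2e}  relu={r:.2e}")
```

In the sigmoid network the norms collapse by many orders of magnitude toward the input; in the ReLU network they stay within a much narrower band.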

Failure Pattern¶

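Applying softmax before CrossEntropyLoss. The loss treats the already-normalized probabilities as logits and applies its own log-softmax on top. Because those inputs are confined to [0, 1], the second softmax can never become confident: the loss cannot approach zero and the gradients are attenuated, so training is slow and stalls at a poor loss.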
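The effect is easy to see on a single, perfectly confident prediction. In this sketch (10 classes and target index 3 are arbitrary choices) the correct call drives the loss to essentially zero, while the double-softmax call cannot go below `log(e + K - 1) - 1` for `K` classes, about 1.46 when `K = 10`.

```python
import torch
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss()
num_classes = 10
target = torch.tensor([3])

# A maximally confident, correct prediction: a huge logit on the true class.
logits = torch.zeros(1, num_classes)
logits[0, 3] = 100.0

correct = loss_fn(logits, target).item()                       # logits in, as intended
buggy = loss_fn(torch.softmax(logits, dim=1), target).item()   # probabilities in (the bug)

print(f"correct: {correct:.4f}")
print(f"buggy:   {buggy:.4f}")
```

Even a perfect model pays a large loss under the bug, so the reported loss plateaus well above zero no matter how long training runs.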

A second failure pattern is placing sigmoid or tanh in hidden layers of a deep network and blaming the optimizer when gradients vanish. The activation is the cause.

A third failure pattern is ignoring dead ReLUs. The network keeps training, but accuracy plateaus early because a large share of the capacity is silently turned off.

A fourth failure pattern is mismatched initialization. Kaiming init with tanh starts with pre-activations that are too large and pushes units toward saturation; Xavier init with ReLU lets the activation variance shrink layer by layer. Either way the signal degrades with depth.

Quick Checks¶

Is the final layer activation matched to the loss?
Are the hidden activations ReLU (or a ReLU relative)?
Is the initialization matched to the activation family?
What fraction of ReLU units are zero on every example in a batch? Anything over 30% is suspicious.
Are gradient magnitudes roughly comparable across layers?

Practice¶

Train the same MLP with sigmoid, tanh, ReLU, and GELU hidden activations. Plot loss curves and compare.
Stack 20 layers with sigmoid and ReLU respectively and plot gradient magnitudes per layer.
Count dead ReLUs after a few epochs with a too-large learning rate. Repeat with a smaller learning rate.
Swap ReLU for Leaky ReLU in the dead-neuron scenario and observe the recovery.
Deliberately apply softmax before CrossEntropyLoss and show how training degrades compared with passing raw logits.
Explain why GELU is common in transformers and why ReLU is still common in CNNs.
Describe one reason tanh survives in LSTM cells while being unusual in feedforward nets.
Explain why the output activation for multi-label classification is sigmoid, not softmax.
State the relationship between activation choice and initialization scheme.
Describe what happens to gradients when a sigmoid is saturated at 0.99.

Longer Connection¶

Activation functions sit next to:

Backpropagation — the chain-rule picture that makes dead neurons and vanishing gradients concrete
Batch Normalization and Initialization — the layers and init schemes that keep pre-activations in a healthy range
Optimizers and Regularization — the update rules that consume the gradient through the activation
Debugging Deep Learning — how to locate the broken layer when training misbehaves

The activation is the nonlinear knob that makes depth work. Picking it is easy; diagnosing when it misbehaves is where the skill lives.