Activation Functions

What This Is

Activation functions are the element-wise nonlinearities a neural network inserts between linear layers. Without them a stack of linear layers collapses algebraically into a single linear layer — the activation is what gives depth its representational power.
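The collapse claim can be checked numerically. The sketch below is a minimal illustration in plain Python, assuming two small hand-picked weight matrices (`W1`, `W2`) and no biases; the helper names are hypothetical and not part of any library.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A (m x k) by matrix B (k x n)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def relu(v):
    """Element-wise ReLU on a vector."""
    return [max(0.0, vi) for vi in v]

# Illustrative weights (assumed values, biases omitted for brevity).
W1 = [[1.0, -2.0], [0.5, 1.0], [-1.0, 0.0]]   # 3x2: first linear layer
W2 = [[1.0, 1.0, -1.0], [0.0, 2.0, 1.0]]      # 2x3: second linear layer
x = [1.0, 2.0]

stacked = matvec(W2, matvec(W1, x))           # two linear layers, no activation
collapsed = matvec(matmul(W2, W1), x)         # one linear layer with weights W2 @ W1
with_relu = matvec(W2, relu(matvec(W1, x)))   # ReLU in between breaks the equivalence

print(stacked, collapsed, with_relu)
```

The first two results agree because matrix multiplication is associative; the third differs because ReLU zeroes the negative intermediate values before the second layer sees them.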

The practical lesson is that the activation choice is almost always ReLU (and its relatives) for hidden layers and softmax (for classification) or sigmoid (for binary and gating) for outputs. The interesting part of this topic is not the function formulas — it is the gradient-flow consequences of each choice and the diagnostic signs of their failure modes.

When You Use It

  • between every linear layer inside a neural network
  • at the output, matched to the loss: linear + MSE, sigmoid + BCE, softmax + cross-entropy
  • in gating structures: LSTM gates, attention softmax, gated feedforward
  • at the end of a convolutional block before pooling

The Core Set

Function | Formula | Range | Derivative | Notes
ReLU | max(0, x) | [0, ∞) | 1 if x > 0 else 0 | default for hidden layers
Leaky ReLU | max(αx, x), α ≈ 0.01 | (-∞, ∞) | 1 if x > 0 else α | fixes dead-ReLU
ELU | x if x > 0 else α(e^x - 1) | (-α, ∞) | smooth | smoother alternative to leaky ReLU
GELU | x · Φ(x) | ≈ [-0.17, ∞) | smooth | standard in transformers
Sigmoid | 1 / (1 + e^-x) | (0, 1) | σ(x)(1 - σ(x)), max 0.25 | output activation for binary
Tanh | (e^x - e^-x) / (e^x + e^-x) | (-1, 1) | 1 - tanh²(x), max 1 | zero-centered, used in RNNs
Softmax | e^{x_i} / Σ_j e^{x_j} | (0, 1) each, sum = 1 | vector-valued Jacobian | output activation for multi-class
Swish / SiLU | x · σ(x) | ≈ [-0.28, ∞) | smooth | common in modern vision backbones
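For reference, the formulas in the table can be written out as scalar functions. This is a sketch in plain Python rather than the PyTorch implementations; the default `alpha` values are assumptions chosen for illustration, and tanh is omitted because `math.tanh` already provides it.

```python
import math

def relu(x):
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Equivalent to max(alpha * x, x) when 0 < alpha < 1.
    return x if x > 0 else alpha * x

def elu(x, alpha=1.0):
    # expm1(x) = e^x - 1, computed accurately for small x.
    return x if x > 0 else alpha * math.expm1(x)

def gelu(x):
    # Exact form x * Phi(x), with Phi the standard normal CDF.
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def sigmoid(x):
    # Branch on the sign so exp() never receives a large positive argument.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def silu(x):
    return x * sigmoid(x)

def softmax(xs):
    # Subtract the max before exponentiating for numerical stability.
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]
```

The sign branch in `sigmoid` and the max subtraction in `softmax` do not change the mathematical result; they only keep the intermediate exponentials from overflowing.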

ReLU Is The Default — And Why

ReLU has a derivative of exactly 1 on the positive side. That matters because in the chain-rule product of backpropagation, factors of 1 do not shrink the gradient. Stacking sigmoids (each with max derivative 0.25) across 20 layers bounds the activation part of that product by 0.25^20 ≈ 9 × 10^-13, which is a vanishing gradient. ReLU avoids that for all positive pre-activations.
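The bound is easy to reproduce. The sketch below only multiplies the largest possible activation derivatives; it deliberately ignores the weight matrices that also appear in the real chain-rule product, and the function name is a hypothetical helper.

```python
def max_gradient_factor(max_derivative, depth):
    """Upper bound on the product of `depth` activation derivatives."""
    return max_derivative ** depth

sigmoid_bound = max_gradient_factor(0.25, 20)   # sigmoid: each factor is at most 0.25
relu_bound = max_gradient_factor(1.0, 20)       # ReLU on an all-positive path: each factor is 1

print(f"sigmoid, 20 layers: {sigmoid_bound:.2e}")
print(f"ReLU, 20 layers:    {relu_bound:.2e}")
```

Because 0.25^20 equals 2^-40, the sigmoid bound is roughly 9.1 × 10^-13, while the ReLU factor stays at 1.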

For the underlying chain-rule picture, see Backpropagation.

Dead Neurons — The ReLU Failure Mode

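If a ReLU unit's pre-activation becomes negative for every training example, typically after a large update early in training, its gradient is zero for every example. No gradient means no update, so the unit never learns and never recovers. The entire neuron is dead.

Diagnostic: count the fraction of units whose activation is zero for every example in a batch. Individual zeros are normal, since ReLU zeroes every negative pre-activation; a unit that never fires is not. Above ~30% is usually a dead-neuron problem.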

Fixes:

  • lower the learning rate — large updates often push units into the dead zone
  • use Leaky ReLU, ELU, or GELU — these have nonzero gradient on the negative side
  • use a proper initialization (see below) so pre-activations start balanced around zero
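The dead-unit diagnostic can be sketched as follows. This is a plain-Python illustration on a hypothetical 4-example, 5-unit batch of post-ReLU activations; `dead_unit_fraction` is an assumed helper name, and in practice the same per-unit check would run on a layer's activation tensor.

```python
def dead_unit_fraction(activations):
    """Fraction of units whose post-ReLU output is zero for every example.

    `activations` is a list of rows: one row per example, one column per unit.
    """
    num_units = len(activations[0])
    dead = sum(
        1 for j in range(num_units)
        if all(row[j] == 0.0 for row in activations)
    )
    return dead / num_units

# Hypothetical post-ReLU activations: 4 examples x 5 units.
batch = [
    [0.0, 1.2, 0.0, 0.3, 0.0],
    [0.0, 0.0, 0.0, 0.9, 0.0],
    [0.0, 0.4, 0.0, 0.0, 0.0],
    [0.0, 2.1, 0.0, 0.5, 0.7],
]
fraction = dead_unit_fraction(batch)   # units 0 and 2 never fire
print(f"dead units: {fraction:.0%}")
```

Note that unit 4 is zero on three of four examples but still counts as alive: the check is per unit across the whole batch, not per individual activation.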

Vanishing Gradients — The Sigmoid / Tanh Failure Mode

Sigmoid saturates to 0 or 1 far from the origin; its derivative approaches zero in both tails. When inputs to a sigmoid are large in magnitude, the gradient through that unit is effectively zero and backpropagation cannot update upstream weights.
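To make the saturation concrete, the following sketch evaluates the sigmoid and its derivative σ(x)(1 - σ(x)) at a few inputs of increasing magnitude. The sample points are arbitrary choices for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The derivative peaks at 0.25 at the origin and decays quickly in both tails.
for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x={x:5.1f}  sigmoid={sigmoid(x):.5f}  grad={sigmoid_grad(x):.2e}")
```

The derivative is symmetric in x, so large negative inputs saturate in the same way as large positive ones.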

Tanh has the same problem (saturation at ±1) but is zero-centered, which is why it was the pre-ReLU default for hidden layers. Still, modern networks almost never use sigmoid or tanh in hidden layers — the saturating gradient is too costly in deep stacks.

When you see sigmoid or tanh:

  • output layer of a binary classifier (with BCE loss)
  • RNN/LSTM gates, where saturation is actually desirable for a gate
  • the output of a generator with a bounded range
  • legacy code — check if ReLU would help

Output Activation Must Match The Loss

Task | Output Activation | Loss
regression | none (linear) | MSE / MAE
binary classification | sigmoid | binary cross-entropy
multi-class classification | softmax | cross-entropy
multi-label classification | sigmoid per label | binary cross-entropy per label

PyTorch note: nn.CrossEntropyLoss expects raw logits, not softmax outputs. Applying softmax before CrossEntropyLoss is a very common bug. Similarly nn.BCEWithLogitsLoss expects logits, not sigmoid outputs — and it is numerically safer than BCELoss(sigmoid(x), y).
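Why the logits-based form is numerically safer can be shown with scalars. The sketch below is plain Python, not PyTorch's actual implementation; `bce_naive` and `bce_with_logits` are hypothetical names, and the stable version uses the standard rearrangement max(x, 0) - x·y + log(1 + e^-|x|).

```python
import math

def bce_naive(logit, target):
    """Sigmoid first, then binary cross-entropy on the probability."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

def bce_with_logits(logit, target):
    """Binary cross-entropy computed directly from the logit.

    Algebraically equal to bce_naive, rearranged so that neither
    log(0) nor exp(large positive number) is ever evaluated.
    """
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

# Moderate logit: both versions agree.
print(bce_naive(2.0, 1.0), bce_with_logits(2.0, 1.0))

# Confidently wrong prediction: the sigmoid rounds to exactly 1.0 in float64,
# so the naive version would evaluate log(0) and fail.
print(bce_with_logits(40.0, 0.0))
```

With a logit of 40 and a target of 0, the naive path raises a math domain error in plain Python, while the rearranged form returns a finite loss of about 40.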

Initialization Matches The Activation

Activation choice and weight initialization are coupled:

  • ReLU family: Kaiming (He) initialization, variance 2 / fan_in
  • tanh / sigmoid: Xavier (Glorot) initialization, variance 1 / fan_in or 2 / (fan_in + fan_out)

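PyTorch layers initialize themselves, but the nn.Linear default is a generic uniform scheme, U(-1/sqrt(fan_in), 1/sqrt(fan_in)), whose variance of 1 / (3 · fan_in) is smaller than the He value. It usually trains, but it is not tuned for ReLU. When initialization matters (deep stacks, custom layers), call nn.init.kaiming_normal_(weight, nonlinearity="relu") or the equivalent.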
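The two variance rules translate directly into standard deviations for a normal initializer. This is a small sketch in plain Python; `he_std` and `xavier_std` are hypothetical helper names, and the 128 x 128 layer shape is just an example size.

```python
import math

def he_std(fan_in):
    """Standard deviation for Kaiming (He) normal init: variance 2 / fan_in."""
    return math.sqrt(2.0 / fan_in)

def xavier_std(fan_in, fan_out):
    """Standard deviation for Xavier (Glorot) normal init: variance 2 / (fan_in + fan_out)."""
    return math.sqrt(2.0 / (fan_in + fan_out))

# Example layer shape (assumed): 128 inputs, 128 outputs.
fan_in, fan_out = 128, 128
print(f"He std:     {he_std(fan_in):.4f}")
print(f"Xavier std: {xavier_std(fan_in, fan_out):.4f}")
```

For a square layer the He variance is exactly twice the Xavier variance; the extra factor of 2 compensates for ReLU zeroing roughly half of the pre-activations.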

See Batch Normalization and Initialization for the full story.

Minimal Example

import torch.nn as nn

model = nn.Sequential(
    nn.Linear(64, 128),
    nn.ReLU(),
    nn.Linear(128, 128),
    nn.ReLU(),
    nn.Linear(128, 10),   # raw logits; no softmax here
)
loss_fn = nn.CrossEntropyLoss()   # expects logits

No activation on the final layer — CrossEntropyLoss applies log_softmax internally.

What To Inspect

  • the fraction of ReLU units that are zero for every example in a representative batch; above 30% is a dead-neuron problem
  • gradient magnitudes per layer — shrinking by depth points at saturating nonlinearities
  • pre-activation distributions — if they drift far from zero, initialization or normalization is broken
  • whether the final layer matches the loss (an explicit softmax followed by CrossEntropyLoss is the classic double-softmax bug)
  • what swapping ReLU for GELU or Swish does — usually small, sometimes meaningful

Failure Pattern

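Applying softmax before CrossEntropyLoss. The loss sees already-normalized probabilities in [0, 1] and applies its own log-softmax on top. Because those inputs differ by at most 1, the second softmax is close to uniform: gradients are compressed and the loss can no longer approach zero (with 10 classes it cannot fall below about 1.46). Training slows down or stalls.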
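The effect of the double softmax can be reproduced without a framework. The sketch below is plain Python with hypothetical helper names; it mimics a logits-based cross-entropy on a 10-class example where the model is extremely confident and correct.

```python
import math

def log_softmax(xs):
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(xs)
    log_total = m + math.log(sum(math.exp(v - m) for v in xs))
    return [v - log_total for v in xs]

def softmax(xs):
    return [math.exp(v) for v in log_softmax(xs)]

def cross_entropy_from_logits(logits, target):
    """What a logits-based cross-entropy computes: -log_softmax(logits)[target]."""
    return -log_softmax(logits)[target]

# A 10-class prediction that is extremely confident and correct about class 0.
logits = [50.0] + [0.0] * 9

correct_loss = cross_entropy_from_logits(logits, 0)           # logits passed directly
buggy_loss = cross_entropy_from_logits(softmax(logits), 0)    # softmax applied twice
floor = math.log(1.0 + 9.0 / math.e)                          # best case for the buggy path

print(f"correct: {abs(correct_loss):.4f}  buggy: {buggy_loss:.4f}  floor: {floor:.4f}")
```

Even with a near-perfect prediction the buggy path reports a loss of about 1.46, because the inputs to the second softmax are confined to [0, 1].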

A second failure pattern is placing sigmoid or tanh in hidden layers of a deep network and blaming the optimizer when gradients vanish. The activation is the cause.

A third failure pattern is ignoring dead ReLUs. The network keeps training; the accuracy plateaus early because half the capacity is silently turned off.

A fourth failure pattern is mismatched initialization — Kaiming init with a tanh activation, or Xavier init with ReLU — which produces pre-activation distributions that drift and saturate within a few batches.

Quick Checks

  1. Is the final layer activation matched to the loss?
  2. Are the hidden activations ReLU (or a ReLU relative)?
  3. Is the initialization matched to the activation family?
  4. What fraction of ReLU units are zero for every example in a batch? Anything over 30% is suspicious.
  5. Are gradient magnitudes roughly comparable across layers?

Practice

  1. Train the same MLP with sigmoid, tanh, ReLU, and GELU hidden activations. Plot loss curves and compare.
  2. Stack 20 layers with sigmoid and ReLU respectively and plot gradient magnitudes per layer.
  3. Count dead ReLUs after a few epochs with a too-large learning rate. Repeat with a smaller learning rate.
  4. Swap ReLU for Leaky ReLU in the dead-neuron scenario and observe the recovery.
  5. Deliberately apply softmax before CrossEntropyLoss and show how training slows or stalls compared with passing raw logits.
  6. Explain why GELU is common in transformers and why ReLU is still common in CNNs.
  7. Describe one reason tanh survives in LSTM cells while being unusual in feedforward nets.
  8. Explain why the output activation for multi-label classification is sigmoid, not softmax.
  9. State the relationship between activation choice and initialization scheme.
  10. Describe what happens to gradients when a sigmoid is saturated at 0.99.

Longer Connection

Activation functions sit next to the topics linked earlier on this page:

  • Backpropagation
  • Batch Normalization and Initialization

The activation is the nonlinear knob that makes depth work. Picking it is easy; diagnosing when it misbehaves is where the skill lives.