Backpropagation
What This Is
Backpropagation is how a neural network computes gradients of its loss with respect to every parameter. It is the reverse-mode automatic differentiation of the forward pass: walk the computation graph forward to compute the loss, then walk it backward applying the chain rule to accumulate gradients.
The practical lesson is that backpropagation is not a separate algorithm you write — it is built into PyTorch's autograd. What you actually need to understand is the chain rule picture: every gradient you see is a product of partial derivatives along one path through the computation graph. When training breaks — NaN gradients, zero gradients, exploding gradients — it breaks on one of those paths, and knowing which path is the whole debugging skill.
When You Use It
- every time you call .backward() on a loss tensor (which is always)
- whenever you need to debug why a gradient is zero, NaN, or exploding
- when deciding whether to freeze parameters, add a skip connection, or change activation
- when computing custom losses — you need to know which paths gradients will flow along
The Chain Rule Core
For a scalar loss L that depends on z, which depends on w:
∂L/∂w = (∂L/∂z) · (∂z/∂w)
For a longer chain L → z_N → z_{N-1} → ... → z_1 → w:
∂L/∂w = (∂L/∂z_N) · (∂z_N/∂z_{N-1}) · ... · (∂z_1/∂w)
Every gradient in a deep network is a product like this. Every failure mode is a property of that product.
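A quick way to convince yourself of the product form is to multiply the local derivatives by hand and compare against a finite-difference estimate. The sketch below uses a hypothetical three-link chain (a scaling, a tanh, a square) chosen only for illustration; it is not taken from any particular network.

```python
import math

def forward(w):
    """Hypothetical chain w -> z1 -> z2 -> L, returning every intermediate."""
    z1 = 3.0 * w          # dz1/dw  = 3
    z2 = math.tanh(z1)    # dz2/dz1 = 1 - tanh(z1)^2
    loss = z2 ** 2        # dL/dz2  = 2 * z2
    return z1, z2, loss

w = 0.1
z1, z2, loss = forward(w)

# Chain rule: multiply the local derivatives along the single path.
grad_chain = (2.0 * z2) * (1.0 - z2 ** 2) * 3.0

# Central finite difference as an independent check.
eps = 1e-6
grad_fd = (forward(w + eps)[2] - forward(w - eps)[2]) / (2 * eps)

print(grad_chain, grad_fd)
```

The two numbers should agree to several decimal places. If one of the three factors were zero or very large, the whole gradient would be too.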
The Forward-Backward Picture
Take a tiny two-layer network:
x → [W1, b1] → h = ReLU(W1 x + b1) → [W2, b2] → y = W2 h + b2 → L = (y - target)^2
Forward pass: compute h, then y, then L. Cache the intermediates (z1 = W1 x + b1, h, and y = W2 h + b2); backprop needs them.
Backward pass:
∂L/∂y = 2 (y - target)
∂L/∂W2 = (∂L/∂y) · h^T
∂L/∂b2 = (∂L/∂y)
∂L/∂h = W2^T · (∂L/∂y)
∂L/∂z1 = (∂L/∂h) ⊙ ReLU'(z1) # elementwise multiply by the ReLU gate mask
∂L/∂W1 = (∂L/∂z1) · x^T
∂L/∂b1 = (∂L/∂z1)
Read the pattern: each parameter's gradient is the upstream gradient multiplied by a local term (h for W2, x for W1, the ReLU mask for z1). Autograd saves those local terms during the forward pass, which is why training memory grows with the number of cached activations.
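The backward equations for the two-layer network can be checked against autograd directly. The following is a minimal sketch, assuming column-vector inputs and hypothetical sizes (3 inputs, 4 hidden units, 1 output); the variable names mirror the equations and are otherwise illustrative.

```python
import torch

torch.manual_seed(0)

# Hypothetical sizes: 3 inputs, 4 hidden units, 1 output.
x = torch.randn(3, 1)
target = torch.randn(1, 1)
W1 = torch.randn(4, 3, requires_grad=True)
b1 = torch.randn(4, 1, requires_grad=True)
W2 = torch.randn(1, 4, requires_grad=True)
b2 = torch.randn(1, 1, requires_grad=True)

# Forward pass, keeping the intermediates the backward pass needs.
z1 = W1 @ x + b1
h = torch.relu(z1)
y = W2 @ h + b2
loss = ((y - target) ** 2).sum()

# Autograd's answer.
loss.backward()

# Manual backward pass, following the equations line by line.
with torch.no_grad():
    dy = 2 * (y - target)          # dL/dy
    dW2 = dy @ h.T                 # dL/dW2
    db2 = dy                       # dL/db2
    dh = W2.T @ dy                 # dL/dh
    dz1 = dh * (z1 > 0).float()    # dL/dz1, ReLU gate mask
    dW1 = dz1 @ x.T                # dL/dW1
    db1 = dz1                      # dL/db1

for name, manual, auto in [("W1", dW1, W1.grad), ("b1", db1, b1.grad),
                           ("W2", dW2, W2.grad), ("b2", db2, b2.grad)]:
    print(name, torch.allclose(manual, auto))
```

Each line should print True. Note that the manual pass reuses z1 and h from the forward pass, which is exactly the caching autograd does internally.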
Why Gradients Vanish Or Explode
Look at the chain rule product:
∂L/∂z_1 = (∂L/∂z_N) · Π_i (∂z_{i+1}/∂z_i)
If each ∂z_{i+1}/∂z_i has magnitude less than 1, the product shrinks exponentially with depth — vanishing gradients. If each has magnitude greater than 1, the product explodes.
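To put numbers on the exponential effect, here is a small sketch of the product for a scalar chain. It assumes 20 layers, ignores the weights (treats them as 1), and uses the best case for sigmoid, where every pre-activation sits at 0 and the derivative is at its maximum of 0.25. The factor 1.5 in the second case is an arbitrary illustrative value.

```python
import math

def sigmoid_prime(z):
    """Derivative of the logistic sigmoid at z."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

depth = 20

# Best case for sigmoid: the derivative peaks at 0.25 when z = 0.
vanishing = math.prod(sigmoid_prime(0.0) for _ in range(depth))

# A hypothetical per-layer factor of 1.5 compounds the other way.
exploding = math.prod(1.5 for _ in range(depth))

print(f"sigmoid factors: {vanishing:.3e}")
print(f"1.5 factors:     {exploding:.3e}")
```

Even in its best case the sigmoid chain scales the gradient by roughly 9e-13 over 20 layers, while a modest factor of 1.5 grows it by more than three orders of magnitude.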
Vanishing and exploding gradients are not a flaw of backprop; they are a property of the architecture. The standard fixes target the per-layer factor:
- ReLU: the derivative is 0 or 1, which avoids the saturation of sigmoid/tanh; see Activation Functions
- Batch normalization: keeps layer inputs in a well-conditioned range; see Batch Normalization and Initialization
- Residual connections: the skip adds an identity path, so each per-layer factor becomes 1 plus the block's own derivative and the product keeps a term that cannot vanish; see Convolutional Neural Networks
- Careful initialization: Kaiming (for ReLU) or Xavier (for tanh) initializations are chosen so the variance of activations is preserved layer to layer
- Gradient clipping: a hard cap on the gradient norm to prevent explosion; standard for RNNs
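Of the fixes listed, gradient clipping is the easiest to show in a few lines. The sketch below assumes a toy model and a batch with deliberately large targets (both hypothetical) so that the unclipped gradient norm is large; it uses torch.nn.utils.clip_grad_norm_ between backward() and step().

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical toy model and batch, chosen only to produce large gradients.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, target = torch.randn(32, 8), 100.0 * torch.randn(32, 1)

optimizer.zero_grad()
loss = nn.functional.mse_loss(model(x), target)
loss.backward()

# clip_grad_norm_ rescales all gradients in place so their combined L2 norm
# is at most max_norm, and returns the norm measured before clipping.
max_norm = 1.0
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
norm_after = torch.sqrt(sum((p.grad ** 2).sum() for p in model.parameters()))

print(f"before: {norm_before.item():.2f}  after: {norm_after.item():.2f}")
optimizer.step()
```

Clipping changes the length of the update, not its direction, so it limits the damage from an exploding step without fixing the per-layer factors that caused it.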
Autograd In Practice
import torch
x = torch.tensor(2.0, requires_grad=True)
y = x ** 2 + 3 * x
y.backward()
print(x.grad) # 2*x + 3 = 7
Three rules for clean gradient flow:
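- never detach a tensor you want gradients through: .detach() and tensor.data break the graph
- never mutate a tensor in place if it is used later in the forward pass: x += 1 on a tensor that autograd saved for the backward pass typically raises a RuntimeError in backward(), and edits made through .data bypass that check and can silently give wrong gradients
- zero the gradients every step: call optimizer.zero_grad() before loss.backward(), because PyTorch accumulates into .grad by default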
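The three rules can each be reproduced on scalar tensors. This is a minimal sketch with made-up values; the variable names are illustrative only.

```python
import torch

# Gradients accumulate across backward() calls until they are cleared.
w = torch.tensor(2.0, requires_grad=True)
(w ** 2).backward()
first = w.grad.item()        # d(w^2)/dw = 2w = 4
(w ** 2).backward()
accumulated = w.grad.item()  # 4 + 4 = 8, not 4
w.grad.zero_()               # reset by hand; optimizer.zero_grad() clears
                             # the gradients of every parameter it manages

# detach() cuts the path, so nothing upstream of it receives a gradient.
a = torch.tensor(3.0, requires_grad=True)
b = torch.tensor(5.0, requires_grad=True)
out = (a * 2).detach() * b   # the a-path is cut, the b-path is intact
out.backward()

# An in-place edit of a tensor that autograd saved for backward is rejected.
x = torch.tensor(1.0, requires_grad=True)
y = x.exp()                  # exp saves its output for the backward pass
y += 1                       # in-place edit invalidates that saved output
try:
    y.backward()
    inplace_error = None
except RuntimeError as err:
    inplace_error = err

print(first, accumulated)    # 4.0 8.0
print(a.grad, b.grad)        # None tensor(6.)
print(type(inplace_error).__name__)
```

The detach case is the dangerous one in practice, because it produces no error: a.grad simply stays None.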
Inspecting Gradients
# Run after loss.backward(); `model` is any torch.nn.Module.
for name, p in model.named_parameters():
    if p.grad is None:
        print(name, "grad is None (not in the graph)")
    else:
        print(name, p.grad.abs().mean().item())
That short loop is the most useful debugging tool you have. A mean near zero on early layers with a nonzero mean on late layers indicates vanishing gradients. NaN on any layer usually means the loss hit log(0) or sqrt(negative). Magnitudes around 10^10 on any layer signal explosion: clip the gradients or reduce the learning rate.
What To Inspect
- gradient magnitudes layer by layer: any layer near zero or NaN is the broken one
- whether the loss actually decreases over the first dozen steps
- whether a tiny overfit batch (4-8 examples) drives the loss toward zero; if not, the model or the loss has a bug that is not about data
- whether optimizer.zero_grad() is called before backward()
- whether any operation is detaching the graph; a hook registered with Tensor.register_hook helps locate this
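One way to locate a cut in the graph is to attach a hook to an intermediate tensor and see whether it fires during backward(). The sketch below assumes a hypothetical two-matrix network; record and seen are illustrative helper names, not part of PyTorch.

```python
import torch

torch.manual_seed(0)

x = torch.randn(4, 3)
w1 = torch.randn(3, 5, requires_grad=True)
w2 = torch.randn(5, 1, requires_grad=True)

h = torch.relu(x @ w1)

seen = {}

def record(name):
    # Returns a hook that stores the norm of the gradient reaching the tensor.
    def hook(grad):
        seen[name] = grad.norm().item()
    return hook

# Tensor.register_hook fires during backward() with dL/d(tensor).
h.register_hook(record("h"))

loss = (h @ w2).pow(2).mean()
loss.backward()

print(seen)              # the hook ran, so a gradient reached h
print(w1.grad is None)   # False: the path back to w1 is intact
```

If a hook on some intermediate never fires, the cut lies between the loss and that tensor. Moving the hook earlier or later in the forward pass narrows it down.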
Failure Pattern
Expecting gradients to flow through a .detach() or a .data access. Both cut the graph silently: parameters that feed into the detached tensor (everything earlier in the forward pass on that path) receive no gradient, are never updated, and training looks stuck.
A second failure pattern is accumulating gradients across steps without calling zero_grad(). Loss curves look noisy, and after a few steps the update sizes blow up.
A third failure pattern is an inf or NaN from log(probability) when the probability underflows to exactly zero. Use log_softmax + nll_loss (or cross_entropy, which combines them) instead of log(softmax(...)); the fused op computes log-probabilities with the log-sum-exp trick and never takes the log of an underflowed zero.
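The difference is easy to reproduce. The sketch below assumes a hypothetical pair of logits with a gap large enough that one softmax probability underflows to zero in float32.

```python
import torch
import torch.nn.functional as F

# Hypothetical logits with a large gap, so one probability underflows to 0.
logits = torch.tensor([[0.0, 1000.0]])
target = torch.tensor([0])

# Unstable: softmax underflows to exactly 0, then log(0) = -inf.
naive = -torch.log(torch.softmax(logits, dim=1))[0, target.item()]

# Stable: cross_entropy works in log space and never forms that probability.
stable = F.cross_entropy(logits, target)

print(naive.item(), stable.item())
```

The naive version yields an infinite loss, while cross_entropy returns a large but finite value that still has a usable gradient.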
A fourth failure pattern is using a vanilla deep stack with sigmoid activations and being surprised when gradients vanish. The fix is not "more learning rate"; the fix is ReLU, residuals, or normalization.
Quick Checks
- Is the loss a scalar requires_grad=True tensor connected to the parameters by differentiable operations?
- Is every parameter you want to train an nn.Parameter (not a plain tensor)?
- Does optimizer.zero_grad() run before every backward()?
- Do gradient magnitudes across layers look reasonable and not all near zero?
- Does the model overfit a 4-example batch cleanly?
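The overfit check is short enough to keep as a reusable snippet. This is a sketch under stated assumptions: a random 4-example regression batch, a small MLP, Adam with a learning rate of 1e-2, and 500 steps, all of which are illustrative choices.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical 4-example regression batch and a small MLP.
x = torch.randn(4, 10)
y = torch.randn(4, 1)
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

losses = []
for step in range(500):
    optimizer.zero_grad()                       # clear old gradients first
    loss = nn.functional.mse_loss(model(x), y)  # scalar loss on the tiny batch
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.6f}")
```

A healthy model and loss should memorize four examples easily. If the last loss is not far below the first, suspect the graph, the loss, or the optimizer wiring before suspecting the data.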
Practice
- Implement the forward and backward pass of a two-layer MLP by hand and compare to autograd.
- Train a 20-layer MLP with sigmoid activations and plot gradient magnitudes per layer. Repeat with ReLU and compare.
- Deliberately break the graph with .detach() and observe what stops learning.
- Add a residual skip to a deep MLP and show the gradient-magnitude difference.
- Apply gradient clipping to a training loop that diverges. Compare runs with and without clipping.
- Explain why cross-entropy combined with softmax is more numerically stable than applying them separately.
- Describe one case where vanishing gradients mimic a "learning rate too low" problem.
- Explain what an in-place operation on a graph tensor can break.
- Describe what Kaiming initialization preserves that random small-uniform does not.
- State the relationship between gradient explosion and learning-rate choice.
Longer Connection
Backpropagation sits next to:
- Activation Functions: the per-layer factor in the chain-rule product
- Batch Normalization and Initialization: the layer and the init scheme that keep the product well conditioned
- PyTorch Training Loops: the concrete code that calls backward()
- Optimizers and Regularization: what consumes the gradients once they are computed
- Debugging Deep Learning: the broader diagnostic workflow when a training run breaks
Backpropagation is not a mystery to unlock; it is a product of local derivatives you already understand. The skill is learning to read the chain rule and know which link in the chain is responsible for the training curve you see.