Backpropagation¶

What This Is¶

Backpropagation is how a neural network computes the gradient of its loss with respect to every parameter. It is reverse-mode automatic differentiation applied to the forward pass: walk the computation graph forward to compute the loss, then walk it backward, applying the chain rule to accumulate gradients.

The practical lesson is that backpropagation is not a separate algorithm you write; it is built into PyTorch's autograd. What you actually need to understand is the chain rule picture: every gradient you see is a product of partial derivatives along a path through the computation graph, summed over all such paths when a value feeds more than one downstream operation. When training breaks (NaN gradients, zero gradients, exploding gradients), it breaks on one of those paths, and knowing which path is the whole debugging skill.

When You Use It¶

every time you call .backward() on a loss tensor (that is, on every training step)
whenever you need to debug why a gradient is zero, NaN, or exploding
when deciding whether to freeze parameters, add a skip connection, or change activation
when computing custom losses — you need to know which paths gradients will flow along

The Chain Rule Core¶

For a scalar loss L that depends on z, which depends on w:

∂L/∂w = (∂L/∂z) · (∂z/∂w)

For a longer chain L → z_N → z_{N-1} → ... → z_1 → w:

∂L/∂w = (∂L/∂z_N) · (∂z_N/∂z_{N-1}) · ... · (∂z_1/∂w)

Every gradient in a deep network is a product like this. Every failure mode is a property of that product.
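
As a small sanity check of the product form, the sketch below builds a three-link scalar chain and compares the chain-rule product with a central finite difference. The particular functions (a scale by 3, a sine, a square) and the value of w are arbitrary choices made for illustration, not part of the original material.

```python
import math

def forward(w):
    """Chain w -> z1 -> z2 -> L, returning every intermediate value."""
    z1 = 3.0 * w          # dz1/dw  = 3
    z2 = math.sin(z1)     # dz2/dz1 = cos(z1)
    loss = z2 ** 2        # dL/dz2  = 2 * z2
    return z1, z2, loss

w = 0.4
z1, z2, loss = forward(w)

# Chain rule: multiply the local derivatives from the loss back to w.
analytic = (2.0 * z2) * math.cos(z1) * 3.0

# Central finite difference as an independent check.
eps = 1e-6
numeric = (forward(w + eps)[2] - forward(w - eps)[2]) / (2 * eps)

print(analytic, numeric)
```

The two numbers should agree to several decimal places; if they did not, one of the local derivatives would be wrong.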

The Forward-Backward Picture¶

Take a tiny two-layer network:

x → [W1, b1] → h = ReLU(W1 x + b1) → [W2, b2] → y = W2 h + b2 → L = (y - target)^2

Forward pass: compute h, then y, then L. Cache the intermediates (z1 = W1 x + b1, h, and y = W2 h + b2); the backward pass reuses them.

Backward pass:

∂L/∂y  = 2 (y - target)
∂L/∂W2 = (∂L/∂y)  · h^T
∂L/∂b2 = (∂L/∂y)
∂L/∂h  = W2^T · (∂L/∂y)
∂L/∂z1 = (∂L/∂h) ⊙ ReLU'(z1)           # elementwise multiply by the ReLU gate mask
∂L/∂W1 = (∂L/∂z1) · x^T
∂L/∂b1 = (∂L/∂z1)

Read the pattern: each parameter's gradient is the upstream gradient multiplied by something local (an input, an activation, or a gate mask). Autograd saves those local values during the forward pass, which is why training memory grows with the number of saved activations.
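
The backward-pass equations can be checked against autograd directly. The following is a minimal sketch, assuming column-vector inputs and arbitrary illustrative sizes (3 inputs, 4 hidden units, 1 output); the sizes, the seed, and the target value are assumptions, not values from the text.

```python
import torch

torch.manual_seed(0)
dtype = torch.float64

# Hypothetical sizes: 3 inputs, 4 hidden units, 1 output.
x = torch.randn(3, 1, dtype=dtype)
target = torch.tensor([[0.5]], dtype=dtype)
W1 = torch.randn(4, 3, dtype=dtype, requires_grad=True)
b1 = torch.randn(4, 1, dtype=dtype, requires_grad=True)
W2 = torch.randn(1, 4, dtype=dtype, requires_grad=True)
b2 = torch.randn(1, 1, dtype=dtype, requires_grad=True)

# Forward pass, keeping the intermediates the backward pass needs.
z1 = W1 @ x + b1
h = torch.relu(z1)
y = W2 @ h + b2
loss = ((y - target) ** 2).sum()

# Autograd's answer.
loss.backward()

# Hand-written backward pass, following the equations above.
with torch.no_grad():
    dL_dy = 2 * (y - target)
    dL_dW2 = dL_dy @ h.T
    dL_db2 = dL_dy
    dL_dh = W2.T @ dL_dy
    dL_dz1 = dL_dh * (z1 > 0).to(dtype)   # ReLU gate mask
    dL_dW1 = dL_dz1 @ x.T
    dL_db1 = dL_dz1

manual = {"W1": dL_dW1, "b1": dL_db1, "W2": dL_dW2, "b2": dL_db2}
auto = {"W1": W1.grad, "b1": b1.grad, "W2": W2.grad, "b2": b2.grad}
matches = {name: torch.allclose(manual[name], auto[name]) for name in manual}
print(matches)
```

Every entry in matches should be True. A mismatch on one parameter points at the specific equation that was transcribed wrongly.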

Why Gradients Vanish Or Explode¶

Look at the chain rule product:

∂L/∂z_1 = (∂L/∂z_N) · Π_{i=1}^{N-1} (∂z_{i+1}/∂z_i)

If every factor ∂z_{i+1}/∂z_i has magnitude less than 1, the product shrinks exponentially with depth: vanishing gradients. If every factor has magnitude greater than 1, the product grows exponentially with depth: exploding gradients.
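
To get a feel for the scale, the sketch below multiplies the same per-layer factor 20 times. It is a deliberately simplified scalar model: real layers have Jacobian matrices and factors that vary from layer to layer. The depth of 20 and the factor 1.5 are illustrative assumptions; 0.25 is the largest value the sigmoid derivative can take.

```python
def chain_product(factor, depth):
    """Product of `depth` identical per-layer factors."""
    product = 1.0
    for _ in range(depth):
        product *= factor
    return product

depth = 20
sigmoid_peak = 0.25   # largest value the sigmoid derivative can take (at input 0)

vanishing = chain_product(sigmoid_peak, depth)   # factor below 1
stable = chain_product(1.0, depth)               # factor of exactly 1, e.g. an active ReLU
exploding = chain_product(1.5, depth)            # factor above 1

print(f"{vanishing:.3e}  {stable:.1f}  {exploding:.3e}")
```

Even in the best case for sigmoid, 20 layers scale the gradient by roughly 1e-12, while a factor only modestly above 1 already grows it by three orders of magnitude.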

That is not a flaw of backprop; it is a flaw of the architecture. The standard fixes target the per-layer factor:

ReLU: the derivative is exactly 0 or 1, so active units pass the gradient through unscaled instead of shrinking it the way saturated sigmoid/tanh units do; see Activation Functions
Batch normalization: keeps layer inputs in a well-conditioned range; see Batch Normalization and Initialization
Residual connections: with z_{i+1} = z_i + F(z_i), the per-layer factor becomes 1 + ∂F/∂z_i, so the chain-rule product always keeps an identity path that cannot vanish; see Convolutional Neural Networks
Careful initialization: Kaiming (for ReLU) or Xavier (for tanh) initialization is chosen so that the variance of activations is preserved from layer to layer
Gradient clipping: a hard cap on the gradient norm to prevent explosion; standard for RNNs
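
Of these fixes, gradient clipping is the one that lives in the training loop rather than in the architecture. A minimal sketch using torch.nn.utils.clip_grad_norm_ follows; the model, the artificially scaled loss, and the cap of 1.0 are all assumptions chosen so the effect is visible.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(8, 1)

# Scale the loss up so the raw gradient norm is clearly above the cap.
x = torch.randn(16, 8)
loss = (1000.0 * model(x)).pow(2).mean()
loss.backward()

max_norm = 1.0   # hypothetical cap; tune per problem
# clip_grad_norm_ rescales gradients in place and returns the norm before clipping.
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))

print(float(norm_before), float(norm_after))
```

In a real loop the clipping call sits between loss.backward() and optimizer.step(). Logging the returned pre-clip norm is a cheap way to see how often the cap is actually hit.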

Autograd In Practice¶

import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 2 + 3 * x
y.backward()
print(x.grad)    # 2*x + 3 = 7

Three rules for clean gradient flow:

never detach a tensor you want gradients through: .detach() and tensor.data both cut the graph
never mutate a tensor in place if autograd saved it for the backward pass: an in-place op such as x += 1 on a graph tensor usually raises a RuntimeError at backward() time, and going around the check through .data can silently produce wrong gradients
zero the gradients every step: call optimizer.zero_grad() before loss.backward(), because PyTorch accumulates into .grad by default
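
Each rule can be observed on a scalar. The sketch below reuses the function from the example above; the variable names (cut, w, out, inplace_error) are illustrative, and the in-place case shows one way such a mutation surfaces, namely as an error raised by autograd.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)

# Accumulation: gradients add up across backward() calls until they are reset.
(x ** 2 + 3 * x).backward()
first = x.grad.item()          # 2*x + 3
(x ** 2 + 3 * x).backward()
accumulated = x.grad.item()    # previous value plus a second 2*x + 3
x.grad = None                  # reset, as optimizer.zero_grad() would for a parameter
(x ** 2 + 3 * x).backward()
after_reset = x.grad.item()

# Detach: the result carries no graph, so nothing before it receives a gradient.
cut = x.detach() ** 2

# In-place mutation: exp saves its output for backward, and += overwrites it.
w = torch.tensor(1.0, requires_grad=True)
out = w.exp()
out += 1
try:
    out.backward()
    inplace_error = None
except RuntimeError as err:
    inplace_error = str(err)

print(first, accumulated, after_reset, cut.requires_grad)
print(inplace_error)
```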

Inspecting Gradients¶

for name, p in model.named_parameters():
    if p.grad is None:
        print(name, "grad is None — not in the graph")
    else:
        print(name, p.grad.abs().mean().item())

That short loop is the most useful debugging tool you have. A gradient mean near zero on early layers with a clearly nonzero mean on late layers indicates vanishing gradients. A NaN gradient on any layer usually traces back to a loss that hit log(0), the square root of a negative number, or a division by zero. A gradient on the order of 10^10 on any layer signals explosion: clip the gradients or reduce the learning rate.
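
For a runnable version of that diagnosis, the sketch below builds a deliberately fragile model, a stack of sigmoid layers, and prints the mean absolute gradient of each weight matrix. The architecture, widths, random inputs, and the squared-output loss are assumptions made only to produce a visible pattern.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical model: 8 hidden sigmoid layers of width 32.
layers = [nn.Linear(16, 32), nn.Sigmoid()]
for _ in range(7):
    layers += [nn.Linear(32, 32), nn.Sigmoid()]
layers.append(nn.Linear(32, 1))
model = nn.Sequential(*layers)

x = torch.randn(64, 16)
loss = model(x).pow(2).mean()
loss.backward()

# Mean absolute gradient per weight matrix, in forward order.
grad_means = {
    name: p.grad.abs().mean().item()
    for name, p in model.named_parameters()
    if name.endswith("weight")
}
for name, value in grad_means.items():
    print(f"{name:12s} {value:.3e}")

values = list(grad_means.values())
first_layer, last_layer = values[0], values[-1]
```

The printed magnitudes should fall steadily from the last layer toward the first, which is the vanishing-gradient signature described above.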

What To Inspect¶

gradient magnitudes layer by layer — any layer near zero or NaN is the broken one
whether the loss actually decreases over the first dozen steps
whether a tiny overfit batch (4-8 examples) drives the loss toward zero — if not, the model or the loss has a bug that is not about data
whether optimizer.zero_grad() is called before backward()
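whether any operation is detaching the graph; a tensor with requires_grad == False or grad_fn == None marks the cut, and Tensor.register_hook shows whether a gradient ever arrives at a given tensor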
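
The tiny-batch overfit check is short enough to keep as a reusable snippet. The following is a minimal sketch on synthetic regression data; the model, optimizer settings, and step count are assumptions and would be replaced by the real model and one real batch.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical tiny regression batch: 4 examples, 5 features.
x = torch.randn(4, 5)
y = torch.randn(4, 1)

model = nn.Sequential(nn.Linear(5, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

losses = []
for step in range(300):
    optimizer.zero_grad()           # clear gradients from the previous step
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.6f}")
```

A model with this much capacity should drive the loss on four examples close to zero. If the real model cannot do the same on a real batch, the problem is in the model, the loss, or the wiring between them rather than in the data.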

Failure Pattern¶

Expecting gradients to flow through a .detach() or a .data access. Both cut the graph silently: parameters that feed the detached tensor (those used earlier in the forward pass) receive no gradient through it, so they are not updated and training looks stuck.

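A second failure pattern is unintentionally accumulating gradients across steps by never calling zero_grad(). Each backward() adds to the existing .grad, so every update applies the running sum of all past gradients. Loss curves look noisy, and after a few steps the update sizes blow up.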

A third failure pattern is a NaN from log(probability) when probability can be exactly zero. Use log_softmax + nll_loss (or cross_entropy, which combines them) instead of log(softmax(...)). The fused op works in the log domain (log-sum-exp), so it never takes the log of a probability that has underflowed to zero.
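
The difference shows up with a single extreme example. In the sketch below the logits are chosen, purely for illustration, so that the probability of the target class underflows to zero in float32.

```python
import torch
import torch.nn.functional as F

# Extreme logits chosen so the second class's probability underflows to 0 in float32.
logits = torch.tensor([[1000.0, 0.0]])
target = torch.tensor([1])

naive = torch.log(torch.softmax(logits, dim=1))   # log of an underflowed 0
stable = F.log_softmax(logits, dim=1)             # computed in the log domain

naive_loss = F.nll_loss(naive, target)
stable_loss = F.nll_loss(stable, target)
fused_loss = F.cross_entropy(logits, target)      # log_softmax + nll_loss in one call

print(naive_loss.item(), stable_loss.item(), fused_loss.item())
```

The naive path yields an infinite loss, from which NaN gradients typically follow, while the other two agree on a large but finite value.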

A fourth failure pattern is using a vanilla deep stack with sigmoid activations and being surprised when gradients vanish. The fix is not "more learning rate"; the fix is ReLU, residuals, or normalization.

Quick Checks¶

Is the loss a scalar tensor with requires_grad == True (and a non-None grad_fn), connected to the parameters by differentiable operations?
Is every parameter you want to train an nn.Parameter (not a plain tensor)?
Does optimizer.zero_grad() run before every backward()?
Do gradient magnitudes across layers look reasonable and not all near zero?
Does the model overfit a 4-example batch cleanly?

Practice¶

Implement the forward and backward pass of a two-layer MLP by hand and compare to autograd.
Train a 20-layer MLP with sigmoid activations and plot gradient magnitudes per layer. Repeat with ReLU and compare.
Deliberately break the graph with .detach() and observe what stops learning.
Add a residual skip to a deep MLP and show the gradient-magnitude difference.
Apply gradient clipping to a training loop that diverges. Compare runs with and without clipping.
Explain why cross-entropy combined with softmax is more numerically stable than applying them separately.
Describe one case where vanishing gradients mimic a "learning rate too low" problem.
Explain what an in-place operation on a graph tensor can break.
Describe what Kaiming initialization preserves that random small-uniform does not.
State the relationship between gradient explosion and learning-rate choice.

Longer Connection¶

Backpropagation sits next to:

Activation Functions — the per-layer factor in the chain-rule product
Batch Normalization and Initialization — the layer and the init scheme that keep the product well conditioned
PyTorch Training Loops — the concrete code that calls backward()
Optimizers and Regularization — what consumes the gradients once they are computed
Debugging Deep Learning — the broader diagnostic workflow when a training run breaks

Backpropagation is not a mystery to unlock; it is a product of local derivatives you already understand. The skill is learning to read the chain rule and know which link in the chain is responsible for the training curve you see.