Understanding Backpropagation: The Engine of Neural Network Learning

Neural Networks Guide

Backpropagation is often considered the most important algorithm in deep learning. It's the mechanism that allows neural networks to learn from their errors, adjusting millions (or billions) of parameters to minimize the difference between predicted and actual outputs.

What Is Backpropagation?

At its core, backpropagation (short for "backward propagation of errors") is an algorithm for computing the gradient of the loss function with respect to each weight in the network. It leverages the chain rule from calculus to efficiently compute these gradients layer by layer, starting from the output and working backward.

The key insight is that we don't need to compute each gradient independently. Instead, we can reuse intermediate results as we propagate backward through the network, making the computation dramatically more efficient.

The Forward Pass

Before we can backpropagate, we need a forward pass. During the forward pass, input data flows through the network layer by layer. At each layer, we compute a weighted sum and apply an activation function:

forward_pass.py
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    # Layer 1: hidden layer
    z1 = x @ W1 + b1
    a1 = sigmoid(z1)
    
    # Layer 2: output layer
    z2 = a1 @ W2 + b2
    a2 = sigmoid(z2)
    
    return z1, a1, z2, a2

The Backward Pass

The backward pass is where the magic happens. We compute the gradient of our loss function with respect to each parameter. Starting from the output layer, we compute how much each weight contributed to the error, then propagate these "error signals" backward through the network.

The chain rule tells us that the gradient of a composition of functions is the product of the gradients of each function. For a network composed of multiple layers, this means:

∂L/∂w = ∂L/∂a · ∂a/∂z · ∂z/∂w

The chain rule applied to a single layer
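To make this concrete, here is a tiny single-neuron example (the numbers are arbitrary, chosen purely for illustration). Each factor of the chain rule is computed separately, multiplied together, and then checked against a finite-difference approximation:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# One neuron with MSE loss: z = w*x + b, a = sigmoid(z), L = 0.5*(a - y)^2
w, b, x, y = 0.5, 0.1, 2.0, 1.0
z = w * x + b
a = sigmoid(z)

# Chain rule: dL/dw = dL/da * da/dz * dz/dw
dL_da = a - y        # derivative of 0.5*(a - y)^2 with respect to a
da_dz = a * (1 - a)  # sigmoid derivative
dz_dw = x            # derivative of w*x + b with respect to w
dL_dw = dL_da * da_dz * dz_dw

# Sanity check against a centered finite difference
eps = 1e-6
L = lambda w_: 0.5 * (sigmoid(w_ * x + b) - y) ** 2
numeric = (L(w + eps) - L(w - eps)) / (2 * eps)
assert abs(dL_dw - numeric) < 1e-9
```

The same three-factor product is exactly what the vectorized backward pass below computes, just with matrices instead of scalars.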

Computing Gradients Step by Step

Let's implement the backward pass for our two-layer network. We'll use mean squared error (MSE) averaged over the batch, L = (1/2m) Σ (a2 − y)², as our loss function:

backward_pass.py
def backward(x, y, z1, a1, z2, a2, W2):
    m = x.shape[0]  # batch size
    
    # Output layer gradients (MSE loss, sigmoid output)
    dz2 = (a2 - y) * a2 * (1 - a2)  # ∂L/∂z2 = (a2 - y) · σ'(z2)
    dW2 = a1.T @ dz2 / m
    db2 = np.sum(dz2, axis=0) / m
    
    # Hidden layer gradients: propagate the error back through W2,
    # then through the sigmoid derivative σ'(z1) = a1 · (1 - a1)
    dz1 = (dz2 @ W2.T) * a1 * (1 - a1)
    dW1 = x.T @ dz1 / m
    db1 = np.sum(dz1, axis=0) / m
    
    return dW1, db1, dW2, db2
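Analytic gradients like these are easy to get subtly wrong, so it's worth verifying them numerically. The sketch below (shapes and names are illustrative, not from any particular library) re-implements the passes above, writes out the implied loss, and compares one analytic gradient entry against a centered finite difference:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    z1 = x @ W1 + b1
    a1 = sigmoid(z1)
    z2 = a1 @ W2 + b2
    a2 = sigmoid(z2)
    return z1, a1, z2, a2

def backward(x, y, z1, a1, z2, a2, W2):
    m = x.shape[0]
    dz2 = (a2 - y) * a2 * (1 - a2)  # ∂L/∂z2 for MSE with sigmoid output
    dW2 = a1.T @ dz2 / m
    db2 = np.sum(dz2, axis=0) / m
    dz1 = (dz2 @ W2.T) * a1 * (1 - a1)
    dW1 = x.T @ dz1 / m
    db1 = np.sum(dz1, axis=0) / m
    return dW1, db1, dW2, db2

def loss(x, y, W1, b1, W2, b2):
    a2 = forward(x, W1, b1, W2, b2)[3]
    return 0.5 * np.sum((a2 - y) ** 2) / x.shape[0]

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                        # 4 samples, 3 features
y = rng.integers(0, 2, size=(4, 2)).astype(float)  # 2 output units
W1 = rng.normal(size=(3, 5)); b1 = np.zeros(5)
W2 = rng.normal(size=(5, 2)); b2 = np.zeros(2)

dW1, db1, dW2, db2 = backward(x, y, *forward(x, W1, b1, W2, b2), W2)

# Nudge a single weight up and down, and compare to the analytic gradient
eps = 1e-6
W1[0, 0] += eps; L_plus = loss(x, y, W1, b1, W2, b2)
W1[0, 0] -= 2 * eps; L_minus = loss(x, y, W1, b1, W2, b2)
W1[0, 0] += eps
numeric = (L_plus - L_minus) / (2 * eps)
assert abs(numeric - dW1[0, 0]) < 1e-8
```

This kind of gradient check is standard practice when implementing backpropagation by hand: if the analytic and numerical gradients disagree, there is a bug in the backward pass.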

Updating the Weights

Once we have the gradients, we update each weight by subtracting a small fraction (the learning rate) of the gradient. This nudges the weights in the direction that reduces the loss:

w = w - α · ∂L/∂w

where α is the learning rate (typically 0.001 to 0.01)

This process — forward pass, compute loss, backward pass, update weights — repeats thousands or millions of times during training. Each iteration brings the network's predictions closer to the desired output.
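Putting it all together, the loop below trains the two-layer network on XOR, the classic problem a single layer cannot solve. The hyperparameters here are illustrative: a learning rate well above the usual 0.001–0.01 range works on this toy problem because the gradients flowing through two sigmoids are small.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    z1 = x @ W1 + b1
    a1 = sigmoid(z1)
    z2 = a1 @ W2 + b2
    a2 = sigmoid(z2)
    return z1, a1, z2, a2

def backward(x, y, z1, a1, z2, a2, W2):
    m = x.shape[0]
    dz2 = (a2 - y) * a2 * (1 - a2)
    dW2 = a1.T @ dz2 / m
    db2 = np.sum(dz2, axis=0) / m
    dz1 = (dz2 @ W2.T) * a1 * (1 - a1)
    dW1 = x.T @ dz1 / m
    db1 = np.sum(dz1, axis=0) / m
    return dW1, db1, dW2, db2

# XOR dataset: 2 inputs, 1 output
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(42)
W1 = rng.normal(size=(2, 8)); b1 = np.zeros(8)  # 8 hidden units
W2 = rng.normal(size=(8, 1)); b2 = np.zeros(1)

lr = 2.0  # large rate for this tiny problem; real networks use far smaller ones
losses = []
for step in range(5000):
    # forward pass, compute loss, backward pass, update weights
    z1, a1, z2, a2 = forward(X, W1, b1, W2, b2)
    losses.append(0.5 * np.sum((a2 - y) ** 2) / X.shape[0])
    dW1, db1, dW2, db2 = backward(X, y, z1, a1, z2, a2, W2)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
```

Watching `losses` shrink over the iterations is the whole training story in miniature: each gradient step nudges the weights a little closer to a configuration that fits the data.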

Key Takeaways

  • Backpropagation uses the chain rule to efficiently compute gradients for all parameters.
  • Gradients flow backward from the loss through each layer, accumulating via multiplication.
  • The learning rate controls the step size — too large and training diverges, too small and it's painfully slow.
  • Modern frameworks like PyTorch handle backpropagation automatically via autograd, but understanding the math is invaluable.

What's Next?

Now that you understand backpropagation, explore how different optimizers like Adam and RMSprop improve upon vanilla gradient descent, or dive into our deep learning module to learn about CNNs and Transformers.