Reverse-mode Automatic Differentiation

Automatic differentiation (AD) is a general computational technique for evaluating exact derivatives of functions defined by computer programs. Unlike symbolic differentiation, AD does not manipulate algebraic expressions, and unlike finite differences, it does not rely on numerical approximations. Instead, it decomposes a computation into a sequence of elementary operations and systematically applies the chain rule to propagate derivatives through that sequence with machine precision. Conceptually, AD can be viewed as an efficient, programmatic implementation of the chain rule of calculus.


AD is central to modern machine learning frameworks such as PyTorch (autograd), TensorFlow (GradientTape) and JAX (grad). During the forward evaluation of a function, these libraries construct a computational graph that records intermediate quantities and their dependencies. In reverse-mode AD, this graph is then traversed in reverse to compute derivatives with respect to potentially millions of parameters.


Reverse-mode AD vs. backpropagation

Reverse-mode AD is particularly powerful when computing the gradient of a scalar-valued function with respect to many inputs, i.e. $f:\mathbb{R}^n\to\mathbb{R}$. Forward-mode AD is not considered here, but it has its own use cases (e.g., when the function has many outputs and few inputs). More broadly, AD refers to the general process of differentiation on computational graphs. Backpropagation typically refers to the specific application of reverse-mode AD in the setting of neural networks, where the graph is defined by the network architecture. Neural networks are well suited to this approach because training typically involves minimising a scalar loss function, such as the sum of squared errors or cross-entropy, with respect to a very large number of parameters.

A key practical result is that the cost of computing the full gradient by reverse-mode AD is comparable (up to a small constant factor) to evaluating the original function itself. This efficiency is what makes modern deep learning feasible.


Goal of this page

AD is often used as a black box, particularly in the context of backpropagation. However, it is worth understanding properly, at least once, how it works before relying on it routinely.


The goal of this page is to walk through a minimal example that illustrates the key ideas behind reverse-mode AD. We consider example 3 from the accompanying demonstration, namely $y = f(x_1,x_2) = (x_1-x_2)^2 + \cos(x_1+x_2)$. This is deliberately not a neural network, allowing us to focus on the general mechanism rather than a specific architecture.

Write $f$ as a sequence of elementary operations

Any function implemented on a computer is ultimately evaluated as a sequence of simple, elementary operations. To apply automatic differentiation, we must first express the function in this form. For $y=(x_1-x_2)^2+\cos(x_1+x_2)$, we introduce intermediate variables:

\[ t_1=x_1-x_2,\quad t_2=t_1^2,\quad t_3=x_1+x_2,\quad t_4=\cos(t_3),\quad y=t_2+t_4 \]

Each assignment represents a single elementary operation (addition, subtraction, squaring, cosine). The set of variables $(x_1,x_2,t_1,t_2,t_3,t_4,y)$ will be referred to as graph nodes. These nodes record every intermediate quantity computed during the forward evaluation of the function.
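As a concrete illustration, this decomposition can be written directly in code. The following is a minimal Python sketch (the function and variable names simply mirror the intermediate quantities above):

```python
import math

def forward(x1, x2):
    """Evaluate y = (x1 - x2)**2 + cos(x1 + x2) one elementary operation at a time."""
    t1 = x1 - x2        # subtraction
    t2 = t1 ** 2        # squaring
    t3 = x1 + x2        # addition
    t4 = math.cos(t3)   # cosine
    y = t2 + t4         # addition
    # Keep every intermediate value: reverse-mode AD will need them later.
    return {"t1": t1, "t2": t2, "t3": t3, "t4": t4, "y": y}

print(forward(1.0, 2.0)["y"])  # ≈ 0.01001
```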

Computational graph

The diagram below visualises how the function is evaluated on a computer.

[Computational graph: inputs $x_1$ and $x_2$ feed into $t_1 = x_1 - x_2$ and $t_3 = x_1 + x_2$; $t_1$ feeds into $t_2 = t_1^2$, $t_3$ feeds into $t_4 = \cos(t_3)$; $t_2$ and $t_4$ combine into the output $y = t_2 + t_4$.]

Each node corresponds to one intermediate variable, and each edge represents a dependency. The graph encodes the full structure of the computation. We say that $y$ is the child of $t_2$ and $t_4$, that $t_1$ is the parent of $t_2$, for example. Once this structure is known, differentiation becomes a systematic process of propagating sensitivities along the edges.
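One simple way to encode this structure in code is a mapping from each node to the nodes it is computed from (its parents, in this page's terminology). A depth-first pass then yields the order in which nodes will be visited during reverse differentiation. This is an illustrative sketch, not any particular framework's internal representation:

```python
# The graph above: node -> the nodes it is computed from
# ("parents" in this page's terminology).
parents = {
    "x1": [], "x2": [],
    "t1": ["x1", "x2"],
    "t2": ["t1"],
    "t3": ["x1", "x2"],
    "t4": ["t3"],
    "y":  ["t2", "t4"],
}

def reverse_order(graph, output="y"):
    """Order nodes so that every node appears before its parents."""
    order, seen = [], set()
    def visit(n):
        if n in seen:
            return
        seen.add(n)
        for p in graph[n]:
            visit(p)
        order.append(n)
    visit(output)
    return list(reversed(order))

print(reverse_order(parents))  # ['y', 't4', 't3', 't2', 't1', 'x2', 'x1']
```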

Differentiate by hand

Before automating the procedure, let us compute the gradient manually. We seek $\nabla y = (\frac{\partial y}{\partial x_1}, \frac{\partial y}{\partial x_2})$. Notice that $x_1$ influences $y$ through two distinct paths in the graph: $x_1 \to t_1 \to t_2 \to y$ and $x_1 \to t_3 \to t_4 \to y$. By the chain rule, the total derivative is the sum of the contributions from each path:

$$ \frac{\partial y}{\partial x_1} = \frac{\partial y}{\partial t_2}\frac{\partial t_2}{\partial t_1}\frac{\partial t_1}{\partial x_1} + \frac{\partial y}{\partial t_4}\frac{\partial t_4}{\partial t_3}\frac{\partial t_3}{\partial x_1} = 2t_1 -\sin t_3 = 2(x_1-x_2) - \sin(x_1+x_2) $$

The same reasoning applies in the $x_2$-direction, giving:

$$ \frac{\partial y}{\partial x_2} = \frac{\partial y}{\partial t_2}\frac{\partial t_2}{\partial t_1}\frac{\partial t_1}{\partial x_2} + \frac{\partial y}{\partial t_4}\frac{\partial t_4}{\partial t_3}\frac{\partial t_3}{\partial x_2} = -2t_1 -\sin t_3 = -2(x_1-x_2) - \sin(x_1+x_2) $$
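These hand-derived formulas can be sanity-checked numerically with central finite differences. Unlike AD, finite differences are only approximate, but they make a useful independent check:

```python
import math

def f(x1, x2):
    return (x1 - x2) ** 2 + math.cos(x1 + x2)

def grad_analytic(x1, x2):
    s = math.sin(x1 + x2)
    return (2 * (x1 - x2) - s, -2 * (x1 - x2) - s)

def grad_fd(x1, x2, h=1e-6):
    # Central differences: (f(x + h) - f(x - h)) / (2h) in each coordinate.
    d1 = (f(x1 + h, x2) - f(x1 - h, x2)) / (2 * h)
    d2 = (f(x1, x2 + h) - f(x1, x2 - h)) / (2 * h)
    return (d1, d2)

print(grad_analytic(1.0, 2.0))  # ≈ (-2.14112, 1.85888)
print(grad_fd(1.0, 2.0))        # agrees to high accuracy
```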

For a small graph like this, the computation is straightforward. But if the graph contained millions of nodes and deeply nested dependencies, manually enumerating all paths would be infeasible. Reverse-mode AD automates this process, as we will see.

Local derivatives

AD relies on knowing the local derivative of each elementary operation. These are simple, one-step derivatives. The local rules required for this example are:

$$ \frac{\partial (a+b)}{\partial a}=1,\ \frac{\partial (a+b)}{\partial b}=1,\ \frac{\partial (a-b)}{\partial a}=1,\ \frac{\partial (a-b)}{\partial b}=-1,\ \frac{\partial (a^2)}{\partial a}=2a,\ \frac{\partial \cos(a)}{\partial a}=-\sin(a). $$

Modern frameworks store these local rules internally (see, e.g., PyTorch’s source code). You can even write your own custom operations and specify their local derivatives (see, e.g., TensorFlow's RegisterGradient function). Reverse-mode AD simply applies these rules repeatedly, combining them according to the structure of the computational graph.
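A toy analogue of such an internal table might look as follows. This is purely illustrative: real frameworks register vector-Jacobian products keyed by operation, not plain lambdas, but the idea is the same.

```python
import math

# Each entry maps an elementary operation to the local derivatives
# with respect to each of its inputs.
LOCAL_GRADS = {
    "add":    lambda a, b: (1.0, 1.0),
    "sub":    lambda a, b: (1.0, -1.0),
    "square": lambda a: (2.0 * a,),
    "cos":    lambda a: (-math.sin(a),),
}

print(LOCAL_GRADS["square"](-1.0))  # (-2.0,)
```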

Where the autodiff update comes from

Let $g_n := \frac{\partial y}{\partial n}$ denote the gradient of the final output with respect to any node $n$. Suppose a node $p$ feeds into several children $c_1, \dots, c_m$. Because each child influences the final output, the variable $p$ affects $y$ through all downstream paths.

By the chain rule,

$$ \frac{\partial y}{\partial p} = \sum_{j=1}^{m} \frac{\partial y}{\partial c_j} \frac{\partial c_j}{\partial p}. $$

In gradient notation, this becomes

$$ g_p = \sum_{j=1}^{m} g_{c_j}\,\frac{\partial c_j}{\partial p}. $$

Two important ideas appear in this formula. First, the gradient of a node is a sum over all of its children: every downstream path contributes. Second, the sum can be built up incrementally. During the reverse traversal of the graph we process one child at a time, and each time we compute a contribution, we add it to the running total:

$$ g_p \mathrel{+}= g_{c_j} \cdot \frac{\partial c_j}{\partial p},\quad\quad \text{which means increment}\ g_p \leftarrow g_p + g_{c_j} \cdot \frac{\partial c_j}{\partial p}. $$

In summary, backpropagation is a recursive accumulation of (upstream gradient) $\times$ (local derivative) applied systematically from the output node back to the inputs.
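The whole mechanism fits in a few dozen lines. Below is a minimal tape-style implementation (a sketch under illustrative names, not a production design): each `Var` records its parents together with the local derivative of itself with respect to each parent, and `backward` visits the graph in reverse topological order, applying exactly the accumulation rule above.

```python
import math

class Var:
    """A value that remembers how it was computed."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs (parent Var, d(self)/d(parent))
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __sub__(self, other):
        return Var(self.value - other.value, [(self, 1.0), (other, -1.0)])

def square(v):
    return Var(v.value ** 2, [(v, 2.0 * v.value)])

def cos(v):
    return Var(math.cos(v.value), [(v, -math.sin(v.value))])

def backward(output):
    # Order nodes so that each node's gradient is complete before it is
    # propagated to its parents (reverse topological order).
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)
    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local  # g_p += g_c * dc/dp

x1, x2 = Var(1.0), Var(2.0)
y = square(x1 - x2) + cos(x1 + x2)
backward(y)
print(x1.grad, x2.grad)  # ≈ -2.14112, 1.85888
```

The topological ordering matters: propagating from a node before all of its children have contributed would double-count partial gradients in graphs where a node has several children.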

Numeric walkthrough (\(x_1=1,\ x_2=2\))

Forward values:

\[ t_1=-1,\quad t_2=1,\quad t_3=3,\quad t_4=\cos(3)\approx-0.98999,\quad y\approx0.01001 \]

Reverse accumulation:

Initialise: $g_y = 1$.
Process parents of $y$, i.e. $(t_2,t_4)$: $\quad g_{t_2} \mathrel{+}= g_y \cdot \frac{\partial y}{\partial t_2} = 1$ and $g_{t_4} \mathrel{+}= g_y \cdot \frac{\partial y}{\partial t_4} = 1$.
Process the parent of $t_2$, i.e. $t_1$: $\quad g_{t_1} \mathrel{+}= g_{t_2}\cdot\frac{\partial t_2}{\partial t_1} = 1\cdot (2t_1)= -2$.
Process the parent of $t_4$, i.e. $t_3$: $\quad g_{t_3} \mathrel{+}= g_{t_4}\cdot\frac{\partial t_4}{\partial t_3} = 1\cdot(-\sin t_3)\approx-0.14112$.
Process parents of $t_1$, i.e. $(x_1,x_2)$: $\quad g_{x_1} \mathrel{+}= g_{t_1}\cdot \frac{\partial t_1}{\partial x_1} = -2\cdot 1=-2$ and $g_{x_2} \mathrel{+}= g_{t_1}\cdot \frac{\partial t_1}{\partial x_2} = -2\cdot(-1)=2$.
Process parents of $t_3$, i.e. $(x_1,x_2)$: $\quad g_{x_1} \mathrel{+}= g_{t_3}\cdot \frac{\partial t_3}{\partial x_1} \approx -0.14112$ and $g_{x_2} \mathrel{+}= g_{t_3}\cdot \frac{\partial t_3}{\partial x_2} \approx -0.14112$.
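These steps translate line for line into code (a plain-Python transcription of the walkthrough, with a dictionary holding the running gradient totals):

```python
import math

# Reproduce the walkthrough with explicit accumulation (x1 = 1, x2 = 2).
x1, x2 = 1.0, 2.0
t1, t3 = x1 - x2, x1 + x2
t2, t4 = t1 ** 2, math.cos(t3)
y = t2 + t4

g = {name: 0.0 for name in ("x1", "x2", "t1", "t2", "t3", "t4")}
g["y"] = 1.0                          # initialise
g["t2"] += g["y"] * 1.0               # dy/dt2 = 1
g["t4"] += g["y"] * 1.0               # dy/dt4 = 1
g["t1"] += g["t2"] * (2 * t1)         # dt2/dt1 = 2*t1
g["t3"] += g["t4"] * (-math.sin(t3))  # dt4/dt3 = -sin(t3)
g["x1"] += g["t1"] * 1.0              # dt1/dx1 = 1
g["x2"] += g["t1"] * (-1.0)           # dt1/dx2 = -1
g["x1"] += g["t3"] * 1.0              # dt3/dx1 = 1
g["x2"] += g["t3"] * 1.0              # dt3/dx2 = 1
print(g["x1"], g["x2"])  # ≈ -2.14112, 1.85888
```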

Final gradients:

\[ \frac{\partial y}{\partial x_1} = g_{x_1} \approx -2.14112,\qquad \frac{\partial y}{\partial x_2} = g_{x_2}\approx 1.85888 \]

Comparing this with the analytical gradients computed at the beginning, $$\nabla y(1,2) = (2(1-2) - \sin(1+2), -2(1-2) - \sin(1+2)) = (-2 - \sin(3), 2 - \sin(3)) \approx (-2.14112, 1.85888),$$ we achieve the same result.

In the accompanying demonstration, you can run both the forward and backward passes interactively. Try to follow the calculations at each step and verify that the final gradients match what you get from the analytical formulas. You can also experiment with gradient descent: update the input variables in the direction of the negative gradient and observe how the output changes. This provides an illustration of how gradients guide optimisation.
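For the curious, here is what that gradient-descent experiment looks like in code, using the analytical gradient derived earlier. The step size and iteration count are arbitrary illustrative choices:

```python
import math

def f(x1, x2):
    return (x1 - x2) ** 2 + math.cos(x1 + x2)

def grad(x1, x2):
    s = math.sin(x1 + x2)
    return 2 * (x1 - x2) - s, -2 * (x1 - x2) - s

# Plain gradient descent: repeatedly step against the gradient.
x1, x2, lr = 1.0, 2.0, 0.1
for _ in range(50):
    g1, g2 = grad(x1, x2)
    x1, x2 = x1 - lr * g1, x2 - lr * g2
print(f(x1, x2))  # approaches the global minimum value of -1
```

Starting from $(1, 2)$, the iterates drive $x_1 - x_2$ to $0$ and $x_1 + x_2$ to $\pi$, where the function attains its minimum value of $-1$.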