Reverse-mode Automatic Differentiation

Automatic differentiation (AD) is a general computational technique for evaluating exact derivatives of functions defined by computer programs. Unlike symbolic differentiation, AD does not manipulate algebraic expressions, and unlike finite differences, it does not rely on numerical approximations. Instead, it decomposes a computation into a sequence of elementary operations and systematically applies the chain rule to propagate derivatives through that sequence with machine precision. Conceptually, AD can be viewed as an efficient, programmatic implementation of the chain rule of calculus.


AD is central to modern machine learning frameworks such as PyTorch (autograd), TensorFlow (GradientTape) and JAX (grad). During the forward evaluation of a function, these libraries construct a computational graph that records intermediate quantities and their dependencies. In reverse-mode AD, this graph is then traversed in reverse to compute derivatives with respect to potentially millions of parameters.


Reverse-mode AD vs. backpropagation

Reverse-mode AD is particularly powerful when computing the gradient of a scalar-valued function with respect to many inputs, i.e. $f:\mathbb{R}^n\to\mathbb{R}$. Forward-mode AD is not considered here, but it has its own use cases (e.g., when the function has many outputs and few inputs). More broadly, AD refers to the general process of differentiation on computational graphs. Backpropagation typically refers to the specific application of reverse-mode AD in the setting of neural networks, where the graph is defined by the network architecture. Neural networks are well suited to this approach because training typically involves minimising a scalar loss function, such as the sum of squared errors or cross-entropy, with respect to a very large number of parameters.

A key practical result is that the cost of computing the full gradient by reverse-mode AD is comparable (up to a small constant factor) to evaluating the original function itself. This efficiency is what makes modern deep learning feasible.


Goal of this page

AD is often used as a black box, particularly in the context of backpropagation. However, it is worth understanding properly, at least once, how it works before relying on it routinely.


The goal of this page is to walk through a minimal example that illustrates the key ideas behind reverse-mode AD. We consider example 3 from the accompanying demonstration, namely $y = f(x_1,x_2) = (x_1-x_2)^2 + \cos(x_1+x_2)$. This is deliberately not a neural network, allowing us to focus on the general mechanism rather than a specific architecture.

Write $f$ as a sequence of elementary operations

Any function implemented on a computer is ultimately evaluated as a sequence of simple, elementary operations. To apply automatic differentiation, we must first express the function in this form. For $y=(x_1-x_2)^2+\cos(x_1+x_2)$, we introduce intermediate variables:

\[ t_1=x_1-x_2,\quad t_2=t_1^2,\quad t_3=x_1+x_2,\quad t_4=\cos(t_3),\quad y=t_2+t_4 \]

Each assignment represents a single elementary operation (addition, subtraction, squaring, cosine). The set of variables $(x_1,x_2,t_1,t_2,t_3,t_4,y)$ will be referred to as graph nodes. These nodes record every intermediate quantity computed during the forward evaluation of the function.
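As a concrete illustration, this decomposition can be written directly in code. The following is a minimal Python sketch (the function and variable names simply mirror the intermediate quantities above):

```python
import math

def forward(x1, x2):
    """Evaluate y = (x1 - x2)**2 + cos(x1 + x2) one elementary operation at a time."""
    t1 = x1 - x2        # subtraction
    t2 = t1 ** 2        # squaring
    t3 = x1 + x2        # addition
    t4 = math.cos(t3)   # cosine
    y = t2 + t4         # addition
    # Keep every intermediate value: reverse-mode AD will need them later.
    return {"t1": t1, "t2": t2, "t3": t3, "t4": t4, "y": y}

print(forward(1.0, 2.0)["y"])  # ≈ 0.01001
```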

Computational graph

The diagram below visualises how the function is evaluated on a computer.

[Computational graph: inputs $x_1$ and $x_2$ feed into $t_1 = x_1 - x_2$ and $t_3 = x_1 + x_2$; $t_1$ feeds into $t_2 = t_1^2$, $t_3$ feeds into $t_4 = \cos(t_3)$; $t_2$ and $t_4$ combine into the output $y = t_2 + t_4$.]

Each node corresponds to one intermediate variable, and each edge represents a dependency. The graph encodes the full structure of the computation. We say that $y$ is the child of $t_2$ and $t_4$, that $t_1$ is the parent of $t_2$, for example. Once this structure is known, differentiation becomes a systematic process of propagating sensitivities along the edges.
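One simple way to encode this structure in code is a mapping from each node to the nodes it is computed from (its parents, in this page's terminology). A depth-first pass then yields the order in which nodes will be visited during reverse differentiation. This is an illustrative sketch, not any particular framework's internal representation:

```python
# The graph above: node -> the nodes it is computed from
# ("parents" in this page's terminology).
parents = {
    "x1": [], "x2": [],
    "t1": ["x1", "x2"],
    "t2": ["t1"],
    "t3": ["x1", "x2"],
    "t4": ["t3"],
    "y":  ["t2", "t4"],
}

def reverse_order(graph, output="y"):
    """Order nodes so that every node appears before its parents."""
    order, seen = [], set()
    def visit(n):
        if n in seen:
            return
        seen.add(n)
        for p in graph[n]:
            visit(p)
        order.append(n)
    visit(output)
    return list(reversed(order))

print(reverse_order(parents))  # ['y', 't4', 't3', 't2', 't1', 'x2', 'x1']
```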

Differentiate by hand

Before automating the procedure, let us compute the gradient manually. We seek $\nabla y = (\frac{\partial y}{\partial x_1}, \frac{\partial y}{\partial x_2})$. Notice that $x_1$ influences $y$ through two distinct paths in the graph: $x_1 \to t_1 \to t_2 \to y$ and $x_1 \to t_3 \to t_4 \to y$. By the chain rule, the total derivative is the sum of the contributions from each path:

$$ \frac{\partial y}{\partial x_1} = \frac{\partial y}{\partial t_2}\frac{\partial t_2}{\partial t_1}\frac{\partial t_1}{\partial x_1} + \frac{\partial y}{\partial t_4}\frac{\partial t_4}{\partial t_3}\frac{\partial t_3}{\partial x_1} = 2t_1 -\sin t_3 = 2(x_1-x_2) - \sin(x_1+x_2) $$

The same reasoning applies in the $x_2$-direction, giving:

$$ \frac{\partial y}{\partial x_2} = \frac{\partial y}{\partial t_2}\frac{\partial t_2}{\partial t_1}\frac{\partial t_1}{\partial x_2} + \frac{\partial y}{\partial t_4}\frac{\partial t_4}{\partial t_3}\frac{\partial t_3}{\partial x_2} = -2t_1 -\sin t_3 = -2(x_1-x_2) - \sin(x_1+x_2) $$
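These hand-derived formulas can be sanity-checked numerically with central finite differences. Unlike AD, finite differences are only approximate, but they make a useful independent check:

```python
import math

def f(x1, x2):
    return (x1 - x2) ** 2 + math.cos(x1 + x2)

def grad_analytic(x1, x2):
    s = math.sin(x1 + x2)
    return (2 * (x1 - x2) - s, -2 * (x1 - x2) - s)

def grad_fd(x1, x2, h=1e-6):
    # Central differences: (f(x + h) - f(x - h)) / (2h) in each coordinate.
    d1 = (f(x1 + h, x2) - f(x1 - h, x2)) / (2 * h)
    d2 = (f(x1, x2 + h) - f(x1, x2 - h)) / (2 * h)
    return (d1, d2)

print(grad_analytic(1.0, 2.0))  # ≈ (-2.14112, 1.85888)
print(grad_fd(1.0, 2.0))        # agrees to high accuracy
```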

For a small graph like this, the computation is straightforward. But if the graph contained millions of nodes and deeply nested dependencies, manually enumerating all paths would be infeasible. Reverse-mode AD automates this process, as we will see.

Local derivatives

AD relies on knowing the local derivative of each elementary operation. These are simple, one-step derivatives. The local rules required for this example are:

$$ \frac{\partial (a+b)}{\partial a}=1,\ \frac{\partial (a+b)}{\partial b}=1,\ \frac{\partial (a-b)}{\partial a}=1,\ \frac{\partial (a-b)}{\partial b}=-1,\ \frac{\partial (a^2)}{\partial a}=2a,\ \frac{\partial \cos(a)}{\partial a}=-\sin(a). $$

Modern frameworks store these local rules internally (see, e.g., PyTorch’s source code). You can even write your own custom operations and specify their local derivatives (see, e.g., TensorFlow's RegisterGradient function). Reverse-mode AD simply applies these rules repeatedly, combining them according to the structure of the computational graph.
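A toy analogue of such an internal table might look as follows. This is purely illustrative: real frameworks register vector-Jacobian products keyed by operation, not plain lambdas, but the idea is the same.

```python
import math

# Each entry maps an elementary operation to the local derivatives
# with respect to each of its inputs.
LOCAL_GRADS = {
    "add":    lambda a, b: (1.0, 1.0),
    "sub":    lambda a, b: (1.0, -1.0),
    "square": lambda a: (2.0 * a,),
    "cos":    lambda a: (-math.sin(a),),
}

print(LOCAL_GRADS["square"](-1.0))  # (-2.0,)
```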

Where the autodiff update comes from

Let $g_n := \frac{\partial y}{\partial n}$ denote the gradient of the final output with respect to any node $n$. Suppose a node $p$ feeds into several children $c_1, \dots, c_m$. Because each child influences the final output, the variable $p$ affects $y$ through all downstream paths.

By the chain rule,

$$ \frac{\partial y}{\partial p} = \sum_{j=1}^{m} \frac{\partial y}{\partial c_j} \frac{\partial c_j}{\partial p}. $$

In gradient notation, this becomes

$$ g_p = \sum_{j=1}^{m} g_{c_j}\,\frac{\partial c_j}{\partial p}. $$

Two important ideas appear in this formula. First, the gradient of a node is a sum over all of its children: every downstream path contributes. Second, the sum can be built up incrementally. During the reverse traversal of the graph we process one child at a time, and each time we compute a contribution, we add it to the running total:

$$ g_p \mathrel{+}= g_{c_j} \cdot \frac{\partial c_j}{\partial p},\quad\quad \text{which means increment}\ g_p \leftarrow g_p + g_{c_j} \cdot \frac{\partial c_j}{\partial p}. $$

In summary, backpropagation is a recursive accumulation of (upstream gradient) $\times$ (local derivative) applied systematically from the output node back to the inputs.
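The whole mechanism fits in a few dozen lines. Below is a minimal tape-style implementation (a sketch under illustrative names, not a production design): each `Var` records its parents together with the local derivative of itself with respect to each parent, and `backward` visits the graph in reverse topological order, applying exactly the accumulation rule above.

```python
import math

class Var:
    """A value that remembers how it was computed."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs (parent Var, d(self)/d(parent))
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __sub__(self, other):
        return Var(self.value - other.value, [(self, 1.0), (other, -1.0)])

def square(v):
    return Var(v.value ** 2, [(v, 2.0 * v.value)])

def cos(v):
    return Var(math.cos(v.value), [(v, -math.sin(v.value))])

def backward(output):
    # Order nodes so that each node's gradient is complete before it is
    # propagated to its parents (reverse topological order).
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)
    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local  # g_p += g_c * dc/dp

x1, x2 = Var(1.0), Var(2.0)
y = square(x1 - x2) + cos(x1 + x2)
backward(y)
print(x1.grad, x2.grad)  # ≈ -2.14112, 1.85888
```

The topological ordering matters: propagating from a node before all of its children have contributed would double-count partial gradients in graphs where a node has several children.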

Numeric walkthrough (\(x_1=1,\ x_2=2\))

Forward values:

\[ t_1=-1,\quad t_2=1,\quad t_3=3,\quad t_4=\cos(3)\approx-0.98999,\quad y\approx0.01001 \]

Reverse accumulation:

Initialise: $g_y = 1$.
Process parents of $y$, i.e. $(t_2,t_4)$: $\quad g_{t_2} \mathrel{+}= g_y \cdot \frac{\partial y}{\partial t_2} = 1$ and $g_{t_4} \mathrel{+}= g_y \cdot \frac{\partial y}{\partial t_4} = 1$.
Process the parent of $t_2$, i.e. $t_1$: $\quad g_{t_1} \mathrel{+}= g_{t_2}\cdot\frac{\partial t_2}{\partial t_1} = 1\cdot (2t_1)= -2$.
Process the parent of $t_4$, i.e. $t_3$: $\quad g_{t_3} \mathrel{+}= g_{t_4}\cdot\frac{\partial t_4}{\partial t_3} = 1\cdot(-\sin t_3)\approx-0.14112$.
Process parents of $t_1$, i.e. $(x_1,x_2)$: $\quad g_{x_1} \mathrel{+}= g_{t_1}\cdot \frac{\partial t_1}{\partial x_1} = -2\cdot 1=-2$ and $g_{x_2} \mathrel{+}= g_{t_1}\cdot \frac{\partial t_1}{\partial x_2} = -2\cdot(-1)=2$.
Process parents of $t_3$, i.e. $(x_1,x_2)$: $\quad g_{x_1} \mathrel{+}= g_{t_3}\cdot \frac{\partial t_3}{\partial x_1} \approx -0.14112$ and $g_{x_2} \mathrel{+}= g_{t_3}\cdot \frac{\partial t_3}{\partial x_2} \approx -0.14112$.
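These steps translate line for line into code (a plain-Python transcription of the walkthrough, with a dictionary holding the running gradient totals):

```python
import math

# Reproduce the walkthrough with explicit accumulation (x1 = 1, x2 = 2).
x1, x2 = 1.0, 2.0
t1, t3 = x1 - x2, x1 + x2
t2, t4 = t1 ** 2, math.cos(t3)
y = t2 + t4

g = {name: 0.0 for name in ("x1", "x2", "t1", "t2", "t3", "t4")}
g["y"] = 1.0                          # initialise
g["t2"] += g["y"] * 1.0               # dy/dt2 = 1
g["t4"] += g["y"] * 1.0               # dy/dt4 = 1
g["t1"] += g["t2"] * (2 * t1)         # dt2/dt1 = 2*t1
g["t3"] += g["t4"] * (-math.sin(t3))  # dt4/dt3 = -sin(t3)
g["x1"] += g["t1"] * 1.0              # dt1/dx1 = 1
g["x2"] += g["t1"] * (-1.0)           # dt1/dx2 = -1
g["x1"] += g["t3"] * 1.0              # dt3/dx1 = 1
g["x2"] += g["t3"] * 1.0              # dt3/dx2 = 1
print(g["x1"], g["x2"])  # ≈ -2.14112, 1.85888
```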

Final gradients:

\[ \frac{\partial y}{\partial x_1} = g_{x_1} \approx -2.14112,\qquad \frac{\partial y}{\partial x_2} = g_{x_2}\approx 1.85888 \]

Comparing this with the analytical gradients computed at the beginning, $$\nabla y(1,2) = (2(1-2) - \sin(1+2), -2(1-2) - \sin(1+2)) = (-2 - \sin(3), 2 - \sin(3)) \approx (-2.14112, 1.85888),$$ we achieve the same result.

In the accompanying demonstration, you can run both the forward and backward passes interactively. Try to follow the calculations at each step and verify that the final gradients match what you get from the analytical formulas. You can also experiment with gradient descent: update the input variables in the direction of the negative gradient and observe how the output changes. This provides an illustration of how gradients guide optimisation.
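For the curious, here is what that gradient-descent experiment looks like in code, using the analytical gradient derived earlier. The step size and iteration count are arbitrary illustrative choices:

```python
import math

def f(x1, x2):
    return (x1 - x2) ** 2 + math.cos(x1 + x2)

def grad(x1, x2):
    s = math.sin(x1 + x2)
    return 2 * (x1 - x2) - s, -2 * (x1 - x2) - s

# Plain gradient descent: repeatedly step against the gradient.
x1, x2, lr = 1.0, 2.0, 0.1
for _ in range(50):
    g1, g2 = grad(x1, x2)
    x1, x2 = x1 - lr * g1, x2 - lr * g2
print(f(x1, x2))  # approaches the global minimum value of -1
```

Starting from $(1, 2)$, the iterates drive $x_1 - x_2$ to $0$ and $x_1 + x_2$ to $\pi$, where the function attains its minimum value of $-1$.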