Backpropagation in Deep Reinforcement Learning

This page was last edited on 12 November 2025

Backpropagation is the algorithm that computes how much each weight in a neural network contributed to the error, so that the weights can be updated. In Deep Reinforcement Learning (Deep RL), we use neural networks to estimate value functions or policies. Backpropagation supplies the gradients used to adjust the network’s parameters and reduce the error between predicted values and target values.

Why do we use backpropagation in Deep RL?

Deep RL relies on deep neural networks, and the weights of these networks must be updated based on feedback (rewards, TD errors, etc.). Backpropagation computes the gradients that let us minimize the difference between what the network predicts and what it should have predicted. It’s how the agent improves.

Backpropagation is part of Reinforcement Learning

RL is more than just training a network. It’s about exploration, delayed rewards, and learning through trial and error. Backpropagation handles the learning part. RL handles the decision-making.

Why it matters in robotics

Backpropagation lets robots learn behaviors. A robot arm learns to grasp. A mobile robot learns to navigate. The network gets better by turning experience into better predictions — one gradient at a time.


ANALOGY

Think of a thermostat that adjusts heating. If the room is too cold, it increases the heat. The further the temperature is from the target, the more it adjusts. Backpropagation works similarly: the more wrong the prediction, the larger the gradient and the more the weights are adjusted.


Backpropagation equation(s)

The core idea is to compute the gradient of a loss function with respect to each weight.

Key equations:

1. Loss

    \[ \hspace{5mm} \fbox{     \begin{array}{c}         \vspace{5mm} \\         \displaystyle          L = \frac{1}{2} \left( y_{\text{true}} - y_{\text{pred}} \right)^2 \\         \vspace{5mm}     \end{array} } \hspace{5mm} \]

Where:

  • y_true: target value
  • y_pred: predicted value

2. Gradient of loss w.r.t. prediction

    \[ \hspace{5mm} \fbox{     \begin{array}{c}         \vspace{5mm} \\         \displaystyle          \frac{dL}{dy_{\text{pred}}} = y_{\text{pred}} - y_{\text{true}} \\         \vspace{5mm}     \end{array} } \hspace{5mm} \]

Where:

  • L: loss (here, the halved squared error defined above)
  • y_true: target value
  • y_pred: predicted value
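The derivative above can be sanity-checked numerically. The following is a minimal sketch, not part of the original page: the function names (`loss`, `loss_grad`, `numeric_grad`) and the step size `eps` are illustrative assumptions. It compares the analytic derivative y_pred − y_true with a central finite-difference estimate.

```python
def loss(y_true, y_pred):
    """Halved squared error: L = 1/2 * (y_true - y_pred)^2."""
    return 0.5 * (y_true - y_pred) ** 2


def loss_grad(y_true, y_pred):
    """Analytic derivative dL/dy_pred = y_pred - y_true."""
    return y_pred - y_true


def numeric_grad(y_true, y_pred, eps=1e-6):
    """Central finite-difference estimate of dL/dy_pred (eps is an assumed step size)."""
    return (loss(y_true, y_pred + eps) - loss(y_true, y_pred - eps)) / (2 * eps)


if __name__ == "__main__":
    y_true, y_pred = 1.0, 0.5  # illustrative values
    print("analytic:", loss_grad(y_true, y_pred))
    print("numeric: ", numeric_grad(y_true, y_pred))
```

The two values should agree up to floating-point error, which is a common way to catch sign mistakes in a hand-derived gradient.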

3. Gradient of loss w.r.t. weights

Using the chain rule:

    \[ \hspace{5mm} \fbox{     \begin{array}{c}         \vspace{5mm} \\         \displaystyle          \frac{dL}{dw} = \frac{dL}{dy} \cdot \frac{dy}{dz} \cdot \frac{dz}{dw} \\         \vspace{5mm}     \end{array} } \hspace{5mm} \]

Where:

  • L: loss (e.g., squared error)
  • w: weight we want to update
  • z: weighted sum before activation, i.e., z = w⋅x + b
  • y: output after activation, i.e., y = f(z)

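Each factor is a local sensitivity: dz/dw tells how much the weighted sum changes with the weight, dy/dz how much the output changes with the weighted sum, and dL/dy how much the loss changes with the output. Their product tells how much a small change in the weight changes the loss.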
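To make the chain rule concrete, here is a small sketch for a single neuron. The page does not specify the activation f, so a sigmoid is assumed here purely for illustration; the function names and the input values are likewise hypothetical. The three factors are computed separately, multiplied, and compared with a finite-difference estimate.

```python
import math


def sigmoid(z):
    """Assumed activation f(z); the original text leaves f unspecified."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(w, b, x):
    """Return the pre-activation z = w*x + b and the output y = f(z)."""
    z = w * x + b
    return z, sigmoid(z)


def loss(w, b, x, y_true):
    """Halved squared error between target and neuron output."""
    _, y = forward(w, b, x)
    return 0.5 * (y_true - y) ** 2


def grad_w(w, b, x, y_true):
    """dL/dw as the product of the three chain-rule factors."""
    _, y = forward(w, b, x)
    dL_dy = y - y_true        # derivative of the loss w.r.t. the output
    dy_dz = y * (1.0 - y)     # derivative of the sigmoid w.r.t. z
    dz_dw = x                 # derivative of w*x + b w.r.t. w
    return dL_dy * dy_dz * dz_dw


def numeric_grad_w(w, b, x, y_true, eps=1e-6):
    """Central finite-difference estimate of dL/dw, used only as a check."""
    return (loss(w + eps, b, x, y_true) - loss(w - eps, b, x, y_true)) / (2 * eps)


if __name__ == "__main__":
    w, b, x, y_true = 0.5, 0.0, 1.0, 1.0  # illustrative values
    print("chain rule:", grad_w(w, b, x, y_true))
    print("numeric:   ", numeric_grad_w(w, b, x, y_true))
```

In a deeper network the same pattern repeats layer by layer, with each layer passing its dL/dz backwards to the layer before it.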


Setup:

  • One neuron with no bias and no activation: y = w * x
  • Initial weight w = 0.5
  • Input x = 1.0
  • Target y_true = 1.0
  • Learning rate α = 0.1

We use:

    \[ \hspace{5mm} \fbox{     \begin{array}{l}         \vspace{2mm} \\         \displaystyle y_{\text{pred}} = w \cdot x \\\\         \displaystyle L = \frac{1}{2} (y_{\text{true}} - y_{\text{pred}})^2 \\\\         \displaystyle \frac{dL}{dw} = (y_{\text{pred}} - y_{\text{true}}) \cdot x \\\\         \displaystyle w = w - \alpha \cdot \frac{dL}{dw} \\\\         \vspace{2mm}     \end{array} } \hspace{5mm} \]

ITERATION 1

  • y_pred = 0.5 * 1.0 = 0.5
  • Error = 1/2 * (0.5 − 1.0)^2 = 1/2 * 0.25 = 0.125
  • Gradient = (0.5 − 1.0) * 1.0 = −0.5
  • Weight update = 0.5 − 0.1 * (−0.5) = 0.5 + 0.05 = 0.55

ITERATION 2

  • y_pred = 0.55 * 1.0 = 0.55
  • Error = 1/2 * (0.55 − 1.0)^2 = 1/2 * 0.2025 = 0.10125
  • Gradient = (0.55 − 1.0) * 1.0 = −0.45
  • Weight update = 0.55 − 0.1 * (−0.45) = 0.55 + 0.045 = 0.595

ITERATION 3

  • y_pred = 0.595 * 1.0 = 0.595
  • Error = 1/2 * (0.595 − 1.0)^2 = 1/2 * 0.164025 = 0.0820125
  • Gradient = (0.595 − 1.0) * 1.0 = −0.405
  • Weight update = 0.595 − 0.1 * (−0.405) = 0.595 + 0.0405 = 0.6355

ITERATION 4

  • y_pred = 0.6355 * 1.0 = 0.6355
  • Error = 1/2 * (0.6355 − 1.0)^2 = 1/2 * 0.13286025 = 0.066430125
  • Gradient = (0.6355 − 1.0) * 1.0 = −0.3645
  • Weight update = 0.6355 − 0.1 * (−0.3645) = 0.6355 + 0.03645 = 0.67195

ITERATION 5

  • y_pred = 0.67195 * 1.0 = 0.67195
  • Error = 1/2 * (0.67195 − 1.0)^2 ≈ 1/2 * 0.1076168 ≈ 0.0538084
  • Gradient = (0.67195 − 1.0) * 1.0 = −0.32805
  • Weight update = 0.67195 − 0.1 * (−0.32805) = 0.67195 + 0.032805 = 0.704755
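The five iterations above can be reproduced with a few lines of code. This is a minimal sketch under the stated setup (x = 1.0, target 1.0, learning rate 0.1, and the starting weight 0.5 used in iteration 1); the function name `train` and the tuple layout of its return value are illustrative choices, not something the page prescribes.

```python
def train(w=0.5, x=1.0, y_true=1.0, alpha=0.1, steps=5):
    """Run plain gradient descent on one linear neuron y = w * x.

    Returns a list of (step, y_pred, loss, grad, new_w) tuples.
    """
    history = []
    for step in range(1, steps + 1):
        y_pred = w * x                        # forward pass
        loss = 0.5 * (y_true - y_pred) ** 2   # halved squared error
        grad = (y_pred - y_true) * x          # dL/dw via the chain rule
        w = w - alpha * grad                  # gradient descent update
        history.append((step, y_pred, loss, grad, w))
    return history


if __name__ == "__main__":
    for step, y_pred, loss, grad, w in train():
        print(f"iter {step}: y_pred={y_pred:.6f} loss={loss:.7f} "
              f"grad={grad:.6f} new w={w:.6f}")
```

Running the loop for more steps shows the weight approaching 1.0, the value at which the prediction matches the target.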

[Figure: Backpropagation Loss Over 5 Iterations]

The above graph shows the loss value across 5 training iterations using backpropagation in a simple neural network.

We observe a steady decrease in loss from 0.125 down to 0.0538. This means that the network is learning. Each time the weight is updated using backpropagation, the output gets closer to the target, and the error becomes smaller.
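For this particular setup (x = 1, y_true = 1, α = 0.1) the rate of decrease can be worked out in closed form, which gives a quick check on the numbers. The symbol e_k for the weight error at iteration k is introduced here only for illustration:

    \[ e_k = w_k - 1, \qquad e_{k+1} = w_k - \alpha \, (w_k - 1) - 1 = (1 - \alpha) \, e_k = 0.9 \, e_k, \qquad L_{k+1} = \frac{1}{2} e_{k+1}^2 = 0.81 \, L_k \]

So each iteration multiplies the loss by 0.81, and after four updates the loss is 0.125 × 0.81^4 ≈ 0.0538, matching the value reported for iteration 5.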



