This page was last edited on 12 November 2025
Backpropagation is the method used to compute the gradients that update the weights in a neural network. In Deep Reinforcement Learning (Deep RL), we use neural networks to estimate value functions or policies. Backpropagation provides the gradients needed to adjust the network’s parameters and reduce the error between predicted values and target values.
Why do we use backpropagation in Deep RL?
Because Deep RL uses deep neural networks, we need to update the weights of these networks based on feedback (rewards, TD-errors, etc.). Backpropagation helps minimize the difference between what the network predicts and what it should have predicted. It’s how the agent improves.
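To make "feedback" concrete, here is a minimal sketch of a single update driven by a TD-error. It is an illustration under stated assumptions, not a method taken from this article: the value estimate is a hypothetical one-weight linear model V(s) = w * feature(s), and the function name `td_update`, the feature values, the discount factor `gamma`, and the step size `alpha` are all made up for the example.

```python
# Hypothetical one-weight value estimate: V(s) = w * feature(s).
def td_update(w, feature_s, feature_next, reward, gamma=0.9, alpha=0.1):
    """One semi-gradient TD(0) step for a linear value estimate (illustrative only)."""
    v_s = w * feature_s                  # predicted value of the current state
    v_next = w * feature_next            # predicted value of the next state
    td_target = reward + gamma * v_next  # what the prediction "should have been"
    td_error = td_target - v_s           # feedback signal
    # Gradient of L = 0.5 * td_error**2 w.r.t. w, holding the target constant.
    grad = -td_error * feature_s
    return w - alpha * grad, td_error

# Assumed numbers: the next state has feature 0.0, so the target equals the reward.
w, td_error = td_update(w=0.5, feature_s=1.0, feature_next=0.0, reward=1.0)
print(w, td_error)
```

In a deep network the single line computing `grad` is replaced by backpropagation through all layers, but the role of the TD-error as the error signal stays the same.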
Backpropagation is part of Reinforcement Learning
RL is more than just training a network. It’s about exploration, delayed rewards, and learning through trial and error. Backpropagation handles the learning part. RL handles the decision-making.
Why it matters in robotics
Backpropagation lets robots learn behaviors. A robot arm learns to grasp. A mobile robot learns to navigate. The network gets better by turning experience into better predictions — one gradient at a time.
ANALOGY
We can think of a thermostat that adjusts heating. If the room is too cold, it increases heat. The more off-target the temperature, the more it adjusts. Backpropagation works similarly: the more wrong the prediction, the more it adjusts the weights.
Backpropagation equation(s)
The core idea is to compute the gradient of a loss function with respect to each weight.
Key equations:
1. Loss
\[
L = \frac{1}{2} \left( y_{\text{true}} - y_{\text{pred}} \right)^2
\]
Where:
- y_true: target value
- y_pred: predicted value
2. Gradient of loss w.r.t. prediction
\[
\frac{dL}{dy_{\text{pred}}} = y_{\text{pred}} - y_{\text{true}}
\]
Where:
- L: loss (e.g., mean squared error)
- y_true: target value
- y_pred: predicted value
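The order of the terms flips between the loss and its gradient. A short derivation, using only the symbols defined above, shows where the sign comes from: differentiating the inner term with respect to the prediction contributes a factor of −1.

\[
\frac{dL}{dy_{\text{pred}}}
= \frac{d}{dy_{\text{pred}}} \left[ \frac{1}{2} \left( y_{\text{true}} - y_{\text{pred}} \right)^2 \right]
= \left( y_{\text{true}} - y_{\text{pred}} \right) \cdot (-1)
= y_{\text{pred}} - y_{\text{true}}
\]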
3. Gradient of loss w.r.t. weights
Using the chain rule:
\[
\frac{dL}{dw} = \frac{dL}{dy} \cdot \frac{dy}{dz} \cdot \frac{dz}{dw}
\]
Where:
- L: loss (e.g., mean squared error)
- w: weight we want to update
- z: weighted sum before activation (i.e., z=w⋅x+b)
- y: output after activation, i.e., y=f(z)
Each factor in this product tells how much a small change in one quantity changes the next: the weight changes z, z changes the output y, and y changes the loss.
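The chain rule above can be sketched in a few lines of Python. This is a minimal illustration, not the article's own implementation: the activation f is assumed to be a sigmoid (the text does not name one), and the numeric values for w, b, x, and y_true are chosen only for the demonstration. A finite-difference estimate is included as a sanity check on the analytic gradient.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_and_grad(w, b, x, y_true):
    """Forward pass, then the three chain-rule factors for one neuron."""
    z = w * x + b                   # weighted sum before activation
    y = sigmoid(z)                  # output after activation, y = f(z) (sigmoid assumed)
    loss = 0.5 * (y_true - y) ** 2  # squared-error loss
    dL_dy = y - y_true              # gradient of loss w.r.t. prediction
    dy_dz = y * (1.0 - y)           # derivative of the sigmoid at z
    dz_dw = x                       # derivative of w*x + b w.r.t. w
    return loss, dL_dy * dy_dz * dz_dw

# Illustrative values, not taken from the article.
w, b, x, y_true = 0.5, 0.0, 1.0, 1.0
loss, grad = loss_and_grad(w, b, x, y_true)

# Sanity check: central finite difference of the loss w.r.t. w.
eps = 1e-6
numeric = (loss_and_grad(w + eps, b, x, y_true)[0]
           - loss_and_grad(w - eps, b, x, y_true)[0]) / (2 * eps)
print(grad, numeric)
```

The two printed numbers should agree closely, which is a quick way to check a hand-derived gradient before trusting it in training.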
EXAMPLE: How backpropagation works to minimize the loss
Setup:
- One neuron: y = w * x (no bias, no activation)
- Input x = 1.0
- Target y_true = 1.0
- Initial weight w = 0.5
- Learning rate α = 0.1
We use:
\[
\begin{array}{l}
\displaystyle y_{\text{pred}} = w \cdot x \\\\
\displaystyle L = \frac{1}{2} (y_{\text{true}} - y_{\text{pred}})^2 \\\\
\displaystyle \frac{dL}{dw} = (y_{\text{pred}} - y_{\text{true}}) \cdot x \\\\
\displaystyle w = w - \alpha \cdot \frac{dL}{dw}
\end{array}
\]
ITERATION 1
- y_pred = 0.5 * 1.0 = 0.5
- Error = 1/2 * (0.5 − 1.0)² = 1/2 * 0.25 = 0.125
- Gradient = (0.5 − 1.0) * 1.0 = −0.5
- Weight update = 0.5 − 0.1 * (−0.5) = 0.5 + 0.05 = 0.55
ITERATION 2
- y_pred = 0.55 * 1.0 = 0.55
- Error = 1/2 * (0.55 − 1.0)² = 1/2 * 0.2025 = 0.10125
- Gradient = (0.55 − 1.0) * 1.0 = −0.45
- Weight update = 0.55 − 0.1 * (−0.45) = 0.55 + 0.045 = 0.595
ITERATION 3
- y_pred = 0.595 * 1.0 = 0.595
- Error = 1/2 * (0.595 − 1.0)² = 1/2 * 0.164025 = 0.0820125
- Gradient = (0.595 − 1.0) * 1.0 = −0.405
- Weight update = 0.595 − 0.1 * (−0.405) = 0.595 + 0.0405 = 0.6355
ITERATION 4
- y_pred = 0.6355 * 1.0 = 0.6355
- Error = 1/2 * (0.6355 − 1.0)² = 1/2 * 0.13286025 ≈ 0.0664301
- Gradient = (0.6355 − 1.0) * 1.0 = −0.3645
- Weight update = 0.6355 − 0.1 * (−0.3645) = 0.6355 + 0.03645 = 0.67195
ITERATION 5
- y_pred = 0.67195 * 1.0 = 0.67195
- Error = 1/2 * (0.67195 − 1.0)² ≈ 1/2 * 0.107617 ≈ 0.053808
- Gradient = (0.67195 − 1.0) * 1.0 = −0.32805
- Weight update = 0.67195 − 0.1 * (−0.32805) = 0.67195 + 0.032805 = 0.704755

The above graph shows the loss value across 5 training iterations using backpropagation in a simple neural network.
We observe a steady decrease in loss from 0.125 down to 0.0538. This means that the network is learning. Each time the weight is updated using backpropagation, the output gets closer to the target, and the error becomes smaller.
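The five iterations can be reproduced with a short loop. This is a minimal sketch of the same update rule in plain Python; the starting weight of 0.5 is the value used in iteration 1, and the variable names are chosen here for readability.

```python
x, y_true, alpha = 1.0, 1.0, 0.1   # setup from the example
w = 0.5                            # initial weight used in iteration 1

losses = []
for step in range(1, 6):
    y_pred = w * x                        # forward pass
    loss = 0.5 * (y_true - y_pred) ** 2   # squared-error loss
    grad = (y_pred - y_true) * x          # dL/dw via the chain rule
    w = w - alpha * grad                  # gradient-descent update
    losses.append(loss)
    print(f"iteration {step}: loss={loss:.7f}, grad={grad:.5f}, w={w:.6f}")
```

With these values, each update multiplies the remaining error (y_pred − y_true) by 1 − α·x² = 0.9, so the loss shrinks by a factor of 0.81 per iteration. Running the loop for more steps shows the weight approaching 1.0, the value at which the prediction matches the target.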