This page was last edited on 05 March 2025
With Bellman Equation we estimate the total expected return (reward) from a state s. The state s is the state in which the agent is currently at time t. Bellman helps the agent to know, “If I go from here, what are the long-term rewards?“
In Deep Reinforcement Learning, we use neural networks to approximate value functions because the state space is large or continuous.
Bellman Equation helps us to:
- define the target value during training.
- perform bootstrapping: estimate current value from future predictions.
- learn optimal policies without full knowledge of the environment.

Two Core Bellman Equations
To truly understand how an agent learns from future rewards, we need to break Bellman down into its two core versions.
Each one answers a different question:
- What is the expected value if I follow this policy? → Fixed policy version.
- What is the best value I can get from any state? → Optimality version.
These equations are the foundation of value-based reinforcement learning.
They help the agent estimate not just the next step, but the long-term outcome — by combining immediate rewards and future possibilities.
Bellman Equation: From Dynamic Programming to Reinforcement Learning
In the years 1952 – 1957, Richard Bellman, an American mathematician, introduced the concept of Dynamic Programming. Bellman was looking for a method to optimize multi-stage decision processes. This is how the concept was born that a complex problem can be solved by decomposing it into recursive sub-problems.
Around 1960, Ronald Howard published ‘Dynamic Programming and Markov Processes‘. He combines Bellman’s work with Markov processes, thus appearing the Markov Decision Process (MDP) framework, which represents the standard structure we use today in AI to apply the Bellman Equation.
These two events led to a third one, namely the birth of Temporal Difference (TD) Learning in 1983 – 1988. Rich Sutton and Andrew Barto develop Temporal Difference-type methods. Until then, the Bellman Equation required perfect knowledge of the environment (model-based). Sutton proposed that the agent learn directly from experience, using the difference between the current estimation and the future one (the Bellman error). This is also the moment when the Bellman Equation moves from ‘paper mathematics’ to ‘machine learning’.
The fourth event, and the first one that made it possible for a computer to learn optimal behaviors without having a map of the environment’s probabilities, was the Q-Learning revolution in 1989. This concept was introduced by Chris Watkins, who applied the Bellman Equation directly to State-Action pairs Q(s, a).
The Bellman Equation for a fixed policy π
We use the Bellman Equation to evaluate an existing policy – that is, to find out how good a certain strategy is.
This means that:
- the policy π is fixed – we know exactly what action the agent takes in each state.
- we use the equation to calculate Vπ(s): the value of each state when you follow that policy.
![Rendered by QuickLaTeX.com \[ \hspace{5mm} \fbox{ \begin{array}{c} \vspace{5mm} \\ \displaystyle V^{\pi}(s) = \mathbb{E}_{\pi} \left[ R_{t+1} + \gamma V^{\pi}(S_{t+1}) \,\middle|\, S_t = s \right] \\ \vspace{5mm} \end{array} } \hspace{5mm} \]](https://www.reinforcementlearningpath.com/wp-content/ql-cache/quicklatex.com-03b9459e8d5e715f1474c6f7a6919aa1_l3.png)
Where:
- Vπ(s): value of state s under policy π;
- Eπ: expected value when actions follow policy π;
- Rt+1: immediate reward;
- γ: discount factor (how much future rewards matter);
- St+1: next state;
The Bellman Optimality Equation
We use the Bellman Optimality Equation when we want to find the optimal value of states – meaning, the maximum possible cumulative rewards obtained from each state.
We use the Bellman Optimality Equation when:
- we don’t have a fixed policy.
- we’re looking for the best policy, which maximizes the value in each state.
- at each step we choose the action that gives the maximum value.
![Rendered by QuickLaTeX.com \[ \hspace{5mm} \fbox{ \begin{array}{c} \vspace{5mm} \\ \displaystyle V^*(s) = \max_{a} \, \mathbb{E} \left[ R(s, a) + \gamma V^*(s') \right] \\ \vspace{5mm} \end{array} } \hspace{5mm} \]](https://www.reinforcementlearningpath.com/wp-content/ql-cache/quicklatex.com-581061e444566d85b562feb04ec2ec08_l3.png)
Where:
- V∗(s): optimal value of state s;
- maxa: choose action that gives highest return;
- R(s,a): reward for taking action a in state s;
- s′: next state;
- γ: discount factor;
- E: expected value over next state transitions;
Analogy
Think we’re using Google Maps to find the fastest route.
- Each location is a state.
- Each turn or road is an action.
- The estimated time to destination is the value of a state.
Maps estimate the total time based on our current location plus the time from the next location onward.
Bellman does the same: it updates the value of a state by combining what we get now + what we expect to get later.
Inputs and outputs of the Bellman Equation
| Inputs | State s, action a, reward R, discount factor γ, next state value V(s′) |
| Output | Updated value of the current state V(s) or state-action Q(s,a) |
We use this output as a target to train the value function approximator.
EXAMPLE: Applying the Bellman Equation through 5 manual iterations
Environment Setup:
- States: A, B, C
- Transitions:
- A → B, reward= 1
- B → C, reward= 2
- C → A, reward= 3
- Discount factor: γ= 0.9
- Initial values: V0(A)= V0(B)= V0(C)= 0
Bellman Update Formula:
For each state s, the Bellman update is:
![Rendered by QuickLaTeX.com \[ \hspace{5mm} \fbox{ \begin{array}{c} \vspace{5mm} \\ \displaystyle V(s) = R(s) + \gamma \cdot V(s') \\ \vspace{5mm} \end{array} } \hspace{5mm} \]](https://www.reinforcementlearningpath.com/wp-content/ql-cache/quicklatex.com-6a723a30f1388095632da671f6dba2d0_l3.png)
Where:
- R(s): reward for the action from state s;
- V(s′): current estimated value of the next state;
- γ: discount factor;
ITERATION 1:
Initial values: V0(A)= 0, V0(B)= 0, V0(C)= 0
Formulas and calculations:
- V1(A)= 1+0.9*V0(B)=1+0.9*0= 1.0
- V1(B)= 2+0.9*V0(C)=2+0.9*0= 2.0
- V1(C)= 3+0.9*V0(A)=3+0.9*0= 3.0
ITERATION 2:
Initial values: V1(A)= 1.0, V1(B)= 2.0, V1(C)= 3.0
Formulas and calculations:
- V2(A)= 1+0.9*2.0= 2.8
- V2(B)= 2+0.9*3.0= 4.7
- V2(C)= 3+0.9*1.0= 3.9
ITERATION 3:
Initial values: V2(A)= 2.8, V2(B)= 4.7, V2(C)= 3.9
Formulas and calculations:
- V3(A)= 1+0.9*4.7= 5.23
- V3(B)= 2+0.9*3.9= 5.51
- V3(C)= 3+0.9*2.8= 5.52
ITERATION 4:
Initial values: V3(A)= 5.23, V3(B)= 5.51, V3(C)= 5.52
Formulas and calculations:
- V4(A)= 1+0.9*5.51= 5.959
- V4(B)= 2+0.9*5.52= 6.968
- V4(C)= 3+0.9*5.23= 7.707
ITERATION 5:
Initial values: V4(A)= 5.959, V4(B)= 6.968, V4(C)= 7.707
Formulas and calculations:
- V5(A)= 1+0.9*6.968= 7.271
- V5(B)= 2+0.9*7.707= 8.936
- V5(C)= 3+0.9*5.959= 8.363

Graph interpretation:
- The value of each state increases over time as rewards propagate.
- State A starts with the lowest value but eventually overtakes others due to its position in the reward cycle.
- State values stabilize gradually, demonstrating convergence through recursive Bellman updates.
- This plot illustrates how the agent learns better value estimates over time via repeated Bellman updates.
References:
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press.
- David Silver (2015). Lectures on Reinforcement Learning.
- “Welcome to Spinning Up in Deep RL!” – Open AI.
- Bellman, R. (1957). Dynamic Programming. Princeton University Press.
- Howard, R. A. (1960). Dynamic Programming and Markov Processes. MIT Press.
- Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. PhD Thesis, King’s College.
Problem Classification << Previous | Next >> Model Free Learning