Gradients: Value Function Gradient and Temporal Difference (TD)

This page was last edited on 08 November 2025

In the previous tutorial, we saw how the derivative of a function f(x) at a point x tells us the rate of change in the function and the direction in which the function changes. The gradient is an extension of the derivative for functions with more variables and tells us in which direction we need to move to maximize/minimize the function.

In Reinforcement Learning, the gradient improves the agent’s decisions by adjusting the parameters of the policy or value function so that the cumulative reward received is maximized.

What does the gradient of a function indicate?

  • Direction of change – tells us whether the function is increasing or decreasing.
  • Magnitude of the gradient – tells us how fast the function is changing.

The main applications of gradients in Reinforcement Learning are:

  • In Policy Optimization, gradients are used to directly optimize a policy. The gradient is calculated based on the objective function to improve the agent’s performance.
  • Another example is in Q-Learning and DQN. The gradient is used in optimizing the value function, where the loss is defined by the Mean Squared Error (MSE) between the estimated Q value and the Bellman target.
  • In Exploration vs. Exploitation, by adjusting the weights of the neural networks, the gradient helps improve the estimates of future rewards. This improvement leads the agent to better policies.
  • In Actor-Critic algorithms, the gradient balances learning between the actor (policy) and the critic (value function), preventing large oscillations in updates.
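To make the Q-Learning item above more concrete, here is a minimal sketch of a squared-error loss between one Q estimate and its Bellman target, together with the gradient of that loss. The function name, the factor 0.5 in the loss, and all the numbers are assumptions chosen for illustration; they do not come from a specific library or from this tutorial.

```python
def q_learning_loss_and_grad(q_sa, reward, next_q_values, gamma=0.9):
    """Squared-error loss for a single Q estimate (illustrative helper).

    q_sa          : current estimate Q(s, a)
    reward        : reward observed after taking action a in state s
    next_q_values : estimates Q(s', a') for every action a' in the next state
    gamma         : discount factor
    """
    # Bellman target, treated as a constant when differentiating.
    target = reward + gamma * max(next_q_values)
    error = q_sa - target
    # The 0.5 factor is an assumption that keeps the gradient equal to the error.
    loss = 0.5 * error ** 2
    grad = error  # d(loss) / d(q_sa)
    return loss, grad


# Arbitrary example numbers.
loss, grad = q_learning_loss_and_grad(q_sa=1.0, reward=1.0, next_q_values=[0.5, 2.0])
learning_rate = 0.1
updated_q = 1.0 - learning_rate * grad  # gradient descent step on the loss
print(loss, grad, updated_q)
```

A negative gradient here means the estimate is below its target, so the descent step raises Q(s, a).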

ANALOGY

We can look at the gradient as a mountain guide that helps a hiker reach the top of the mountain. Without a guide (the gradient), the hiker would randomly try to climb, sometimes ending up on the wrong paths.

Let’s try a simple example to gain a deeper understanding of gradients and their roles in Reinforcement Learning.

STEP 1: Gradient formula

The gradient of a function f(x) is its derivative:

     \[ \hspace{5mm} \fbox{     \begin{array}{c}         \vspace{5mm} \\          \displaystyle \nabla f(x) = \frac{d}{dx} f(x) \\           \vspace{5mm}      \end{array} }  \hspace{5mm} \]

We will use the simple function:

     \[ \hspace{5mm} \fbox{     \begin{array}{c}         \vspace{5mm} \\          \displaystyle f(x) = x^2 \\           \vspace{5mm}      \end{array} }  \hspace{5mm} \]

The gradient of this function is:

     \[ \hspace{5mm} \fbox{     \begin{array}{c}         \vspace{5mm} \\          \displaystyle \nabla f(x) = \frac{d}{dx} (x^2) = 2x \\           \vspace{5mm}      \end{array} }  \hspace{5mm} \]

This formula tells us how the function changes as a function of x.

Where:

  • ∇f(x) – the gradient of the function.
  • x – the point where we calculate the gradient.
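As a quick sanity check, the analytic gradient 2x can be compared with a numerical estimate. The sketch below is only an illustration: the function names and the step size h are assumptions, not part of the tutorial.

```python
def f(x):
    """The example function f(x) = x^2."""
    return x ** 2


def analytic_gradient(x):
    """Closed-form gradient of f: d/dx (x^2) = 2x."""
    return 2 * x


def numerical_gradient(func, x, h=1e-5):
    """Central-difference estimate of the derivative (h is an assumed step size)."""
    return (func(x + h) - func(x - h)) / (2 * h)


# Compare both gradients at a few sample points.
for x in (4.0, 0.0, -4.0):
    print(x, analytic_gradient(x), round(numerical_gradient(f, x), 6))
```

The two columns should agree up to floating-point rounding, which supports the formula ∇f(x) = 2x.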

STEP 2: Applying the Gradient on 5 Iterations

We choose an initial value x0=4, decrease x by 2 at each iteration, and calculate the gradient at each of the 5 resulting points.

ITERATION 1

Gradient at x=4

     \[ \hspace{5mm} \fbox{     \begin{array}{c}         \vspace{5mm} \\          \displaystyle \nabla f(4) = 2 \times 4 = 8 \\           \vspace{5mm}      \end{array} }  \hspace{5mm} \]


ITERATION 2

Gradient at x=2

     \[ \hspace{5mm} \fbox{     \begin{array}{c}         \vspace{5mm} \\          \displaystyle \nabla f(2) = 2 \times 2 = 4 \\           \vspace{5mm}      \end{array} }  \hspace{5mm} \]


ITERATION 3

Gradient at x=0

     \[ \hspace{5mm} \fbox{     \begin{array}{c}         \vspace{5mm} \\          \displaystyle \nabla f(0) = 2 \times 0 = 0 \\           \vspace{5mm}      \end{array} }  \hspace{5mm} \]


ITERATION 4

Gradient at x=-2

     \[ \hspace{5mm} \fbox{     \begin{array}{c}         \vspace{5mm} \\          \displaystyle \nabla f(-2) = 2 \times (-2) = -4 \\           \vspace{5mm}      \end{array} }  \hspace{5mm} \]


ITERATION 5

Gradient at x=-4

     \[ \hspace{5mm} \fbox{     \begin{array}{c}         \vspace{5mm} \\          \displaystyle \nabla f(-4) = 2 \times (-4) = -8 \\           \vspace{5mm}      \end{array} }  \hspace{5mm} \]


This is what the graph showing the gradient of the function f(x)=x² for the 5 iterations looks like:

Gradient at different points on f(x)
This graph illustrates the gradient of the function f(x) at five different points.

Observations:

  • For x>0, the gradient is positive, meaning the function increases.
  • For x<0, the gradient is negative, meaning the function decreases.
  • For x=0, the gradient is zero, meaning x=0 is a critical point (here, the global minimum).
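The five evaluations and the observations above can be reproduced with a few lines of Python. This is a minimal sketch; the helper names `gradient` and `describe` are made up for the illustration.

```python
def gradient(x):
    """Gradient of f(x) = x^2."""
    return 2 * x


def describe(g):
    """Translate the sign of the gradient into the behavior of f."""
    if g > 0:
        return "increasing"
    if g < 0:
        return "decreasing"
    return "flat (critical point)"


# The five points used in the iterations above.
points = [4, 2, 0, -2, -4]
gradients = [gradient(x) for x in points]

for x, g in zip(points, gradients):
    print(f"x = {x:>2}  gradient = {g:>2}  f is {describe(g)}")
```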

In Reinforcement Learning (RL) there are several types of gradients, each with a specific role in indicating the direction and magnitude of parameter adjustments.

1. Value Function Gradient

A Value Function measures how good it is for an agent to be in a state (or state-action pair), indicating the expected future reward when a certain policy is followed.

Simple gradients are general-purpose and don’t capture the dynamics and temporal dependencies present in RL tasks. The Value Function Gradient is specific to RL: it relates the policy parameters to the cumulative reward, so that the parameters can be adjusted to maximize it.

ANALOGY

Imagine the sails on a sailboat:

  • The environment: the direction of the wind.
  • Policy parameters: the sail angle.
  • Value function (reward): boat speed.

Value Function Gradient is like measuring how the speed changes by small adjustments to the sail angle. The goal is to find the sail angle (policy parameters) that maximizes the speed (cumulative reward) of the boat. Adjusting the sail based on how the speed changes (gradient) leads to the optimal position.

The general formula for Value Function Gradient is:

     \[ \hspace{5mm} \fbox{     \begin{array}{c}         \vspace{5mm} \\          \displaystyle \nabla_{\theta} V^{\pi_{\theta}}(s) =          \mathbb{E}_{\pi_{\theta}} \left[         \sum_{t=0}^{\infty} \nabla_{\theta} \log \pi_{\theta} (a_t | s_t) Q^{\pi_{\theta}}(s_t, a_t) \Bigg| s_0 = s         \right] \\           \vspace{5mm}      \end{array} }  \hspace{5mm} \]

Where:

  • ∇_θ V^{π_θ}(s): the gradient of the value function with respect to the policy parameters θ.
  • E_{π_θ}[⋅]: expectation taken over all actions and states sampled from the policy π_θ.
  • Σ_{t=0}^{∞}: sum over all future timesteps.
  • ∇_θ log π_θ(a_t | s_t): gradient of the log probability of taking action a_t in state s_t.
  • Q^{π_θ}(s_t, a_t): state-action value function, indicating expected cumulative rewards.

Step 1: Defining a Simple Scenario

For manual calculations, we’ll simplify the formula to a scenario with:

  • One timestep, t=0.
  • A single action a, so the expectation reduces to one term.
  • Simple policy and value functions.

The simplified form becomes:

     \[ \hspace{5mm} \fbox{     \begin{array}{c}         \vspace{5mm} \\          \displaystyle \nabla_{\theta} V(\theta) =          \nabla_{\theta} \log \pi_{\theta} (a | s) \cdot Q(s, a) \\           \vspace{5mm}      \end{array} }  \hspace{5mm} \]

Consider a simple policy parameterized by a single parameter θ:

     \[ \hspace{5mm} \fbox{     \begin{array}{c}         \vspace{5mm} \\          \displaystyle \pi_{\theta}(a | s) = \frac{1}{1 + e^{-\theta}} \quad \text{(sigmoid policy)} \\           \vspace{5mm}      \end{array} }  \hspace{5mm} \]

Assume a constant state-action value function Q(s,a)=5 (for simplicity).

With Q(s,a)=5, our gradient becomes:

     \[ \hspace{5mm} \fbox{     \begin{array}{c}         \vspace{5mm} \\          \displaystyle \nabla_{\theta} V(\theta) =          5 \cdot \nabla_{\theta} \log \pi_{\theta} (a | s) \\           \vspace{5mm}      \end{array} }  \hspace{5mm} \]

Step 2: Manual Calculations for Five Iterations

Initial Conditions:

  • Initial parameter: θ0=0.
  • Learning rate: α=0.1.
  • Q(s,a)=5 (constant).

Iteration Steps:

We perform the gradient ascent update:

     \[ \hspace{5mm} \fbox{     \begin{array}{c}         \vspace{5mm} \\          \displaystyle \theta_{\text{next}} = \theta_{\text{current}} + \alpha \cdot \nabla_{\theta} V(\theta_{\text{current}}) \\           \vspace{5mm}      \end{array} }  \hspace{5mm} \]

The derivative of the log of the sigmoid policy is:

     \[ \hspace{5mm} \fbox{     \begin{array}{c}         \vspace{5mm} \\          \displaystyle \nabla_{\theta} \log \pi_{\theta}(a | s) = 1 - \pi_{\theta}(a | s) \\           \vspace{5mm}      \end{array} }  \hspace{5mm} \]
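For readers who want the intermediate step, this identity follows directly from the sigmoid policy defined earlier. The short derivation below uses only that definition and the chain rule:

     \[ \begin{array}{rcl}         \displaystyle \log \pi_{\theta}(a | s) & = & \displaystyle -\log \left( 1 + e^{-\theta} \right) \\[3mm]         \displaystyle \nabla_{\theta} \log \pi_{\theta}(a | s) & = & \displaystyle \frac{e^{-\theta}}{1 + e^{-\theta}} \; = \; 1 - \frac{1}{1 + e^{-\theta}} \; = \; 1 - \pi_{\theta}(a | s)      \end{array} \]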

ITERATION 1

Current θ0=0

Compute πθ​(a∣s):

     \[ \hspace{5mm} \fbox{     \begin{array}{c}         \vspace{5mm} \\          \displaystyle \pi_0 (a | s) = \frac{1}{1 + e^{-0}} = 0.5 \\           \vspace{5mm}      \end{array} }  \hspace{5mm} \]

Compute gradient:

     \[ \hspace{5mm} \fbox{     \begin{array}{c}         \vspace{5mm} \\          \displaystyle 1 - 0.5 = 0.5 \\           \vspace{5mm}      \end{array} }  \hspace{5mm} \]

Update parameter:

     \[ \hspace{5mm} \fbox{     \begin{array}{c}         \vspace{5mm} \\          \displaystyle \theta_1 = 0 + 0.1 \cdot (5 \times 0.5) = 0.25 \\           \vspace{5mm}      \end{array} }  \hspace{5mm} \]


ITERATION 2

Current θ1=0.25

Compute πθ​(a∣s):

     \[ \hspace{5mm} \fbox{     \begin{array}{c}         \vspace{5mm} \\          \displaystyle \pi_{0.25} (a | s) = \frac{1}{1 + e^{-0.25}} \approx 0.5622 \\           \vspace{5mm}      \end{array} }  \hspace{5mm} \]

Compute gradient:

     \[ \hspace{5mm} \fbox{     \begin{array}{c}         \vspace{5mm} \\          \displaystyle 1 - 0.562 = 0.438 \\           \vspace{5mm}      \end{array} }  \hspace{5mm} \]

Parameter update:

     \[ \hspace{5mm} \fbox{     \begin{array}{c}         \vspace{5mm} \\          \displaystyle \theta_2 = 0.25 + 0.1 \cdot (5 \times 0.438) = 0.25 + 0.219 = 0.469 \\           \vspace{5mm}      \end{array} }  \hspace{5mm} \]


ITERATION 3

Current θ2=0.469

Compute πθ​(a∣s):

     \[ \hspace{5mm} \fbox{     \begin{array}{c}         \vspace{5mm} \\          \displaystyle \pi_{0.469} (a | s) = \frac{1}{1 + e^{-0.469}} \approx 0.615 \\           \vspace{5mm}      \end{array} }  \hspace{5mm} \]

Gradient:

     \[ \hspace{5mm} \fbox{     \begin{array}{c}         \vspace{5mm} \\          \displaystyle 1 - 0.615 = 0.385 \\           \vspace{5mm}      \end{array} }  \hspace{5mm} \]

Parameter update:

     \[ \hspace{5mm} \fbox{     \begin{array}{c}         \vspace{5mm} \\          \displaystyle \theta_3 = 0.469 + 0.1 \cdot (5 \times 0.385) = 0.469 + 0.1925 = 0.6615 \\           \vspace{5mm}      \end{array} }  \hspace{5mm} \]


ITERATION 4

Current θ3=0.6615

Compute πθ​(a∣s):

     \[ \hspace{5mm} \fbox{     \begin{array}{c}         \vspace{5mm} \\          \displaystyle \pi_{0.6615} (a | s) = \frac{1}{1 + e^{-0.6615}} \approx 0.659 \\           \vspace{5mm}      \end{array} }  \hspace{5mm} \]

Gradient:

     \[ \hspace{5mm} \fbox{     \begin{array}{c}         \vspace{5mm} \\          \displaystyle 1 - 0.659 = 0.341 \\           \vspace{5mm}      \end{array} }  \hspace{5mm} \]

Parameter update:

     \[ \hspace{5mm} \fbox{     \begin{array}{c}         \vspace{5mm} \\          \displaystyle \theta_4 = 0.6615 + 0.1 \times (5 \times 0.341) = 0.6615 + 0.1705 = 0.832 \\           \vspace{5mm}      \end{array} }  \hspace{5mm} \]


ITERATION 5

Current θ4=0.832

Compute πθ​(a∣s):

     \[ \hspace{5mm} \fbox{     \begin{array}{c}         \vspace{5mm} \\          \displaystyle \pi_{0.832} (a | s) = \frac{1}{1 + e^{-0.832}} \approx 0.696 \\           \vspace{5mm}      \end{array} }  \hspace{5mm} \]

Gradient:

     \[ \hspace{5mm} \fbox{     \begin{array}{c}         \vspace{5mm} \\          \displaystyle 1 - 0.696 = 0.304 \\           \vspace{5mm}      \end{array} }  \hspace{5mm} \]

Parameter update:

     \[ \hspace{5mm} \fbox{     \begin{array}{c}         \vspace{5mm} \\          \displaystyle \theta_5 = 0.832 + 0.1 \times (5 \times 0.304) = 0.832 + 0.152 = 0.984 \\           \vspace{5mm}      \end{array} }  \hspace{5mm} \]


Below is the graphical representation, showing how the parameter θ increases at each iteration toward maximizing the expected value:

Value Function Gradient: Parameter Updates (Iterations 0-5)

If we look at the graph, we notice that the value of the parameter θ increases at every iteration, although each step is slightly smaller than the previous one because the gradient 5·(1 − πθ(a∣s)) shrinks as θ grows. This behavior confirms that the Value Function Gradient moves the policy parameters in a direction where the value (expected reward) becomes larger.
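The five manual iterations can also be reproduced in code. The sketch below uses the same setup (θ0=0, α=0.1, Q(s,a)=5); the function names are made up for the illustration. Because the hand calculation rounds intermediate values, the last digits may differ slightly from the numbers above.

```python
import math


def sigmoid(theta):
    """Sigmoid policy: pi_theta(a|s) = 1 / (1 + e^(-theta))."""
    return 1.0 / (1.0 + math.exp(-theta))


def run_gradient_ascent(theta0=0.0, alpha=0.1, q_value=5.0, iterations=5):
    """Repeat theta <- theta + alpha * (1 - pi_theta) * Q and record every theta."""
    thetas = [theta0]
    theta = theta0
    for _ in range(iterations):
        pi = sigmoid(theta)
        grad = (1.0 - pi) * q_value  # grad log pi times Q(s, a)
        theta = theta + alpha * grad  # gradient ascent step
        thetas.append(theta)
    return thetas


thetas = run_gradient_ascent()
for i, theta in enumerate(thetas):
    print(f"theta_{i} = {theta:.4f}")
```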


2. Temporal Difference (TD) Gradient

The Temporal Difference (TD) Gradient updates value estimates after every step, using the current prediction for the next state instead of waiting for an entire episode to finish. This makes it more efficient in online learning scenarios.

It combines elements of Monte Carlo (MC) methods, which learn from sampled experience but wait for complete episodes, and Dynamic Programming (DP), which updates values iteratively using bootstrapping (that is, using existing estimates of later states).

The general formula for the TD error, the quantity that drives the Temporal Difference (TD) Gradient update, is:

     \[ \hspace{5mm} \fbox{     \begin{array}{c}         \vspace{5mm} \\          \displaystyle TD(t) = r_t + \gamma V(s_{t+1}) - V(s_t) \\           \vspace{5mm}      \end{array} }  \hspace{5mm} \]

Where:

  • r_t is the reward at time t.
  • γ is the discount factor (0≤γ≤1).
  • V(s_t) is the estimated value of state s_t.
  • V(s_{t+1}) is the estimated value of the next state s_{t+1}.

The TD gradient update rule is:

     \[ \hspace{5mm} \fbox{     \begin{array}{c}         \vspace{5mm} \\          \displaystyle V(s_t) \leftarrow V(s_t) + \alpha \cdot TD(t) \\           \vspace{5mm}      \end{array} }  \hspace{5mm} \]

Where:

  • α is the learning rate.
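The two formulas above translate almost directly into code. The sketch below is one possible way to write them; the function names and the example numbers are assumptions chosen so that the arithmetic is easy to follow.

```python
def td_error(reward, value_next, value_current, gamma=1.0):
    """TD(t) = r_t + gamma * V(s_{t+1}) - V(s_t)."""
    return reward + gamma * value_next - value_current


def td_update(value_current, reward, value_next, alpha=0.1, gamma=1.0):
    """V(s_t) <- V(s_t) + alpha * TD(t). Returns the new estimate for s_t."""
    return value_current + alpha * td_error(reward, value_next, value_current, gamma)


# Arbitrary numbers: TD error = 1.0 + 0.5 * 3.0 - 2.0 = 0.5
new_value = td_update(value_current=2.0, reward=1.0, value_next=3.0, alpha=0.5, gamma=0.5)
print(new_value)  # 2.25
```

A positive TD error means the next step looked better than predicted, so the estimate for the current state is raised; a negative one lowers it.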

ANALOGY

When you learn to drive a car with a GPS navigation system, you adjust your driving at each turn based on what the GPS tells you about upcoming roads. If you see a wrong turn ahead, you correct early rather than waiting to reach the destination.

Consider a simple RL environment where an agent moves between three states:

  • State A → State B → State C
  • The agent receives a reward of 0 for the transitions out of A and B. The +1 associated with C is not used in the updates below, so V(C) stays at 0
  • We initialize the state-value function arbitrarily: V(A)=0.5, V(B)=0.3, V(C)=0
  • Discount factor γ=1, learning rate α=0.1

ITERATION 1

At A:
TD(A)=0+1⋅V(B)−V(A)=0.3−0.5=−0.2
V(A)←V(A)+α⋅TD(A)=0.5+0.1×(−0.2)=0.48

At B:
TD(B)=0+1⋅V(C)−V(B)=0−0.3=−0.3
V(B)←V(B)+0.1×(−0.3)=0.27

Updated values after iteration 1: V(A)=0.48,V(B)=0.27,V(C)=0


ITERATION 2

At A:
TD(A)=0+1⋅V(B)−V(A)=0.27−0.48=−0.21
V(A)←0.48+0.1×(−0.21)=0.459

At B:
TD(B)=0+1⋅V(C)−V(B)=0−0.27=−0.27
V(B)←0.27+0.1×(−0.27)=0.243

Updated values after iteration 2: V(A)=0.459,V(B)=0.243,V(C)=0


ITERATION 3

At A:
TD(A)=0+1⋅V(B)−V(A)=0.243−0.459=−0.216
V(A)←0.459+0.1×(−0.216)=0.4374

At B:
TD(B)=0+1⋅V(C)−V(B)=0−0.243=−0.243
V(B)←0.243+0.1×(−0.243)=0.2187

Updated values after iteration 3: V(A)=0.4374,V(B)=0.2187,V(C)=0


ITERATION 4

At A:
TD(A)=0+1⋅V(B)−V(A)=0.2187−0.4374=−0.2187
V(A)←0.4374+0.1×(−0.2187)=0.41553

At B:
TD(B)=0+1⋅V(C)−V(B)=0−0.2187=−0.2187
V(B)←0.2187+0.1×(−0.2187)=0.19683

Updated values after iteration 4: V(A)=0.41553,V(B)=0.19683,V(C)=0


ITERATION 5

At A:
TD(A)=0+1⋅V(B)−V(A)=0.19683−0.41553=−0.2187
V(A)←0.41553+0.1×(−0.2187)=0.39366

At B:
TD(B)=0+1⋅V(C)−V(B)=0−0.19683=−0.19683
V(B)←0.19683+0.1×(−0.19683)=0.177147

Updated values after iteration 5: V(A)=0.39366,V(B)=0.177147,V(C)=0


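This example demonstrates how TD gradient updates iteratively refine the state-value estimates. Because every update above uses a reward of 0 and V(C)=0, the estimates for A and B shrink step by step toward 0, the value implied by those observed rewards.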

Simple Temporal Difference (TD) Gradient example

The graph above visually demonstrates how the values of states A, B, and C change over the 5 iterations using the Temporal Difference (TD) Gradient method.
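The same five iterations can be reproduced with a short script. This is a minimal sketch that mirrors the hand calculation exactly: A is updated before B in every iteration, both transitions use a reward of 0, and V(C) is never updated. The function and variable names are made up for the illustration.

```python
def run_td_chain(iterations=5, alpha=0.1, gamma=1.0):
    """Apply TD updates along the chain A -> B -> C and record the values."""
    values = {"A": 0.5, "B": 0.3, "C": 0.0}
    # (state, reward, next_state) exactly as used in the hand calculation.
    transitions = [("A", 0.0, "B"), ("B", 0.0, "C")]
    history = [dict(values)]
    for _ in range(iterations):
        for state, reward, next_state in transitions:
            td = reward + gamma * values[next_state] - values[state]
            values[state] += alpha * td
        history.append(dict(values))
    return history


history = run_td_chain()
for i, snapshot in enumerate(history):
    print(i, {state: round(v, 6) for state, v in snapshot.items()})
```

Changing the reward in the second transition is an easy way to explore how a non-zero reward would propagate back from B to A over the iterations.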

In RL applications, the Temporal Difference (TD) Gradient is applied in cases like:

  • TD gradient methods enable robots to continuously adjust their paths based on immediate sensor feedback and predicted future states, allowing adaptive navigation in dynamic environments.
  • TD gradient algorithms are used to predict future game outcomes from intermediate game states, helping AI models improve decision-making through incremental updates.
  • Self-driving cars utilize TD methods to update their driving policies in real-time based on current state information and future reward predictions, improving safety and efficiency.
  • In video games, TD methods allow game agents to adapt dynamically to player actions and adjust tactics by learning incrementally at each time step.
  • In finance, TD methods facilitate incremental estimation of asset values and expected returns, providing immediate adjustments based on current market conditions.
