Temporal Difference (TD) Example for Q-Learning and DQN

by Dragos Calin
in Deep RL Algorithms, DQN, Q-Learning, RL Fundamentals

The main purpose of this tutorial is to explain how the Temporal Difference (TD) mechanism works. It is not just about knowing a mathematical formula; we want to understand the actual process of learning:

  • Where exactly does the learning occur?
  • What is the concrete moment when the algorithm modifies its internal values?

Although the RL algorithms – Q-Learning, DQN, SAC, and PPO – seem different, they all use TD inside. Q-Learning uses tables. DQN uses neural networks. SAC and PPO have Actor-Critic architectures. All of them are based on the same operational principle: the TD mechanism.

By the end of the tutorial, we will see how the values are adjusted mathematically after each iteration, comparing the old prediction with the new information, for both Q-Learning and DQN. We will see exactly when the agent learns from experience and adjusts its parameters.

Table of Contents

  • Why for Q-Learning and DQN?
  • What is Temporal Difference
  • When is TD applied
  • The origins of TD
  • TD in Q-Learning
  • TD in DQN
  • Wrapping Up

Why for Q-Learning and DQN?

In this tutorial, I have chosen to exemplify the mechanism of temporal difference learning (TD) using only Q-Learning and Deep Q-Network (DQN).

  • Q-Learning shows the mechanism in a simple framework, without neural networks.
  • DQN extends the same mechanism to a modern framework with neural networks.

Starting with Q-Learning, we can clearly see the process: state, action, reward, Q update, TD error. All these elements are visible in a simple representation.

We then move on to DQN. This allows us to show how the same process is extended: the state can be a vector or an image, and it is the network parameters θ that are updated, but the moment of “learning” remains identifiable (target calculation, TD error, backpropagation).

The main goal of this tutorial is to show where exactly learning takes place and what the concrete moment of the update is.

What is Temporal Difference (TD)?

In Reinforcement Learning, or rather in any of the four algorithms mentioned above, the goal is to estimate a value function: V(s) or Q(s,a). That is, the agent tries to find out the total reward that it expects from a certain state, or from a combination of state and action. This is the basis of all RL algorithms: learning values that guide decisions.

TD provides an elegant solution for most RL applications. It updates the estimate immediately after each step. This is different from methods such as Monte Carlo, which wait until the end of the episode to update the values. That strategy is inefficient or impossible to apply in continuing environments (tasks with no natural end), for a simple reason: the agent has no clear end of episode (for example, a robot that works non-stop), so it cannot wait for the “end of the episode” to update its values. Monte Carlo is therefore not practical in such cases.

TD allows us to “learn from the near future”. The agent uses a prediction for the next step (what comes immediately next), not a complete experience (waiting until the episode is over). Paradoxical as it may seem, this is highly efficient: we learn from an estimated future, not from a real ending. This approach makes reinforcement learning incremental, online, and scalable.
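For reference, the simplest form of this idea is the standard TD(0) update for state values (a textbook formula, not something specific to this tutorial):

\[ V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma \, V(s_{t+1}) - V(s_t) \right] \]

The term in brackets is the TD error. It is built from the immediate reward and the bootstrapped estimate V(s_{t+1}), not from the full return of the episode. The Q-Learning and DQN updates below have exactly the same shape.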

When is TD applied?

TD is applied at each time step of the agent’s interaction with the environment. Each time the agent takes an action, receives a reward, and reaches a new state, the TD error is calculated. This moment is the essence of TD: the agent adjusts the values (V or Q) based on the difference between what it expected and what it got.

TD is efficient in problems with large state spaces. When the state space is large, as in robotics, 3D games, or continuous environments, you cannot compute complete returns for everything, so incremental learning saves time and memory.

The origins of TD

Richard S. Sutton is the researcher who first introduced the idea of learning through temporal differences. Without Sutton’s contribution in 1988, RL probably would not have evolved into modern methods (Q-Learning, Actor-Critic, DQN, etc.). TD is the core on which all these algorithms were built.

The word “temporal” refers to the time dimension. The learning process is dependent on the passage of time. It is done based on the changes between successive moments (t and t+1).

The word “difference” refers to the difference between successive estimates. The current prediction is compared with the updated one after receiving new feedback from the environment. This is how we calculate the TD error which is used for learning.

TD in Q-Learning

At each step in an episode, Q-Learning uses TD error to compare what it thought would happen with what actually happened. TD is the basic mechanism by which Q-Learning continually corrects its estimates of the future.

The complete Q-Learning formula:

\[ Q(s, a) \leftarrow Q(s, a) + \alpha \cdot \left[ r + \gamma \cdot \max_{a'} Q(s', a') - Q(s, a) \right] \]

Info: If you still don’t know the complete Q-Learning equation, access this detailed tutorial: What Is Q-Learning? Formula and Explanation.

The part that represents Temporal Difference (TD Error) – Q-Learning

\[ \left[\, r + \gamma \cdot \max_{a'} Q(s', a') - Q(s, a) \,\right] \]

This is the TD error (Temporal Difference error): the difference between the updated value based on immediate feedback and the previous estimate.

In many papers it is denoted as:

\[ \delta = r + \gamma \cdot \max_{a'} Q(s', a') - Q(s, a) \]

Representation with all components labeled

[Figure: Notation for TD in Q-Learning, with every component of the update labeled]

The steps of the TD update in Q-Learning

  1. The agent predicts how good an action is (the value Q(s, a)).
  2. After the action, it receives a real reward r and sees the new state s′.
  3. It calculates what it should have gotten: estimate = r + γ · max Q(s′, a′).
  4. It compares this new estimate with the previous prediction. The difference is the TD error: δ = (what happened) − (what was expected).
  5. It updates the value Q(s, a) proportionally to this error (see the code sketch after this list).
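As a minimal code sketch of these five steps (the td_update helper, the Q dictionary, and the variable names are hypothetical, not part of a specific library):

```python
# Minimal sketch of one tabular Q-Learning step.
from collections import defaultdict

Q = defaultdict(float)     # Q[(state, action)], every entry starts at 0
alpha, gamma = 0.5, 0.9

def td_update(s, a, r, s_next, actions, terminal):
    prediction = Q[(s, a)]                                          # step 1: old estimate
    best_next = 0.0 if terminal else max(Q[(s_next, b)] for b in actions)
    target = r + gamma * best_next                                  # step 3: TD target
    delta = target - prediction                                     # step 4: TD error
    Q[(s, a)] = prediction + alpha * delta                          # step 5: the update
    return delta
```

Everything else in Q-Learning (exploration, episodes, schedules) happens around this function; the learning itself is the last assignment to Q[(s, a)].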

EXAMPLE 1: TD process in Q-Learning

Example setup

We want to show how the Q-value changes over time, not just after one update.

  • States: S0, S1, S2
  • Actions: Right
  • Rewards:
    • From S0 → S1: r = 0
    • From S1 → S2: r = 0
    • From S2 → Terminal: r = +1
  • Learning rate: α = 0.5
  • Discount factor: γ = 0.9
  • Initial Q: all Q = 0
  • TD target: rₜ₊₁ + γ · max Q(sₜ₊₁, a′)
  • TD error: δₜ = TD target − Q(sₜ, aₜ)
  • Update: Q(sₜ, aₜ) ← Q(sₜ, aₜ) + α · δₜ

With the iterations below, we’ll see how each time step (t → t+1) transfers reward information backward in time through the TD error. This is the “temporal” part.


ITERATION 1: time t = 0 → 1

Transition: S0 → S1, r₁ = 0

  • TD target = 0 + 0.9 · max Q(S1, a′) = 0
  • TD error = 0 − 0 = 0
  • Update: Q(S0, Right) = 0

At t = 1, no change yet because the future reward is still unknown.


ITERATION 2: time t = 1 → 2

Transition: S1 → S2, r₂ = 0

  • TD target = 0 + 0.9 · max Q(S2, a′) = 0
  • TD error = 0 − 0 = 0
  • Update: Q(S1, Right) = 0

Still no change at t = 2. The final reward has not yet arrived.


ITERATION 3: time t = 2 → 3

Transition: S2 → Terminal, r₃ = +1

  • TD target = 1 + 0.9 · 0 (the terminal state has value 0) = 1
  • TD error = 1 − 0 = 1
  • Update: Q(S2, Right) = 0 + 0.5 · 1 = 0.5

ITERATION 4: time t = 0 → 1 (new episode begins)

Transition: S0 → S1, r₁ = 0

Status of the values before the step: Q(S2, Right) = 0.5, Q(S1, Right) = 0, Q(S0, Right) = 0

  • TD target = 0 + 0.9 · max Q(S1, a′) = 0 + 0.9 · 0 = 0
  • TD error = 0 − Q(S0, Right) = 0 − 0 = 0
  • Update: Q(S0, Right) = 0 + 0.5 · 0 = 0

ITERATION 5: time t = 1 → 2 (still in the same new episode)

Transition: S1 → S2, r₂ = 0

  • TD target = 0 + 0.9 · max Q(S2, a′) = 0 + 0.9 · 0.5 = 0.45
  • TD error = 0.45 − Q(S1, Right) = 0.45 − 0 = 0.45
  • Update: Q(S1, Right) = 0 + 0.5 · 0.45 = 0.225

CONCLUSION – EXAMPLE 1
| EPISODE | TRANSITION | LOCAL TIME | TD ERROR | OBSERVATION |
|---|---|---|---|---|
| 1 | S0 → S1 | 0 → 1 | 0 | No reward |
| 1 | S1 → S2 | 1 → 2 | 0 | No reward |
| 1 | S2 → END | 2 → 3 | +1 | TD updates Q(S₂): “this action was good!” |
| 2 | S0 → S1 | 0 → 1 | 0 | The new episode begins. The agent is back at the starting state. Since S₁ still has no learned value, there is no update at this step. |
| 2 | S1 → S2 | 1 → 2 | +0.45 | The information from the reward stored in S₂ (0.5) now propagates backward one step. The agent learns that being in S₁ is valuable because it leads to S₂, which leads to a reward. |

When the agent receives a reward, it occurs at a point in time tₖ, after a specific action. But to truly learn, the agent must understand which previous actions and states led to that reward.

In other words, the agent must learn:

“It’s not just that the final action was good. The previous one contributed to the success.”

This is where “reward propagates backward” comes in.
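To make this concrete, here is a short Python sketch that replays Example 1 for two episodes (a hypothetical standalone script; the state names and hyperparameters are the ones from the setup above). Running it reproduces the values from the table: Q(S₂) = 0.5 after episode 1 and Q(S₁) = 0.225 after episode 2.

```python
# Replays Example 1: tabular Q-Learning on the 3-state chain with a single "Right" action.
alpha, gamma = 0.5, 0.9
Q = {"S0": 0.0, "S1": 0.0, "S2": 0.0}   # one action per state, so one value per state

# One episode as (state, next_state, reward, next_state_is_terminal)
episode = [("S0", "S1", 0.0, False),
           ("S1", "S2", 0.0, False),
           ("S2", None, 1.0, True)]

for ep in (1, 2):
    for s, s_next, r, terminal in episode:
        target = r if terminal else r + gamma * Q[s_next]   # TD target
        delta = target - Q[s]                               # TD error
        Q[s] += alpha * delta                               # TD update
        print(f"episode {ep}, {s}: target={target:.3f}, delta={delta:.3f}, Q={Q[s]:.3f}")
```

Each additional episode pushes the reward information one more step backward; a third episode would start raising Q(S₀) as well.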


TD in DQN

For DQN, the same TD mechanism as in Q-Learning is applied. The difference is that instead of a Q-table, DQN uses a neural network that approximates the function Q(s, a; θ), where θ represents the parameters (weights) of the network.

Thus, in DQN, TD Error no longer updates a value from a table. It serves as an error signal for updating the neural weights via backpropagation.

The goal is the same: to learn a function that uses the current state to estimate the total value of future rewards.

The complete DQN formula:

\[ \mathcal{L}(\theta) = \mathbb{E}\Bigl[ \bigl( r_t + \gamma \, \max_{a'} Q(s_{t+1}, a'; \theta^{-}) - Q(s_t, a_t; \theta) \bigr)^2 \Bigr] \]

Info: If you still don’t know the complete DQN equation, access this detailed tutorial: Deep Q Network (DQN) – Formula and Explanation

The part that represents Temporal Difference (TD Error) – DQN

\[ \left[\, r + \gamma \cdot \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta) \,\right] \]

This is the standard form of the TD error in DQN. The differences from the Q-Learning version are that:

  • it involves the network parameters θ and θ⁻,
  • Q(s, a; θ) is the value predicted by the online network,
  • Q(s′, a′; θ⁻) is the value predicted by the target network, used for stability.

In many papers, the TD Error for DQN is denoted as:

\[ \delta = r + \gamma \cdot \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta) \]

The steps of the TD update in DQN

  1. The agent predicts how good an action is – the value Q(s, a; θ) – using the online network.
  2. After executing the action, the agent receives a reward r and observes the next state s′.
  3. It builds the TD target using the target network:
    • y = r + γ · max Q(s′, a′; θ⁻),
    • this is what the agent should have gotten, according to the target network.
  4. It computes the TD error as the difference between what happened (target) and what was expected (prediction):
    • δ = y − Q(s, a; θ).
  5. It updates the online network’s parameters θ by minimizing the loss:
    • L(θ) = (y − Q(s, a; θ))².
  6. Periodically, it copies the latest online parameters into the target network (θ⁻ ← θ). A PyTorch sketch of this update loop follows below.
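As a minimal PyTorch sketch of this update (the network sizes, batch shapes, and the Adam learning rate are assumptions for illustration; only the TD target, TD error, and backpropagation steps are the point):

```python
import torch
import torch.nn as nn

online_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))   # Q(s, a; θ)
target_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))   # Q(s, a; θ⁻)
target_net.load_state_dict(online_net.state_dict())                         # start synchronized
optimizer = torch.optim.Adam(online_net.parameters(), lr=1e-3)
gamma = 0.9

def dqn_update(s, a, r, s_next, done):
    """s: float (batch, 4), a: long (batch, 1), r and done: float (batch, 1)."""
    q_pred = online_net(s).gather(1, a)                      # step 1: Q(s, a; θ)
    with torch.no_grad():                                    # the target is not differentiated
        q_next = target_net(s_next).max(dim=1, keepdim=True).values
        y = r + gamma * (1.0 - done) * q_next                # step 3: TD target y
    loss = nn.functional.mse_loss(q_pred, y)                 # steps 4-5: (y - Q)^2
    optimizer.zero_grad()
    loss.backward()                                          # TD error flows back into θ
    optimizer.step()
    return loss.item()

# Step 6, performed periodically:
# target_net.load_state_dict(online_net.state_dict())
```

The (1 − done) factor plays the same role as the terminal case in the tabular example: when the episode ends, the target is just the reward r.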

EXAMPLE 2: TD process in DQN

Example setup

In this example, we show how the Q-values change over time when a DQN is used instead of a Q-table.

We use the same minimal environment to visualize how the reward propagates backward through the target network mechanism.

  • States: S₀, S₁, S₂
  • Actions: Right
  • Rewards:
    • From S₀ → S₁: r = 0
    • From S₁ → S₂: r = 0
    • From S₂ → Terminal: r = +1
  • Discount factor: γ = 0.9
  • Networks:
    • Online network → Q(s, a; θ)
    • Target network → Q(s, a; θ⁻)
  • Learning rate (conceptual): α = 0.5
  • Initial values: all Q = 0
  • TD target: yₜ = rₜ + γ · max Q(sₜ₊₁, a′; θ⁻)
  • TD error: δₜ = yₜ − Q(sₜ, aₜ; θ)
  • Update (online network): θ ← θ + α · δₜ · ∇θ Q(sₜ, aₜ; θ) (see the note after this list)
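A quick note on the “conceptual” learning rate and the update rule above (a standard observation, not something specific to this example): if the network behaved exactly like a lookup table, with one parameter per (state, action) entry, then ∇θ Q(sₜ, aₜ; θ) would equal 1 for that entry, and the semi-gradient update would reduce to the tabular rule:

\[ \theta \leftarrow \theta + \alpha \, \delta_t \, \nabla_{\theta} Q(s_t, a_t; \theta) \;\;\Longrightarrow\;\; Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \, \delta_t \]

That is why the iterations below produce the same numbers as Example 1 (0.5 and 0.225); a real network would only move its prediction approximately in that direction.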

With the iterations below, we’ll see how each time step (t → t + 1) transfers reward information backward through the TD error.


ITERATION 1: time t = 0 → 1

Transition: S0 → S1, r₁ = 0

  • TD target = 0 + 0.9 × max Q(S₁, a′; θ⁻) = 0
  • TD error = 0 − Q(S₀, Right; θ) = 0 − 0 = 0
  • Update = no change (δ = 0)

At t = 1, we don’t have any update. The future reward is still unknown.


ITERATION 2: time t = 1 → 2

Transition: S1 → S2, r₂ = 0

  • TD target = 0 + 0.9 × max Q(S₂, a′; θ⁻) = 0
  • TD error = 0 − Q(S₁, Right; θ) = 0 − 0 = 0
  • Update = no change (δ = 0)

Still no learning. The final reward has not yet been reached.


ITERATION 3: time t = 2 → 3

Transition: S2 → Terminal, r₃ = +1

  • TD target = 1 + 0.9 × 0 = 1
  • TD error = 1 − Q(S₂, Right; θ) = 1 − 0 = 1
  • Update = Q(S₂, Right; θ) increases (conceptually ≈ 0.5 after one step)

At the end of the episode, the online network has learned that S₂ leads to a positive reward. Now the hard target update is performed: θ⁻ ← θ.


ITERATION 4: time t = 0 → 1 (new episode begins)

Transition: S0 → S1, r₁ = 0

  • TD target = 0 + 0.9 × max Q(S₁, a′; θ⁻) = 0
  • TD error = 0 − Q(S₀, Right; θ) = 0 − 0 = 0
  • Update = no change

At this step, the value of S₀ does not change, because S₁ still has no learned target value.


ITERATION 5: time t = 1 → 2 (still in the same new episode)

Transition: S1 → S2, r₂ = 0

  • TD target = 0 + 0.9 × max Q(S₂, a′; θ⁻) = 0.9 × 0.5 = 0.45
  • TD error = 0.45 − Q(S₁, Right; θ) = 0.45 − 0 = 0.45
  • Update = Q(S₁, Right; θ) increases (conceptually ≈ 0.225 after one update)

Now the reward information from S₂ has propagated backward to S₁ through the TD target computed with the synchronized target network.


CONCLUSION – EXAMPLE 2
| EPISODE | TRANSITION | LOCAL TIME | TD ERROR | OBSERVATION |
|---|---|---|---|---|
| 1 | S0 → S1 | 0 → 1 | 0 | No reward |
| 1 | S1 → S2 | 1 → 2 | 0 | Still no reward |
| 1 | S2 → END | 2 → 3 | +1 | DQN learns that S₂ is valuable (then the target network is updated) |
| 2 | S0 → S1 | 0 → 1 | 0 | No update yet |
| 2 | S1 → S2 | 1 → 2 | +0.45 | Reward from S₂ propagates backward to S₁ through the TD target |

Wrapping Up

We can conclude that the fundamental principle of TD is that the algorithm does not wait for the end of the episode to learn. It uses estimates of the future value to update the current predictions (“bootstrapping”).

Whether we are talking about Q-Learning or DQN, the learning mechanism has the same structure:

prediction → action + observation → target calculation → error (TD error) → internal update

The main difference between Q-Learning (table) and DQN (neural network) lies in the representation medium (Q-table vs. network) and the implementation of the update (direct vs. loss minimization through backpropagation).

The importance of the target network (in DQN) is that it ensures stability by decoupling the current prediction from the target that is updated periodically. This is a mechanism that allows the propagation of rewards back in time in a neural network framework.

Advantages and limitations

  • TD provides online updates,
  • it can work in continuing environments, without waiting for the end of the episode,
  • but it may be more sensitive to the parameters (α, γ),
  • and with unstable function approximation (e.g. neural networks), convergence problems may arise.

In conclusion, the actual learning takes place exactly at the moment when the TD error is calculated and used for updating: this is the “moment” in which the agent “learns” from its environment.

Tags: Q-Function, Temporal Difference

