
Step-by-Step Tutorial: Q-Learning Example with CartPole

by Dragos Calin
in Q-Learning, RL Fundamentals

By the end of this tutorial, you will understand how the Q-values are updated in Q-Learning for the CartPole task. It is written for readers who want to understand Q-Learning without complicated code.

All calculations are done numerically, step by step. So you can work manually without complex simulations.

The goal is to learn in depth how Q-Learning works: how states, actions, and rewards are linked to form learning experiences. We will also explore the role of each key parameter and see how it affects the final result.

Table of Contents

  • What is CartPole?
  • Why do we use Q-Learning to control CartPole?
  • Q-Learning Algorithm Explained with Manual Steps on Simplified CartPole
  • What did the agent “learn”?

What is CartPole?

CartPole is a small cart that moves in a straight line. A thin pole is attached to the cart. The pole is affected by gravity and falls. Our role is to teach an RL agent to move the cart so that the pole stays straight up.

For the pole to stay vertical, the cart must move left or right. The CartPole environment provides continuous observations (cart position, cart velocity, pole angle, pole angular velocity). For manual calculation, these observations have been discretized into 4 states (S0-S3).

The 4 states are a rough but useful approximation for this demo; in practice, binning or tile coding is used to handle continuous states.

Why do we use Q-Learning to control CartPole?

Q-Learning is the reinforcement learning equivalent of building habits. It is a basic, model-free algorithm that learns the values of Q(s, a) directly through temporal-difference (TD) updates. It does not require an explicit model of the environment, which makes it ideal for tutorials and prototyping on CartPole.

The agent doesn’t need to understand the full environment; it needs to be consistent. With each step, its decisions get a little better. Every time the agent takes an action, it receives feedback and updates its internal memory – the Q-table. It’s like a person who adjusts their habits after each small success or failure.

The ideas above highlight how Q-Learning turns a seemingly simple problem (balancing a pole) into an exercise in autonomous learning.

Q-Learning Algorithm Explained with Manual Steps on Simplified CartPole

In this chapter, we’ll go step by step through a simplified CartPole example. The Q-values are changed and you’ll see exactly how.

Problem setup (simplified version)

Possible states:

Code   Simplified description
S0     the pole is almost vertical ( | )
S1     the pole is slightly tilted to the left ( \ )
S2     the pole is slightly tilted to the right ( / )
S3     the pole has fallen ( ___ ) -> episode finished

Initial pole position

At the beginning of the training, we start from state S0 (the pole is almost vertical).


Info: These states are simplified. We’re using just four categories to make Q-learning super easy to understand.

In the standard Gymnasium setup, the state is continuous. It contains four key numbers:

  • the cart’s position,
  • its speed,
  • the pole’s angle,
  • and how quickly that angle’s changing.

To use Q-learning in these continuous spaces, we have to discretize the values, slicing everything into neat, finite chunks.
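
As an illustration, here is a minimal sketch of how such a discretization might look. The helper name `discretize`, the thresholds, and the sign convention are illustrative assumptions, not part of the original example:

```python
def discretize(pole_angle: float) -> int:
    """Map a continuous pole angle (radians) to one of the states S0-S3.

    Sketch only: the real CartPole observation has four components
    (position, velocity, angle, angular velocity), and a practical
    discretization would bin all of them. Thresholds and the sign
    convention here are illustrative assumptions.
    """
    if abs(pole_angle) > 0.2094:   # ~12 degrees: episode ends, pole "fallen"
        return 3                   # S3
    if pole_angle < -0.05:
        return 1                   # S1: slightly tilted to the left
    if pole_angle > 0.05:
        return 2                   # S2: slightly tilted to the right
    return 0                       # S0: almost vertical
```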


Exploration policy ε: 0

Info: It balances exploration and exploitation. It chooses between trying random or new actions to discover the environment and selecting the action with the highest estimated value from the Q-table.

With ε=0, we always choose the action with max Q(s, a) from the current state.

  • High exploration → the agent learns more about the environment. In this case, it may take longer to converge to an optimal policy.
  • Low exploration → the agent quickly exploits known good actions. It risks getting stuck in suboptimal behaviors.

We chose manual selection (equivalent to 0% exploration) for clarity. Omitting exploration lets us demonstrate the Q-learning updates step by step without randomness interfering with the calculations.

In practice, a common approach is epsilon-greedy with ε starting at 1.0 and decaying to 0.01.
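
For reference, here is a minimal sketch of epsilon-greedy selection with decay; the function name `epsilon_greedy` and the decay schedule are illustrative assumptions, not values used in this tutorial:

```python
import random
import numpy as np

def epsilon_greedy(q_table: np.ndarray, state: int, epsilon: float) -> int:
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(q_table.shape[1])   # explore
    return int(np.argmax(q_table[state]))           # exploit

# Typical decay schedule (illustrative): start fully random, end almost greedy.
epsilon, epsilon_min, epsilon_decay = 1.0, 0.01, 0.995
for episode in range(1000):
    # ... run one episode, selecting actions with epsilon_greedy(...) ...
    epsilon = max(epsilon_min, epsilon * epsilon_decay)
```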


Possible actions:

  • A0 = Move Left (<—)
  • A1 = Move Right (—>)

Info: In the real CartPole application in Gymnasium, the action space is discrete and binary: there are only these two possible actions, with no additional options in the standard version.


Rewards:

  • +1 if the pole is kept vertical (S0, S1, S2)
  • -1 if the pole has fallen (S3)

Info: In the real CartPole application in Gymnasium (or other frameworks), the reward is given automatically by the environment.
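
To make this concrete, here is a minimal sketch of how Gymnasium exposes the binary action space and the automatic reward. Note that CartPole-v1 gives +1 per surviving step; the -1 for falling is a convention of this tutorial, not of the environment:

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
print(env.action_space)        # Discrete(2): 0 = push left, 1 = push right

obs, info = env.reset(seed=0)
obs, reward, terminated, truncated, info = env.step(1)  # push right
print(reward)                  # 1.0 as long as the episode continues
env.close()
```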


Learning rate α: 0.5

Info: It controls how much new information influences the old value in the Q-table.

  • Large α → the agent adapts quickly to new experiences, but can be unstable.
  • Small α → learning is slow, but stable.

We chose α = 0.5 for clarity (meaning: the update keeps 50% of the old estimate and moves 50% of the way toward the new target). In practice, common values are between 0.1 and 0.5.
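
For example, if the old estimate is 0 and the new target is 1, the update with α = 0.5 gives

\[ Q_{\text{new}} = 0 + 0.5 \times (1 - 0) = 0.5, \]

which is exactly the value we will obtain in step 1 below.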


Discount factor γ: 0.9

Info: It controls how important future rewards are compared to immediate ones.

  • γ close to 0 → the agent is “myopic”, only pursuing the immediate reward.
  • γ close to 1 → the agent is “visionary”, taking into account long-term effects.

We chose γ = 0.9 (a classical value) to show that the agent also takes into account future steps, not just the current reward.
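
For instance, with γ = 0.9 a reward of +1 received three steps in the future contributes

\[ 0.9^3 \times 1 = 0.729 \]

to today's value, while with γ = 0 it would contribute nothing.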


Q-Learning Formula

\[ Q(s, a) \leftarrow Q(s, a) + \alpha \cdot \left[ r + \gamma \cdot \max_{a'} Q(s', a') - Q(s, a) \right] \]

Where:

  • Q(s,a): current Q value for the state-action pair
  • α: learning rate
  • γ: discount factor
  • r: reward
  • s′: next state
  • a′: next action
  • max_{a′} Q(s′, a′): the maximum estimated value for the next state, according to the “greedy” principle of Q-Learning

Above is the fundamental equation of the Q-Learning algorithm. It describes how the Q-values are updated based on the agent’s experiences in the environment.

Below is an annotated image of the same equation. The goal is to clarify what each Q in the equation represents, in order to better understand how the update works.

Q-learning equation (annotated)
  1. The first Q(s,a) → the new value, the one that will be saved in the Q-table after the update.
  2. The second Q(s,a) → the old value, the one already existing in the Q-table before the update.
  3. The third Q(s,a) → the same old value, but used in the calculation of the TD (Temporal Difference) error.
  4. The fourth term, max Q(s′, a′) → we take all possible actions from the next state s′, see which one has the highest value, and use that value in the equation.

Q-table initialization

Q-table initialization also applies to the real application. Q-Learning uses a Q-table to store the estimated values of Q(s,a). In both the real application and this example, we initialize the Q-table with zeros.

State   A0 (Move Left)   A1 (Move Right)
S0      0                0
S1      0                0
S2      0                0
S3      0                0

The 10-step sequence

Up to this point, we’ve explored what Q-Learning is and the setup of the application. Now it’s time to see how it learns.

We’ll manually compute 10 steps and watch how the Q-values evolve.

The correct order of events in a Q-learning episode

  • 1 -> current state s
  • 2 -> current action a
  • 3 -> apply the action (environment responds)
  • 4 -> receive reward r
  • 5 -> observe next state s′
  • 6 -> calculate Q(s,a)
  • 7 -> choose next action a′ (for the next step)

Constant learning parameters (applied in all steps)

  • Learning rate α: 0.5 (remains fixed to balance adaptability and stability)
  • Discount factor γ: 0.9 (values future rewards consistently throughout training)
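
Before walking through the steps by hand, here is a minimal sketch of the same update rule in Python with NumPy. The function name `q_update` and the table layout are illustrative choices, not code from the original tutorial:

```python
import numpy as np

ALPHA, GAMMA = 0.5, 0.9          # learning rate and discount factor used in this tutorial

def q_update(q_table: np.ndarray, s: int, a: int, r: float, s_next: int) -> float:
    """Apply one Q-learning update and return the new Q(s, a)."""
    td_target = r + GAMMA * np.max(q_table[s_next])       # r + γ · max_a' Q(s', a')
    q_table[s, a] += ALPHA * (td_target - q_table[s, a])  # old value + α · TD error
    return q_table[s, a]

# Reproducing step 1: s = S0, a = A1 (Move Right), r = +1, s' = S1
q = np.zeros((4, 2))             # rows: S0-S3, columns: A0, A1
print(q_update(q, 0, 1, 1.0, 1)) # 0.5, matching the manual calculation below
```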
STEP 1

Experience tuple:

  • current state s: S0 -> we start from state S0 (the pole is almost vertical).
  • current action a: A1 (Move Right) -> we choose to move to the right. The cart shifts a little to the right, and the pole tilts slightly in the opposite direction: it becomes slightly tilted to the left ( \ ).
  • reward r: +1 -> because the pole hasn’t fallen, the reward is positive.
  • next state s′ = S1 -> because the pole is slightly tilted to the left, the next state is S1.
  • old Q(S0, A1) = 0 -> from Q-table initialization
  • Q(S1, A0) = 0 -> from Q-table initialization
  • Q(S1, A1) = 0 -> from Q-table initialization
Step 1: current state S0, next state S1, action A1

We calculate the value Q(S0, A1) for step 1

\[
\begin{aligned}
Q(S_0, A_1) &= 0 + 0.5 \times \big[ 1 + 0.9 \times \max(Q(S_1, A_0), Q(S_1, A_1)) - 0 \big] \\
&= 0.5 \times [1 + 0.9 \times 0] = 0.5 \times 1 = 0.5
\end{aligned}
\]

The Q-table after step 1

State   A0 (Move Left)   A1 (Move Right)
S0      0                0.500
S1      0                0
S2      0                0
S3      0                0

STEP 2

Choose next action for the second step: a’ = A0

Experience tuple:

  • current state s: S1 -> the cart was moved to the right in the previous step
  • current action a: A0 (Move Left) -> we choose to move to the left. The cart shifts a little to the left, and the pole becomes slightly tilted to the right ( / ).
  • reward r: +1 -> because the pole hasn’t fallen, the reward is positive.
  • next state s′ = S0 -> the pole is again almost vertical
  • old Q(S1, A0) = 0 -> from Q-table step 1
  • Q(S0, A0) = 0.000 -> from Q-table step 1
  • Q(S0, A1) = 0.500 -> from Q-table step 1
Step 2: current state S1, next state S0, action A0

We calculate the value Q(S1, A0) for step 2

\[
\begin{aligned}
Q(S_1, A_0) &= 0 + 0.5 \times \big[ 1 + 0.9 \times \max(Q(S_0, A_0), Q(S_0, A_1)) - 0 \big] \\
&= 0.5 \times [1 + 0.9 \times 0.500] = 0.5 \times 1.45 = 0.725
\end{aligned}
\]

The Q-table after step 2

State   A0 (Move Left)   A1 (Move Right)
S0      0                0.500
S1      0.725            0
S2      0                0
S3      0                0

STEP 3

Choose next action for the third step: a’ = A1

Experience tuple:

  • current state s: S0
  • current action a: A1 (Move Right) -> the pole tilts too far to the left and falls
  • reward r: -1 -> because the pole has fallen, the reward is negative
  • next state s′ = S3 (episode finished)
  • old Q(S0, A1) = 0.500 -> from Q-table step 2
  • Q(S3, A0) = 0.000 -> from Q-table step 2
  • Q(S3, A1) = 0.000 -> from Q-table step 2
Step 3: current state S0, next state S3, action A1

We calculate the value Q(S0, A1) for step 3

\[
\begin{aligned}
Q(S_0, A_1) &= 0.500 + 0.5 \times \big[ -1 + 0.9 \times \max(Q(S_3, A_0), Q(S_3, A_1)) - 0.500 \big] \\
&= 0.500 + 0.5 \times \big[ -1 - 0.500 \big] = 0.500 + 0.5 \times (-1.5) = -0.250
\end{aligned}
\]

The Q-table after step 3

State   A0 (Move Left)   A1 (Move Right)
S0      0                -0.250
S1      0.725            0
S2      0                0
S3      0                0

STEP 4

The episode has finished at step 3. We start a new episode.

Experience tuple:

  • current state s: S0
  • current action a: A1 (Move Right) -> the pole becomes slightly tilted to the left
  • reward r: +1 -> because the pole hasn’t fallen, the reward is positive.
  • next state s′ = S1
  • old Q(S0, A1) = -0.250 -> from Q-table step 3
  • Q(S1, A0) = 0.725 -> from Q-table step 3
  • Q(S1, A1) = 0.000 -> from Q-table step 3
Step 4: current state S0, next state S1, action A1

We calculate the value Q(S0, A1) for step 4

\[
\begin{aligned}
Q(S_0, A_1) &= -0.250 + 0.5 \times \big[ 1 + 0.9 \times \max(Q(S_1, A_0), Q(S_1, A_1)) - (-0.250) \big] \\
&= -0.250 + 0.5 \times (1 + 0.6525 + 0.250) = -0.250 + 0.95125 = 0.70125
\end{aligned}
\]

The Q-table after step 4

State   A0 (Move Left)   A1 (Move Right)
S0      0                0.701
S1      0.725            0
S2      0                0
S3      0                0

STEP 5

Choose next action for the fifth step: a’ = A0

Experience tuple:

  • current state s: S1
  • current action a: A0 (Move Left)
  • reward r: +1
  • next state s′ = S0
  • old Q(S1, A0) = 0.725 -> from Q-table step 4
  • Q(S0, A0) = 0.000 -> from Q-table step 4
  • Q(S0, A1) = 0.701 -> from Q-table step 4
Step 5: current state S1, next state S0, action A0

We calculate the value Q(S1, A0) for step 5

\[
\begin{aligned}
Q(S_1, A_0) &= 0.725 + 0.5 \times \big[ 1 + 0.9 \times \max(Q(S_0, A_0), Q(S_0, A_1)) - 0.725 \big] \\
&= 0.725 + 0.5 \times (1 + 0.631125 - 0.725) = 0.725 + 0.4530625 = 1.1780625
\end{aligned}
\]

The Q-table after step 5

State   A0 (Move Left)   A1 (Move Right)
S0      0                0.701
S1      1.178            0
S2      0                0
S3      0                0

STEP 6

Choose next action for the sixth step: a’ = A0

Experience tuple:

  • current state s: S0
  • current action a: A0 (Move Left)
  • reward r: +1
  • next state s′ = S2
  • old Q(S0, A0) = 0.000 -> from Q-table step 5
  • Q(S2, A0) = 0.000 -> from Q-table step 5
  • Q(S2, A1) = 0.000 -> from Q-table step 5
Step 6: current state S0, next state S2, action A0

We calculate the value Q(S0, A0) for step 6

\[
\begin{aligned}
Q(S_0, A_0) &= 0.000 + 0.5 \times \big[ 1 + 0.9 \times \max(Q(S_2, A_0), Q(S_2, A_1)) - 0.000 \big] \\
&= 0.000 + 0.5 \times (1 + 0 - 0.000) = 0.5 \times 1 = 0.500
\end{aligned}
\]

The Q-table after step 6

State   A0 (Move Left)   A1 (Move Right)
S0      0.500            0.701
S1      1.178            0
S2      0                0
S3      0                0

STEP 7

Choose next action for the seventh step: a’ = A1

Experience tuple:

  • current state s: S2
  • current action a: A1 (Move Right)
  • reward r: +1
  • next state s′ = S0
  • old Q(S2, A1) = 0.000 -> from Q-table step 6
  • Q(S0, A0) = 0.500 -> from Q-table step 6
  • Q(S0, A1) = 0.701 -> from Q-table step 6
Step 7: current state S2, next state S0, action A1

We calculate the value Q(S2, A1) for step 7

\[
\begin{aligned}
Q(S_2, A_1) &= 0.000 + 0.5 \times \big[ 1 + 0.9 \times \max(Q(S_0, A_0), Q(S_0, A_1)) - 0.000 \big] \\
&= 0.000 + 0.5 \times (1 + 0.631 - 0.000) = 0.5 \times 1.631 = 0.8155
\end{aligned}
\]

The Q-table after step 7

State   A0 (Move Left)   A1 (Move Right)
S0      0.500            0.701
S1      1.178            0
S2      0                0.815
S3      0                0

STEP 8

Choose next action for the eighth step: a’ = A1

Experience tuple:

  • current state s: S0
  • current action a: A1 (Move Right)
  • reward r: +1
  • next state s′ = S1
  • old Q(S0, A1) = 0.701 -> from Q-table step 7
  • Q(S1, A0) = 1.178 -> from Q-table step 7
  • Q(S1, A1) = 0.000 -> from Q-table step 7
Step 8: current state S0, next state S1, action A1

We calculate the value Q(S0, A1) for step 8

\[
\begin{aligned}
Q(S_0, A_1) &= 0.701 + 0.5 \times \big[ 1 + 0.9 \times \max(Q(S_1, A_0), Q(S_1, A_1)) - 0.701 \big] \\
&= 0.701 + 0.5 \times (1 + 1.060 - 0.701) = 0.701 + 0.6795 = 1.3805
\end{aligned}
\]

The Q-table after step 8

State   A0 (Move Left)   A1 (Move Right)
S0      0.500            1.380
S1      1.178            0
S2      0                0.815
S3      0                0

STEP 9

Choose next action for the ninth step: a’ = A0

Experience tuple:

  • current state s: S1
  • current action a: A0 (Move Left)
  • reward r: +1
  • next state s′ = S0
  • old Q(S1, A0) = 1.178 -> from Q-table step 8
  • Q(S0, A0) = 0.500 -> from Q-table step 8
  • Q(S0, A1) = 1.380 -> from Q-table step 8
Step 9: current state S1, next state S0, action A0

We calculate the value Q(S1, A0) for step 9

\[
\begin{aligned}
Q(S_1, A_0) &= 1.178 + 0.5 \times \big[ 1 + 0.9 \times \max(Q(S_0, A_0), Q(S_0, A_1)) - 1.178 \big] \\
&= 1.178 + 0.5 \times (1 + 1.242 - 1.178) = 1.178 + 0.532 = 1.710
\end{aligned}
\]

The Q-table after step 9

State   A0 (Move Left)   A1 (Move Right)
S0      0.500            1.380
S1      1.710            0
S2      0                0.815
S3      0                0

STEP 10

Choose next action for the tenth step: a’ = A0

Experience tuple:

  • current state s: S0
  • current action a: A0 (Move Left)
  • reward r: +1
  • next state s′ = S2
  • old Q(S0, A0) = 0.500 -> from Q-table step 9
  • Q(S2, A0) = 0.000 -> from Q-table step 9
  • Q(S2, A1) = 0.815 -> from Q-table step 9
Step 10: current state S0, next state S2, action A0

We calculate the value Q(S0, A0) for step 10

\[
\begin{aligned}
Q(S_0, A_0) &= 0.500 + 0.5 \times \big[ 1 + 0.9 \times \max(Q(S_2, A_0), Q(S_2, A_1)) - 0.500 \big] \\
&= 0.500 + 0.5 \times (1 + 0.7335 - 0.500) = 0.500 + 0.61675 = 1.11675
\end{aligned}
\]

The Q-table after step 10

State   A0 (Move Left)   A1 (Move Right)
S0      1.116            1.380
S1      1.710            0
S2      0                0.815
S3      0                0
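
For readers who prefer to check the arithmetic in code, here is a minimal sketch that replays the same 10 transitions with α = 0.5 and γ = 0.9. The transition list is copied from the steps above; variable names are illustrative. The result matches the table up to small differences, because the manual walkthrough rounds intermediate values to three decimals while the code keeps full precision:

```python
import numpy as np

ALPHA, GAMMA = 0.5, 0.9
q = np.zeros((4, 2))   # rows: S0-S3, columns: A0 (Move Left), A1 (Move Right)

# (state, action, reward, next_state) for steps 1-10, as in the manual walkthrough
transitions = [
    (0, 1, +1, 1), (1, 0, +1, 0), (0, 1, -1, 3),   # steps 1-3 (episode ends at step 3)
    (0, 1, +1, 1), (1, 0, +1, 0), (0, 0, +1, 2),   # steps 4-6
    (2, 1, +1, 0), (0, 1, +1, 1), (1, 0, +1, 0),   # steps 7-9
    (0, 0, +1, 2),                                  # step 10
]

for s, a, r, s_next in transitions:
    q[s, a] += ALPHA * (r + GAMMA * np.max(q[s_next]) - q[s, a])

print(np.round(q, 3))   # close to the Q-table after step 10, up to rounding
```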

What did the agent “learn”?

At the end of step 10, we have the final table of this example.
Each value in the table is an estimate of the “quality” of an action in a state.

Specifically, the value in the table tells you how good it is to choose action a when you are in state s. The higher Q(s, a) is, the better the expected outcome of that state-action pair.

The agent has learned a reasonable strategy (see the short sketch after this list):

  • If the pole falls to the left (S1) → push left (A0)
  • If it falls to the right (S2) → push right (A1)
  • If it is almost vertical (S0) → push right (A1), the action with the higher Q-value, alternating with A0 as the estimates evolve
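
This strategy is simply the greedy policy read off the final Q-table. A minimal sketch of extracting it, using the step-10 values from the table above:

```python
import numpy as np

# Final Q-table after step 10 (rows S0-S3, columns A0 = Move Left, A1 = Move Right)
q = np.array([
    [1.116, 1.380],
    [1.710, 0.000],
    [0.000, 0.815],
    [0.000, 0.000],
])

greedy_policy = np.argmax(q, axis=1)   # best action index per state
print(greedy_policy)                   # [1 0 1 0] -> A1 in S0, A0 in S1, A1 in S2 (S3 is terminal)
```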

With each episode, the positive Q-values have increased. This is a sign that the agent is consistently receiving rewards and no longer letting the pole fall.

If the agent were now set to control the pole:

  • It would “oscillate” between S0 ↔ S1 ↔ S2, keeping the pole vertical.
  • It would never reach S3 (fall).
  • It would continue to strengthen the dominant Qs until convergence (stable values).

It should also be mentioned that the agent does not learn rules; it learns values. It does not learn what to do, but what is worth doing.

Tags: Q-Function

About the author

About Dragos Calin

Dragos Calin is a robotics engineer and reinforcement learning practitioner focused on building real-world autonomous and remote-controlled robots for agriculture, edge-AI robotics, and embedded platforms. His work combines simulation, machine learning, and hardware deployment, with a strong emphasis on practical, testable solutions that work outside the lab.

Areas of Expertise:

  • # Reinforcement Learning for Robotics
  • # Autonomous Agricultural Robots
  • # Embedded Systems & Edge AI (Jetson, Raspberry Pi, Arduino)
  • # Robotic Simulation & Sim2Real Workflow
  • # Sensor Fusion & Control Systems
  • # ROS-Based Robotics Development
