AI Robotics: Tutorials, Practical Reinforcement Learning, and Real-World Control

From MDP to POMDP: Why Reinforcement Learning Breaks in Practice

by Dragos Calin
in DQN, OpenAI Gymnasium, RL Fundamentals

In this tutorial, you will learn how Reinforcement Learning really works in practice, not just in theory.

By the end, you will understand:

  • What a Markov Decision Process (MDP) really is. Not as a formula, but as the set of rules that make learning possible.
  • Why Reinforcement Learning does not start with choosing an algorithm. And why starting with PPO, SAC, or DQN is one of the most common mistakes.
  • How to correctly describe a real problem as an MDP. Including states, actions, rewards, time horizon, and termination.
  • Why most real-world problems are not true MDPs, but POMDPs. And what partial observability actually means in practice.
  • How missing information, noise, and latency break the Markov assumption. With concrete examples and visual learning curves.
  • Why RL works perfectly in simulations but struggles in the real world. And why this is an information problem, not an algorithm problem.
  • How to deliberately transform an MDP into a POMDP. By hiding variables, adding noise, and introducing delays.
  • How these changes affect learning curves and agent behavior. With side-by-side comparisons and real training results.
  • Why more training does not fix partial observability. And when learning fundamentally plateaus.
  • How practitioners “repair” POMDPs in real systems. Using frame stacking, memory, and enriched observations.

TABLE OF CONTENTS

  1. Why Reinforcement Learning Starts with an MDP
  2. The MDP Formal Framework: The Skeleton of Any RL Problem
  3. Episodic vs. Continuous Tasks + Time Horizon
  4. MDP vs POMDP: What happens when you don’t see the whole state?
  5. How to map a real problem to a POMDP (example: CartPole)
  6. Conclusion

1. Why Reinforcement Learning Starts with an MDP

no rules → no learning. rules defined → learning possible.

If we want to teach a child to play a game, before telling them how to play better, we need to explain:

  • where they are,
  • what moves they are allowed to make,
  • what happens after each move,
  • when they win and when they lose.

The Markov Decision Process (MDP) is a sheet of rules for the game.

The goal of this chapter is to teach you how to write the rules of the environment correctly, before you choose an algorithm that will train the AI agent.

If the rules are poorly written, the AI agent cannot learn, no matter how smart the algorithm is.

After this chapter, the reader understands:

  • that RL does not start with the algorithm,
  • that RL starts with the correct definition of the problem,
  • that MDP is the basis of all RL algorithms.

1.1 RL as a decision problem, not an algorithm problem

We want to apply RL to train an AI agent to solve a certain problem. Suppose that the agent does not know:

  • what it is allowed to do,
  • what it means to have solved the problem.

Even if there were a way to tell the RL agent what the best strategy is to solve the problem, it would not help, because the agent does not know what problem to solve.

In Reinforcement Learning, algorithms (PPO, SAC, DQN, etc.) are just learning strategies. But in RL, we do not start with the strategy.

In RL we start with:

  • What problem is this?
  • What decisions need to be made?
  • What does a good or bad decision mean?

RL is, first and foremost, a problem of making decisions over time. It is not a problem of choosing an algorithm.

If we choose the algorithm before defining the problem, it is like buying a chess manual when we want to play football.


1.2 The difference between “choosing an algorithm” and “defining a problem”

Defining a problem means that we need to know:

  • the state the agent is in,
  • what options it has,
  • what happens after each choice,
  • what is considered “good” and “bad.”

Choosing the algorithm means determining how the agent learns to choose better within these rules.

If the problem is not clearly defined:

  • the algorithm does not know what to optimize,
  • it does not know which decision is better,
  • it does not know what success means.

Therefore, the problem must be defined before choosing the algorithm, not the other way around.


1.3 What do the “rules of the game” mean in an RL problem

In any RL application there are always three things:

1.3.1 The agent

It is the “player”. The agent can be:

  • a robot (physical, interacting with the real world),
  • a software program (like a game bot),
  • or even a neural network model (learning complex patterns, like in AlphaGo).

The agent is the one who is taught to make decisions.

1.3.2 Environment

It is the “game world.” It can be:

  • a room,
  • a map,
  • a simulator,
  • the real world.

The environment reacts to the agent’s decisions.

1.3.3 Interaction over time

At each step:

  1. the agent is in a situation,
  2. the agent makes a choice,
  3. the environment changes,
  4. the agent sees the result,
  5. the process repeats.

This is a story that happens step by step, not a single decision.

Reinforcement Learning is not about a single correct answer, but about a series of decisions that influence the future.
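
The five steps of the interaction can be sketched as a short loop. This is only an illustration under simple assumptions: `CounterEnv`, its `reset` and `step` methods, and the random "agent" are hypothetical names invented for this example, not part of any RL library.

```python
import random


class CounterEnv:
    """Hypothetical toy environment: walk on a number line until reaching +3 or -3."""

    def reset(self):
        self.position = 0
        return self.position  # the situation the agent starts in

    def step(self, action):
        # action is -1 (move left) or +1 (move right)
        self.position += action
        done = self.position in (3, -3)
        reward = 1.0 if self.position == 3 else 0.0
        return self.position, reward, done


def run_episode(env, rng, max_steps=100):
    """Run one episode with a random agent and return the visited positions."""
    observation = env.reset()                          # 1. the agent is in a situation
    trajectory = [observation]
    for _ in range(max_steps):
        action = rng.choice([-1, 1])                   # 2. the agent makes a choice
        observation, reward, done = env.step(action)   # 3. the environment changes
        trajectory.append(observation)                 # 4. the agent sees the result
        if done:
            break                                      # 5. otherwise the process repeats
    return trajectory


trajectory = run_episode(CounterEnv(), random.Random(0))
print(trajectory)
```

Notice that no single step decides the outcome: the final position is the result of the whole sequence of choices.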


1.4 Why an intelligent agent cannot compensate for a poorly defined problem

Suppose the agent is trained for a specific problem, but:

  • sometimes it solves the problem,
  • sometimes it doesn’t,
  • and no one knows exactly why.

The agent may be intelligent and well-trained, but:

  • it cannot learn a pattern,
  • it cannot predict what comes next,
  • it cannot get better.

In RL:

  • if the state is poorly defined,
  • if the reward is ambiguous,
  • if the rules are inconsistent,

the agent cannot learn, even if the algorithm is very advanced. When an AI agent fails, most of the time the algorithm is not the problem. The problem is how the “game” (the problem) was defined.


1.5 What MDP promises and what it does not promise

MDP promises that:

  • the problem can be described clearly,
  • decisions can be modeled logically,
  • the agent has a framework in which it can learn.

MDP says that this problem has rules that are clear enough for an agent to try to learn.

What MDP does not promise:

  • MDP is not a solution. It does not tell you what the best decision is.
  • MDP is not an algorithm. It does not learn anything on its own.
  • MDP does not guarantee success. It only says that the problem is well-formulated, not that it is easy to solve.

In other words, MDP is the “chessboard” + the rules. MDP is not the winning strategy.


1.6 The correct order in RL engineering

In any RL project, the correct order is:

1. Define the problem: The definition assumes that we know what we want to happen.

2. Build the MDP: At this step we establish what the rules of the game are.

3. Define the reward: The definition assumes that we establish what “good” and “bad” mean.

4. Choose the algorithm: At this step we choose the algorithm depending on how and what we want the agent to learn.

This order is:

  • used in research,
  • used in robotics,
  • used in industry.

2. The MDP Formal Framework: The Skeleton of Any RL Problem

No matter what algorithm you use, there is an invisible structure that supports it

If in Chapter 1 we learned that we need rules, in this chapter we learn what correctly and completely written rules look like.

Think of the human body:

  • you see the skin,
  • you see the muscles move,
  • you don’t see the skeleton, but without it the body can’t stand.

MDP is the skeleton of any RL problem.

The goal of this chapter is to show you:

  • what pieces the MDP is made of (states, actions, rewards),
  • why all RL algorithms use the same basic rules,
  • how algorithms “think” through Bellman equations.

After this chapter, the reader will understand:

  • what RL actually optimizes,
  • why all algorithms solve the same problem, just in different ways,
  • how to read any RL algorithm and see the MDP behind it.

2.1 Formal definition of MDP (S, A, P, R, γ)

A problem has hidden parts. It is made up of “visible” elements, but it also has parts that are not visible.

In order to see a problem correctly and completely, we need to know exactly:

  • the state of the problem,
  • what the agent is allowed to do,
  • what happens after the agent does something,
  • whether it was a good or bad decision,
  • how much the agent should take into account what happens later.

An MDP is exactly this complete list of things, written down clearly.

  • S (State) – the state of the environment at this moment
  • A (Action) – what the agent can do
  • P (Transition) – what happens after the agent makes a certain decision
  • R (Reward) – how good or bad the action was
  • γ (Gamma) – how much the future matters

In other words, the MDP is like a “game manual“:

  • S = the positions on the board
  • A = the allowed moves
  • P = the rules of the move
  • R = the score
  • γ = whether you play only for the next move or for the end of the game
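
To make the five pieces concrete, here is a minimal sketch of an MDP written down as plain data. Everything in it is an assumption made for illustration: the `TinyMDP` class, the `is_well_formed` helper, and the two-state "battery robot" with its probabilities and rewards are invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TinyMDP:
    states: tuple        # S: every situation the environment can be in
    actions: tuple       # A: what the agent can do
    transitions: dict    # P: (state, action) -> {next_state: probability}
    rewards: dict        # R: (state, action) -> immediate reward
    gamma: float         # γ: how much the future matters


def is_well_formed(mdp):
    """Check that every (state, action) pair has a rule and probabilities sum to 1."""
    for s in mdp.states:
        for a in mdp.actions:
            if (s, a) not in mdp.transitions or (s, a) not in mdp.rewards:
                return False
            if abs(sum(mdp.transitions[(s, a)].values()) - 1.0) > 1e-9:
                return False
    return 0.0 <= mdp.gamma <= 1.0


# Hypothetical example: a robot that can work or recharge.
battery_robot = TinyMDP(
    states=("charged", "low"),
    actions=("work", "recharge"),
    transitions={
        ("charged", "work"): {"charged": 0.7, "low": 0.3},
        ("charged", "recharge"): {"charged": 1.0},
        ("low", "work"): {"low": 1.0},
        ("low", "recharge"): {"charged": 1.0},
    },
    rewards={
        ("charged", "work"): 1.0,
        ("charged", "recharge"): 0.0,
        ("low", "work"): -1.0,
        ("low", "recharge"): 0.0,
    },
    gamma=0.9,
)

print(is_well_formed(battery_robot))
```

If a (state, action) pair were missing from the "manual", the check would fail: the game would have a move with no rule attached to it.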

2.2 What does “Markov” mean in concrete terms

“Markov” is a complicated word for a very simple idea: if the agent knows exactly where it is in the environment, it no longer needs to know all the past.

For example, if you know the exact position of a car and its speed, it no longer matters where it was 10 seconds ago. You can make a decision about what to do next based on current information alone.

It’s like a game:

  • if you see the board exactly as it is now,
  • you don’t need to know every previous move,
  • you can decide the next move correctly.

A good state is like an intelligent summary of the past. If the summary is correct, the past is no longer needed.
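
For readers who want the same idea in symbols, one common way to write the Markov property is shown below. The notation is an assumption of this sketch: s_t is the state at step t, a_t is the action at step t, and P is the transition probability from the MDP definition.

```latex
P(s_{t+1} \mid s_t, a_t) \;=\; P(s_{t+1} \mid s_0, a_0, s_1, a_1, \dots, s_t, a_t)
```

Read from left to right: predicting the next state from the current state and action alone gives the same answer as predicting it from the entire history.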


2.3 Bellman as an idea, not a mathematical equation

The Bellman equation tells us that how good the agent’s current position in the environment is depends on:

  • what the agent gets right away,
  • and how good the position it ends up in afterward is.

In other words, if you make a good move now, but end up in a bad position, the move wasn’t that good.

It’s like saying:

  • “It’s not just your test score that matters today,
  • it also matters whether it helps you pass the class.”

This idea is the basis of RL. That is why RL is said to be recursive optimization that assumes that each decision is related to future decisions.
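
A tiny numeric sketch of this idea, under made-up numbers: the helper `backed_up_value` and the two example moves are hypothetical, and the sketch assumes a single known next position instead of a full probability distribution.

```python
def backed_up_value(immediate_reward, next_value, gamma):
    """Value of a move = what you get now + discounted value of where you land."""
    return immediate_reward + gamma * next_value


gamma = 0.9

# A tempting move: large reward now, but it leads to a bad position.
greedy_move = backed_up_value(immediate_reward=5.0, next_value=-10.0, gamma=gamma)

# A patient move: small reward now, but it leads to a good position.
patient_move = backed_up_value(immediate_reward=1.0, next_value=10.0, gamma=gamma)

print(greedy_move, patient_move)
```

The tempting move looks better if you only count the immediate reward, but once the value of the next position is included, the patient move wins.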

INFO: For more details about the Bellman Equation, you can visit the page: Bellman Equation


2.4 Value functions (V and Q) as tools, not mathematical formulas

The agent does not know from the start what is good or bad. It needs a “map” that tells it:

  • how good a place is (V),
  • how good an action is in a place (Q).
  • V(s) says: “How good is it to be here?“
  • Q(s, a) says: “How good is it to do this here?“

It is like a video game:

  • V = how safe an area of the map is.
  • Q = which move is best in that area.

These functions are not the end goal. They are tools that help the agent make better decisions.
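
As a rough sketch of how the two "maps" relate, the example below stores Q as a small table and reads V from it. The table values and the state and action names are invented for illustration, and taking the maximum over actions assumes the agent always picks the best-looking action.

```python
# Hypothetical Q-table: q_table[state][action] = how good that action is there.
q_table = {
    "near_edge": {"left": -2.0, "right": 4.0},
    "center": {"left": 6.0, "right": 5.5},
}


def state_value(q_table, state):
    """V(s) for a greedy agent: the value of the best action available in s."""
    return max(q_table[state].values())


def best_action(q_table, state):
    """The action with the highest Q-value in s."""
    return max(q_table[state], key=q_table[state].get)


for state in q_table:
    print(state, state_value(q_table, state), best_action(q_table, state))
```

Q answers "what should I do here?", and V falls out of it as "how good is it to be here if I act on that answer?".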


2.5 Policy as the central object of RL

A policy is the answer to the question: “What do I do now, if I am here?“

Policy is the rule according to which the agent chooses actions. It is the final behavior of the agent.

Policy can be:

  • deterministic – always choose the same action in the same situation,
  • stochastic – choose actions with some randomness, so other options are also tried.

As an analogy, policy is like your “style of play“:

  • you always play defensively,
  • or sometimes take risks for a bigger win.
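
The two kinds of policy can be sketched side by side. The rule "push toward the side the pole leans to" and the value `epsilon=0.2` are assumptions chosen only to keep the example short; they are not a recommended controller.

```python
import random

ACTIONS = ("left", "right")


def deterministic_policy(pole_angle):
    """Always push toward the side the pole is leaning to."""
    return "right" if pole_angle > 0 else "left"


def stochastic_policy(pole_angle, rng, epsilon=0.2):
    """With probability epsilon pick a random action, otherwise act as above."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return deterministic_policy(pole_angle)


rng = random.Random(0)
deterministic_choices = {deterministic_policy(0.05) for _ in range(200)}
stochastic_choices = {stochastic_policy(0.05, rng) for _ in range(200)}

print(deterministic_choices)
print(stochastic_choices)
```

Given the same situation 200 times, the deterministic policy produces a single action, while the stochastic one can produce more than one.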

2.6 The connection between MDP and classes of algorithms

All RL algorithms play the “same game” (the same MDP), but the difference is that they use different strategies to learn.

There are three main families:

  • Value-based – learn how good things are
  • Policy-based – learn directly what to do
  • Actor-Critic – combine both ideas

The difference between algorithms is not “what problem they solve,” but “how they try to solve the same problem.”


3. Episodic vs. Continuous Tasks + Time Horizon

Some problems end (Episodic). Others never end (Continuous).

An MDP is not completely defined by (S, A, P, R, γ) alone. One essential dimension is missing: how time flows.

The applications we want to apply RL to either:

  • have a beginning and an end,
  • or never end.

RL works differently in these two cases.

The goal of this chapter is to help you understand:

  • whether the problem is a “game with an end” or a “game without an end”,
  • how much the agent should care about the future,
  • why sometimes the agent behaves well in the short term but badly in the long term.

After this chapter, the reader understands:

  • what episodic vs. continuous means,
  • how gamma influences the agent’s behavior,
  • why choosing the wrong horizon can completely ruin learning.

3.1 The difference between “ending” and “never ending“

There are applications that:

  • end (the agent wins or loses),
  • never end (the agent is in control all the time).

In Reinforcement Learning:

  • an application that ends is called episodic,
  • an application that does not end is called continuous (the RL literature usually says “continuing”, to avoid confusion with continuous action spaces).

Examples

  • CartPole: the game ends when the pole falls. This is episodic control.
  • A walking robot: there is no “game over”; it must keep walking well indefinitely. This is continuous control.

The RL setup must specify from the start which kind of task it is, because the rules are different.


3.2 What is an episode and why does it exist

An episode is a complete round of control, from start to finish.

At the end of an episode the environment is reset, after which the agent starts again.

Why do we need episodes?

  • to measure progress,
  • to compare performance,
  • to stop dangerous or unnecessary situations.

3.3 Terminal states

A terminal state is a point at which the episode stops. It’s like when you “lost” or “won.“

Some problems need these clear stops. Others don’t.


3.4 Time horizon as a design decision

Time horizon means “How far into the future do we look when making a decision?“

  • Short horizon: only what happens immediately matters.
  • Long horizon: what happens much later matters.

3.4.1 Why is the time horizon not arbitrary?

If chosen incorrectly:

  • the agent can learn bad tricks,
  • ignore important consequences,
  • or become unstable.

The time horizon should match “the nature of the problem“, not your preferences.


3.5 Gamma as a way of thinking about the future

The choice of Discount Factor (Gamma | γ) is related to the answer to the question: “How much do I care about what happens later?”

3.5.1 Small gamma: the agent only thinks about “now.”

A small γ suits applications with a short-term focus. As a rough guide, this means γ between 0.1 and 0.5. These are reactive applications with a short horizon, where the distant future is uncertain or irrelevant. The agent “reacts quickly”. Examples:

  • fast arcade games (Pong, Breakout): Reactions in seconds, focus on immediate score.
  • industrial control (balancing small robots, like CartPole with short times): Instant adjustments, without complex plans.
  • alert systems (real-time fraud detection): Immediate penalties for errors, not over long sessions.

3.5.2 Large gamma: the agent thinks about “later.”

A large γ suits applications with a long-term focus. As a rough guide, this means γ between 0.95 and 0.99. These are applications with a long horizon, where a bad decision today affects tomorrow. The agent “thinks strategically.”

Applications such as:

  • strategy games (chess, Go): Plan moves over dozens of rounds.
  • autonomous navigation (self-driving cars): Trajectories in minutes/hours, avoiding distant collisions.
  • financial optimization (investment portfolios): Maximizing profit in years, not days.

Gamma completely changes the “personality” of the agent.
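
A short numeric sketch shows this change of “personality”. The reward sequence is made up (a small reward now, a large one 20 steps later), and `1 / (1 - γ)` is used only as a common rule of thumb for how many steps ahead the agent effectively looks.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward at step t is weighted by gamma**t."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical episode: +1 immediately, then nothing, then +10 after 20 steps.
rewards = [1.0] + [0.0] * 19 + [10.0]

for gamma in (0.3, 0.99):
    effective_horizon = 1.0 / (1.0 - gamma)  # rule-of-thumb look-ahead, in steps
    value = discounted_return(rewards, gamma)
    print(f"gamma={gamma}: horizon ~ {effective_horizon:.1f} steps, return = {value:.3f}")
```

With γ = 0.3 the late reward of 10 is practically invisible to the agent, while with γ = 0.99 it dominates the return.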

INFO: To further understand the role of discount factor in RL, I recommend you to read this tutorial: Discount Factor (gamma) Explained With Q-Learning + CartPole


3.6 What happens when you choose wrong

If:

  • the episode is too short,
  • the horizon is poorly chosen,
  • the gamma does not fit,

then problems arise such as:

  • instability – the agent behaves chaotically,
  • slow learning – it seems like it is not progressing,
  • bad but stable policies – the agent always does the same wrong thing.

These problems do not come from the algorithm. They come from the way the MDP components are defined.


3.7 Comparative examples

CartPole (episodic):

  • has a beginning and an end,
  • gamma can be large,
  • the goal is clear: don’t let the pole fall.

Robot that walks (continuous):

  • has no “end“,
  • must always be stable,
  • gamma must be chosen carefully for control.

The same RL method behaves differently in the two cases.


4. MDP vs POMDP: What happens when you don’t see the whole state?

On the left you see everything → you decide easily. 
On the right you see only part → you have to guess.

Imagine playing a game where:

  • there is fog on the screen,
  • we only see part of the map,
  • some information is hidden or delayed.

Such a game is no longer fair and clear, even if the rules are the same.

This is what happens in real life:

  • AI agents don’t see everything,
  • sensors make mistakes,
  • information comes with a delay.

This is where POMDP comes in.

The goal of this chapter is to show:

  • why almost all real problems are POMDP,
  • why the Markov hypothesis breaks down in practice,
  • what problems arise when the agent doesn’t see everything.

After this chapter, the reader understands:

  • why RL goes smoothly in perfect simulations,
  • why it becomes difficult in the real world,
  • why POMDP is the real model of the world.

First of all, I will explain some concepts as clearly as possible. These explanations will help us move more easily from theory to practice. From MDP to POMDP.

Environment = everything that is not the agent. Not the algorithm, not the network, not the observation.

Concrete example of what counts as the environment in the case of CartPole:

Environment = physical system:

  • cart
  • pole
  • gravity
  • equations of motion

State space = positions, velocities, angles.

Observation = what Gymnasium gives as input to the agent.

State space = all possible real states of the environment. It is the minimum complete information about the world that makes the future independent of the past.

Observation space = what state information the agent receives. It is just what the agent sees in that state, through sensors, possibly incomplete or noisy.

Markov property = the current state contains all the information needed to predict the future, once the action is known. A POMDP arises when the observation does NOT contain enough information to be Markov.


4.1 Why complete observability is a fragile assumption

MDP assumes that:

  • the agent sees “everything that matters” about the situation they are in,
  • nothing important is hidden,
  • the information is correct and timely.

It’s like playing a game where:

  • you see the whole map,
  • you see all the pieces,
  • there are no surprises.

In reality, this rarely happens. Why? Because the real world is:

  • big,
  • complex,
  • imperfectly observable.

MDP is a convenient assumption, not a faithful description of reality.


4.2 What is partial observability, intuitively

In a POMDP, the agent does not see the complete real state; it only sees an observation.

4.2.1 Real state vs observation

  • Real state: how the world really is.
  • Observation: what the agent manages to see.

In other words, it is like driving a car at night in fog: you only see what is illuminated by the headlights, but the road continues beyond what you see.

The agent has to make decisions without seeing everything.


4.3 Real Sources of POMDP

Partial observability occurs for very concrete reasons:

4.3.1 Noise

  • sensors are not perfect,
  • measurements are approximate.

4.3.2 Latency

  • information arrives later,
  • decisions are based on old data.

4.3.3 Discrete Sampling

  • the world is constantly changing,
  • the agent only sees it from time to time.

4.3.4 Hidden Variables

  • some things cannot be measured directly,
  • for example: internal forces, intentions, hidden states.

All of these prevent the agent from seeing the complete state.
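
Each of the four sources can be imitated with a few lines of code applied to a true state. The sketch below is an illustration only: the function and class names, the noise level `sigma`, and the two-value state `[position, velocity]` are assumptions made for this example.

```python
import random
from collections import deque


def add_noise(state, rng, sigma=0.05):
    """Noise: each measurement is perturbed by Gaussian noise."""
    return [value + rng.gauss(0.0, sigma) for value in state]


class DelayedSensor:
    """Latency: the agent receives the state from `delay` steps ago."""

    def __init__(self, delay, initial_state):
        # Pre-fill the buffer so the first readings return the initial state.
        self.buffer = deque([list(initial_state)] * (delay + 1), maxlen=delay + 1)

    def observe(self, state):
        self.buffer.append(list(state))  # the oldest entry is dropped automatically
        return self.buffer[0]


def subsample(states, every):
    """Discrete sampling: the agent sees only every `every`-th state."""
    return states[::every]


def hide_variables(state, visible_indices):
    """Hidden variables: only some components of the state are reported."""
    return [state[i] for i in visible_indices]


rng = random.Random(0)
true_states = [[0.1 * t, 1.0] for t in range(5)]  # [position, velocity]

sensor = DelayedSensor(delay=2, initial_state=true_states[0])
delayed = [sensor.observe(s) for s in true_states]
noisy = add_noise(true_states[-1], rng)
position_only = hide_variables(true_states[-1], visible_indices=[0])
sparse = subsample(true_states, every=2)

print(delayed[-1], noisy, position_only, len(sparse))
```

In every case the true state keeps evolving as before; only what reaches the agent changes.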


4.4 Consequences for RL

When the agent does not see everything, it can make good decisions at one time, but bad decisions at another, even from the same observation.

This leads to:

  • unstable policies – behavior changes a lot,
  • inconsistent learning – the agent seems to learn and then “forget,”
  • need for memory – the agent must remember what it saw before.

If we were trying to play chess but were shown the board for only a second after each move, we would most likely make bad decisions.


4.5 What is a POMDP?

A POMDP is a decision problem in which the agent does not see everything, but must guess the real state from observations.

POMDP is the correct way to think about the real world, because the real world is incompletely observable.

MDP is a useful simplification. POMDP is reality.


4.6 Why RL “works” in perfect simulations and “breaks” in reality

In simulations there is no noise by default. There is no latency. The agent sees the perfect state.

In reality the information is incomplete. Decisions are based on estimates, while errors accumulate.

This is why an agent trained in a perfect MDP can fail completely in the real world.


5. How to map a real problem to a POMDP (example: CartPole)

In this chapter, we use the same learning environment twice:

  1. In the first version, the environment is an MDP, meaning the agent sees everything.
  2. In the second version, the agent sees only part of the state (POMDP).

It is the best way to understand the difference.

The goal of this chapter is to teach you:

  • what a perfect problem looks like (CartPole MDP),
  • what the same problem looks like, but realistic (CartPole POMDP),
  • how the agent’s behavior changes,
  • how we can fix the lack of information through stacking, memory or better observations.

After this chapter, the reader will know how to apply RL in practice, not just in theory.


5.1 CartPole as an ideal MDP

CartPole is one of the most used environments in Reinforcement Learning not because it is realistic, but because it is almost a perfect MDP.

This makes it ideal for learning the fundamentals of RL.


5.2 What the agent sees in CartPole

In CartPole, the agent receives at each step the complete state of the system, in the form of a numerical vector.

In the classical form, the state is:

  • the position of the cart on the X axis,
  • the velocity of the cart,
  • the angle of the pole with respect to the vertical,
  • the angular velocity of the pole.

These four values completely describe the physical situation of the system at that moment.

The agent:

  • knows exactly where the cart is,
  • knows exactly how fast it is moving,
  • knows exactly how much the pole is tilted,
  • knows whether the pole is falling faster or slower.

Nothing relevant is hidden.

5.2.1 Why CartPole is completely observable

A system is completely observable when the observed state, together with the action that will be applied, contains all the information needed to predict the future.

In CartPole:

  • the dynamics are deterministic (or nearly deterministic),
  • there are no hidden internal variables,
  • there are no noisy sensors,
  • there is no latency between measurement and action.

If the agent sees the same state and applies the same action, the future evolution will be the same (within numerical errors).
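
To see this determinism in code, here is a sketch of one CartPole simulation step written as a pure function. The constants and equations follow the classic cart-pole formulation as commonly implemented, but treat them as assumptions of this example; the real Gymnasium environment may differ in details. The function name `cartpole_step` is hypothetical.

```python
import math

# Physical constants assumed for this sketch.
GRAVITY = 9.8
CART_MASS = 1.0
POLE_MASS = 0.1
POLE_HALF_LENGTH = 0.5
FORCE = 10.0
DT = 0.02  # seconds per step


def cartpole_step(state, action):
    """Advance (x, x_dot, theta, theta_dot) by one Euler step. action: 0=left, 1=right."""
    x, x_dot, theta, theta_dot = state
    force = FORCE if action == 1 else -FORCE
    total_mass = CART_MASS + POLE_MASS
    pole_mass_length = POLE_MASS * POLE_HALF_LENGTH
    sin_t, cos_t = math.sin(theta), math.cos(theta)

    temp = (force + pole_mass_length * theta_dot ** 2 * sin_t) / total_mass
    theta_acc = (GRAVITY * sin_t - cos_t * temp) / (
        POLE_HALF_LENGTH * (4.0 / 3.0 - POLE_MASS * cos_t ** 2 / total_mass)
    )
    x_acc = temp - pole_mass_length * theta_acc * cos_t / total_mass

    return (
        x + DT * x_dot,
        x_dot + DT * x_acc,
        theta + DT * theta_dot,
        theta_dot + DT * theta_acc,
    )


state = (0.0, 0.0, 0.02, 0.0)
next_a = cartpole_step(state, 1)
next_b = cartpole_step(state, 1)
print(next_a == next_b)
```

The function takes only the current state and the action as inputs. Nothing about earlier steps is needed to compute the next state, which is exactly what the Markov property requires.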

5.2.2 Connection with the Markov property

CartPole perfectly respects the Markov property, namely that the future depends only on the current state and the current action, not on the complete history.

Therefore CartPole is a valid MDP, not a POMDP.


5.3 Why the past doesn’t matter anymore in CartPole

A very important point for understanding RL is that in CartPole:

  • you don’t need to know what happened 10 steps ago,
  • you don’t need to memorize the history,
  • you don’t need Long Short-Term Memory (LSTM) or other forms of memory.

This is because the position and velocity are a complete summary of the cart’s motion, and the angle and angular velocity are a complete summary of the pole’s motion.

It’s like seeing the exact position and velocity of a ball. It doesn’t matter how it was thrown anymore. You can calculate exactly where it will end up.


5.4 Why CartPole is an “easy” problem for RL

CartPole is considered “easy” for several fundamental reasons:

5.4.1 The state space is small

CartPole has only 4 dimensions with well-scaled values and no redundancy.

5.4.2 The action space is discrete and small

  • only two actions: left / right,
  • there is no fine, continuous control.

5.4.3 The reward is dense and clear

  • the agent receives +1 for each step in which the pole does not fall,
  • the goal is simple: maintain balance as long as possible.

5.4.4 There is no partial observability

  • the agent sees everything that matters,
  • it does not have to guess anything.

5.4.5 The episode is clearly defined

  • it starts in a standard initial state,
  • it ends when the pole falls or the cart goes out of bounds.

All this makes RL algorithms learn quickly and stably.


5.5 Interpretation of the Graph: CartPole in MDP Mode (fully observable) and Demo Results

CartPole in MDP mode (fully observable)

The graph represents the evolution of rollout/ep_rew_mean over time, i.e.:

  • the average reward per episode,
  • measured over successive training windows,
  • as a function of the total number of steps (timesteps).

In CartPole, the reward is +1 for each step in which the pole does not fall. In this case, ep_rew_mean equals the average episode length, measured in steps.

A score of:

  • ~200 –> the agent maintains balance for ~200 steps,
  • ~500 –> the agent almost always reaches the maximum episode length (CartPole-v1 cuts episodes off at 500 steps).
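
To make the meaning of rollout/ep_rew_mean more tangible, the sketch below computes a rolling mean of episode rewards by hand. The episode lengths are invented, the helper name is hypothetical, and the window of 3 episodes is chosen only to keep the example small (training libraries typically average over a much larger window, such as the last 100 episodes).

```python
from collections import deque


def rolling_episode_reward_mean(episode_rewards, window=100):
    """Return the mean of the most recent `window` episode rewards after each episode."""
    recent = deque(maxlen=window)
    means = []
    for reward in episode_rewards:
        recent.append(reward)
        means.append(sum(recent) / len(recent))
    return means


# Hypothetical episode lengths during training; reward is +1 per step.
episode_lengths = [20, 35, 80, 200, 500, 500]
episode_rewards = [float(n) for n in episode_lengths]

means = rolling_episode_reward_mean(episode_rewards, window=3)
print(means)
```

Because the curve is an average over several recent episodes, it rises more slowly than the best single episode and keeps moving even after the agent first reaches the maximum.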

Conclusion from the graph

This graph is consistent with the following:

  • CartPole behaves as a well-formed MDP,
  • the state is fully observable,
  • no memory or history is needed.

DQN works well when the Markov assumption is respected. It learns stably, and converges to a near-optimal policy.

Oscillations do not indicate a modeling problem. They are normal in DQN and arise from exploration and off-policy updates.

This is the correct baseline. Any subsequent degradation (noise, delay) can be directly compared to this behavior.

Demo Results — CartPole as an Ideal MDP

Demo Results — CartPole as an Ideal MDP

In the evaluation phase, the trained agent achieves a reward of 500 in two of three demo episodes.

This means that:

  • the pole remains balanced for the maximum possible duration,
  • the policy behaves consistently across episodes,
  • no instability or degradation appears during execution.

Because CartPole is a fully observable MDP, the agent does not rely on memory or past observations. Each decision is made solely based on the current state, which already contains all the information required for optimal control.

Reaching the maximum score in two of the three demo episodes indicates that:

  • the learning process has converged,
  • the learned policy is stable,
  • the environment formulation as an MDP is correct.

This result serves as a clean baseline for later comparison with partially observable versions of the same task.


5.6 Deliberately transforming an MDP into a POMDP

Up until now, our agent has been “playing a game” under ideal conditions:

  • seeing everything that happens,
  • seeing the correct information,
  • seeing the information in time.

This is an ideal MDP. In the real world, the “game” is not like that.

To understand what happens in practice, we intentionally spoil the environment, step by step.

Not because we want to cheat, but because we want to demonstrate what the “real world” looks like. This controlled alteration turns the MDP into a POMDP (partially observable Markov decision process).

5.6.1 Removing some variables

At this step, the agent no longer sees everything.

It’s like playing a game where:

  • you see where the ball is,
  • but you don’t see how fast it’s moving.

You can guess what’s coming next, but you don’t know for sure.

This is called partial observability.

What we do specifically

In CartPole, before, the agent saw:

  • position,
  • speed,
  • angle,
  • angular velocity.

Now we hide some of this information from it –> in our case the two velocity components (cart speed and pole angular velocity).

The agent:

  • only sees where it is,
  • no longer sees how it moves.

What changes

The same observation can come from:

  • a slow movement,
  • or a fast movement.

The agent no longer knows exactly what is coming.
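
A small sketch makes this ambiguity visible. It is plain Python with illustrative numbers, not output from the real environment, and the function hide_velocity is a simplified stand-in for the wrapper used in the script.

```python
def hide_velocity(state):
    """Keep only position and angle from (x, x_dot, theta, theta_dot)."""
    x, _x_dot, theta, _theta_dot = state
    return (x, theta)


# Same position and angle, very different motion (values are illustrative).
slow_state = (0.10, 0.05, 0.02, 0.01)
fast_state = (0.10, 1.50, 0.02, 2.00)

obs_slow = hide_velocity(slow_state)
obs_fast = hide_velocity(fast_state)

print(obs_slow == obs_fast)      # do the two observations match?
print(slow_state == fast_state)  # do the underlying states match?
```

The agent receives the same input in both cases, so it has to pick the same action, even though one situation is calm and the other is about to fail.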

Difference between the two graphs

[Figure: CartPole as MDP vs CartPole without velocity]

Orange (MDP): “I know exactly what is happening now and what is coming next.”

Blue (POMDP): “I see something, but I don’t know if it is dangerous or not.”

Therefore:

  • orange goes up,
  • blue gets stuck at the bottom.

Why is the blue line not going up?

Very important:

  • It is not the algorithm’s fault,
  • It is not the number of steps’ fault.

The problem is that:

  • missing information cannot be invented,
  • without velocity, the future cannot be predicted correctly.

It is like:

  • driving a car,
  • you see the road,
  • but you don’t see the speed on the dashboard.

Lesson of this graph

If the agent doesn’t see everything that matters, learning stops early.

MDP (orange):

  • fast learning,
  • stable,
  • almost perfect.

POMDP (blue):

  • slow learning,
  • limited,
  • with a clear ceiling.

When velocity is hidden, the agent can no longer predict the future correctly. Even with long training, performance remains limited, not because the algorithm is weak, but because the information is incomplete.

Demo Results — CartPole with Missing Variables (Hidden Velocity)

[Figure: CartPole with Missing Variables (Hidden Velocity)]

In the demo phase, the agent achieves rewards of 44, 46, and 32 across three episodes.

This behavior is expected. Because velocity information is hidden, the agent no longer has access to the full state of the system.

Different real situations can look identical from the agent’s point of view, making precise control impossible.

As a result:

  • the policy remains unstable,
  • performance varies significantly between episodes,
  • the agent can no longer consistently maintain balance.

This demonstrates that:

  • the limitation does not come from the algorithm,
  • but from missing information in the observation.

The agent can still act, but it must guess the future instead of predicting it.

5.6.2 Adding noise

The agent sees what is happening, but sees it inaccurately. To picture what the agent sees, imagine that:

  • you look through a dirty window,
  • you see the position of things,
  • but not very clearly.

Sometimes you see it right, sometimes you see it wrong.

What we do specifically in the Python program:

Instead of giving the agent:

position = 0.25

we give it:

position = 0.25 + noise

This noise:

  • is small,
  • random,
  • different from step to step.
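
As a minimal sketch of this idea, the snippet below adds Gaussian noise to a fixed state twice. It uses only the standard library. The noise level 0.02 matches the --obs_noise_std example of the script, while the function name add_noise and the state values are illustrative assumptions.

```python
import random


def add_noise(obs, noise_std, rng):
    """Return a copy of `obs` with independent Gaussian noise on each entry."""
    return [value + rng.gauss(0.0, noise_std) for value in obs]


rng = random.Random(0)               # fixed seed so the run is repeatable
true_state = [0.25, 0.0, 0.02, 0.0]  # illustrative CartPole-like state

first = add_noise(true_state, noise_std=0.02, rng=rng)
second = add_noise(true_state, noise_std=0.02, rng=rng)

print(first)
print(second)
print(first != second)  # same true state, compared across two readings
```

The true state never changed between the two calls. Only the readings did.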

What changes

The agent:

  • no longer receives the same observation for the same real state,
  • cannot be sure what is really happening.

Even if the world is the same, the information is imperfect.

Difference between the two graphs

If we looked at the two graphs like two children trying to balance a pole, we would have:

The orange child (MDP) sees the pole clearly, without any problems.

The red child (POMDP with noise) sees the pole, but the image shakes a little all the time.

They both train equally hard.

What the ORANGE line does (MDP, without noise)

The orange agent:

  • sees exactly where the pole is,
  • learns better and better,
  • gets to the very top at the end.

Everything is clear and stable.

What the RED line does (noise)

The red agent:

  • sees the pole,
  • but the position moves a little with each blink,
  • doesn’t know if the pole is really moving or if it’s just the image.

MDP (without noise): “What I see is real.”

POMDP (with noise): “What I see might be a little wrong.”

That little mistake:

  • adds up,
  • produces wrong moves,
  • breaks stability.

What does this graph teach us? Even a little noise makes RL much harder.

The agent:

  • can still learn,
  • but slower,
  • more unstable,
  • and with a clear ceiling.

The connection to the real world

In real life:

  • sensors are not perfect,
  • measurements are noisy,
  • data is never “clean.”

This graph shows why RL can work very well in simulation but requires a lot of care in reality.

Demo Results — CartPole with Noisy Observations

[Figure: CartPole with Noisy Observations]

In the demo, the agent gets rewards of 185, 132, and 127.

This means:

  • sometimes the agent manages to keep the pole balanced for a while,
  • sometimes it fails much earlier,
  • the result changes from one episode to another.

Why does this happen?

Because the agent sees the world with noise. What it observes is a little bit wrong every time, so it sometimes reacts correctly and sometimes reacts too late or too much.

The agent is not bad or broken. It is doing its best with imperfect information.

The noise is part of the environment, not part of the model. To evaluate the policy correctly, the same observation noise must be applied during both training and testing.

5.6.3 Introducing latency

At this step, the agent sees the past, not the present.

This is the case where the agent sees what happened a moment ago, but has to decide what to do now. It is like playing a game with lag.

What we do specifically

Instead of giving the agent the current state, we give it the state from 1-2 steps earlier. The agent reacts to a world that no longer exists at that exact moment.

What changes

Even if the observation is correct, it is old. The agent has to guess what happened in the meantime.
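
The delay can be sketched with a small buffer. This is a simplified, standalone version of the idea and not the wrapper from the script. The class name DelayBuffer and the position values are illustrative assumptions.

```python
from collections import deque


class DelayBuffer:
    """Return the observation from `delay_steps` steps ago."""

    def __init__(self, delay_steps):
        self.delay_steps = delay_steps
        self.buffer = deque(maxlen=delay_steps + 1)

    def reset(self, first_obs):
        # Pre-fill so the first steps simply repeat the first observation.
        self.buffer.clear()
        for _ in range(self.delay_steps + 1):
            self.buffer.append(first_obs)
        return first_obs

    def push(self, obs):
        self.buffer.append(obs)  # newest goes in, oldest falls out
        return self.buffer[0]    # oldest observation still kept


delay = DelayBuffer(delay_steps=2)
delay.reset(0.0)
true_positions = [0.1, 0.2, 0.3, 0.4, 0.5]
seen = [delay.push(p) for p in true_positions]
print(seen)
```

With a delay of 2, the agent is always acting on a position that is two steps old, so by the time it pushes the cart the pole has already moved on.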

Difference between the two graphs

[Figure: Graph for CartPole MDP and CartPole with Latency]

Imagine trying to catch a ball, but you see where the ball was a second ago, not where it is now.

Orange line (MDP): sees everything in time –> reacts correctly –> learns well.

Blue line (latency): sees the past –> reacts too late –> makes wrong moves.

Therefore:

  • the blue line rises a little,
  • then falls,
  • then oscillates,
  • but never reaches the top.

When the agent sees the world too late, it reacts to the past instead of the present, so learning becomes unstable and limited.

Demo Results — CartPole with Latency

[Figure: Demo, CartPole with Latency]

In the demo, the agent gets rewards of 12, 28, and 88.

This means:

  • sometimes the agent fails almost immediately,
  • sometimes it manages to balance the pole a little longer,
  • but the result changes a lot from one episode to another.

Why does this happen?

Because the agent sees the world too late. It reacts to where the pole was, not where it is now.

So:

  • sometimes it moves too late,
  • sometimes it moves too much,
  • sometimes it gets lucky for a few steps.

5.6.4 Fully Realistic POMDP

[Figure: CartPole MDP vs CartPole Fully Realistic POMDP]

In this setting, the agent faces all the problems at the same time:

  • it cannot see everything (missing variables),
  • what it sees is a little wrong (noise),
  • and what it sees arrives too late (latency).

It is like:

  • playing a game,
  • where part of the screen is hidden,
  • the image is blurry,
  • and the screen is delayed.
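
The three degradations can be chained in a few lines. The sketch below is a simplified, standard-library version that applies them in the same order as the script (hide, then noise, then delay). The function name make_observer and the state values are illustrative assumptions.

```python
import random
from collections import deque


def make_observer(noise_std, delay_steps, seed=0):
    """Build a function mapping the true state to what the agent receives."""
    rng = random.Random(seed)
    history = deque(maxlen=delay_steps + 1)

    def observe(state):
        # 1. Hide the two velocity components.
        x, _x_dot, theta, _theta_dot = state
        # 2. Add independent Gaussian noise to what is left.
        noisy = (x + rng.gauss(0.0, noise_std),
                 theta + rng.gauss(0.0, noise_std))
        # 3. Delay: return the oldest observation still in the buffer.
        history.append(noisy)
        return history[0]

    return observe


observe = make_observer(noise_std=0.02, delay_steps=2)
states = [(0.1 * t, 1.0, 0.01 * t, 0.5) for t in range(5)]
for t, state in enumerate(states):
    print(t, "true x:", round(state[0], 2), "agent sees:", observe(state))
```

Each stage removes a different kind of information, and the losses stack on top of each other.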

The agent is not bad or broken. It is simply trying to make decisions with very limited and imperfect information.

When information is missing, noisy, and delayed at the same time, learning almost stops, no matter how long the agent is trained.

Not because:

  • the agent is stupid,
  • the algorithm is wrong,
  • it doesn’t have enough steps.

But because it no longer knows what’s really going on.

Why is this graph so important?

It clearly shows why RL often works well in simulation and struggles in reality. It also shows why the problem is not the algorithm, but the information.

Demo Results — Fully Realistic POMDP

[Figure: Demo, CartPole Fully Realistic POMDP]

In the demo, the agent gets rewards of 10, 12, and 17.

This means:

  • the pole falls almost immediately,
  • the agent barely has time to react,
  • performance is very low and similar in all episodes.

Why does this happen?

Because the agent:

  • does not see everything (missing velocity),
  • sees wrong information (noise),
  • sees it too late (latency).

It is like trying to balance a pole with one eye closed while the image is blurry and everything you see is already from the past.

This result is not a failure. It shows that:

  • the algorithm is not the problem,
  • more training would not fix it,
  • the agent simply does not have enough reliable information to act correctly.
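
To compare the four degraded settings side by side, the demo rewards reported above can be summarized with a few lines of Python. The numbers are the ones quoted in the demo sections. The MDP baseline is left out because only two of its three rewards are stated.

```python
from statistics import mean

# Demo rewards reported in the sections above (three episodes each).
demo_rewards = {
    "hidden velocity": [44, 46, 32],
    "noise": [185, 132, 127],
    "latency": [12, 28, 88],
    "fully realistic": [10, 12, 17],
}

summary = {
    name: {"mean": mean(r), "min": min(r), "max": max(r)}
    for name, r in demo_rewards.items()
}

for name, stats in summary.items():
    print(f"{name:16s} mean={stats['mean']:.1f} "
          f"min={stats['min']} max={stats['max']}")
```

Three episodes are a very small sample, so these averages are only indicative. The --episodes flag of the script can be raised to get more reliable numbers.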

5.7 From MDP to POMDP: CartPole with Hidden State, Noise, and Latency

This script is designed to train and evaluate a DQN agent on CartPole while explicitly showing the difference between an ideal MDP and increasingly realistic POMDP settings.

Instead of changing the algorithm, the script keeps the same DQN setup and progressively changes what the agent can observe.

You can run CartPole in:

  • a pure MDP setting, where the agent sees the full state,
  • or several POMDP variants, where information is missing, noisy, or delayed.

The goal of this script is not to achieve the best possible score, but to demonstrate how partial observability alone affects learning and behavior.

By toggling simple command-line flags, you can reproduce:

  • hidden state variables,
  • noisy observations,
  • observation latency,
  • or all three combined.

This makes the script a controlled experiment that illustrates why Reinforcement Learning often works well in simulations but struggles in real-world systems.

"""
DQN Training and Demo Script for CartPole
-----------------------------------------------
Train or test a DQN agent using Stable Baselines3.

Supports:
- Pure MDP (baseline)
- Partial observability (POMDP) via:
  1. Hidden variables
  2. Observation noise
  3. Observation latency

TRAINING EXAMPLES:
    python dqn_cartpole.py --train
    python dqn_cartpole.py --train --hide_velocity
    python dqn_cartpole.py --train --obs_noise_std 0.02
    python dqn_cartpole.py --train --obs_delay 2
    python dqn_cartpole.py --train --hide_velocity --obs_noise_std 0.02 --obs_delay 2

DEMO EXAMPLES:
    python dqn_cartpole.py --demo --model YOUR_MODEL_PATH.zip
    python dqn_cartpole.py --demo --model YOUR_MODEL_PATH.zip --hide_velocity
    python dqn_cartpole.py --demo --model YOUR_MODEL_PATH.zip --obs_noise_std 0.02
    python dqn_cartpole.py --demo --model YOUR_MODEL_PATH.zip --hide_velocity --obs_noise_std 0.02 --obs_delay 2

Author: Calin Dragos George
"""

import argparse
import os
import time
from collections import deque

import gymnasium as gym
import numpy as np
import torch

from stable_baselines3 import DQN
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.logger import configure


# ---------------------------------------------------------
# Utility: Set seeds for reproducibility
# ---------------------------------------------------------
def set_seed(seed):
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


# ---------------------------------------------------------
# Observation Wrappers (POMDP transformations)
# ---------------------------------------------------------
class HideVelocityWrapper(gym.ObservationWrapper):
    """
        * removes velocity components from CartPole observation
        * keeps only position and angle
    """
    def __init__(self, env):
        super().__init__(env)
        self.observation_space = gym.spaces.Box(
            low=np.array([-4.8, -0.418], dtype=np.float32),
            high=np.array([4.8, 0.418], dtype=np.float32),
        )

    def observation(self, obs):
        # obs = [x, x_dot, theta, theta_dot]
        return np.array([obs[0], obs[2]], dtype=np.float32)


class AddNoiseWrapper(gym.ObservationWrapper):
    """
        * adds Gaussian noise to observations
    """
    def __init__(self, env, noise_std):
        super().__init__(env)
        self.noise_std = noise_std

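    def observation(self, obs):
        # Draw fresh noise at every step. Cast back to float32 so the result
        # matches the observation space (NumPy would promote it to float64).
        noise = np.random.normal(0.0, self.noise_std, size=obs.shape)
        return (obs + noise).astype(np.float32)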


class ObservationDelayWrapper(gym.ObservationWrapper):
    """
        * returns observations from k steps in the past
    """
    def __init__(self, env, delay_steps):
        super().__init__(env)
        self.delay_steps = delay_steps
        self.buffer = deque(maxlen=delay_steps + 1)

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self.buffer.clear()
        for _ in range(self.delay_steps + 1):
            self.buffer.append(obs)
        return obs, info

    def observation(self, obs):
        self.buffer.append(obs)
        return self.buffer[0]


# ---------------------------------------------------------
# Create CartPole Environment
# ---------------------------------------------------------
def make_env(
    render=False,
    hide_velocity=False,
    noise_std=0.0,
    delay_steps=0,
):
    if render:
        env = gym.make("CartPole-v1", render_mode="human")
    else:
        env = gym.make("CartPole-v1")

    # Apply POMDP transformations
    if hide_velocity:
        env = HideVelocityWrapper(env)

    if noise_std > 0.0:
        env = AddNoiseWrapper(env, noise_std)

    if delay_steps > 0:
        env = ObservationDelayWrapper(env, delay_steps)

    env = Monitor(env)
    return env


# ---------------------------------------------------------
# Create DQN Model
# ---------------------------------------------------------
def create_model(env, lr, log_dir):
    model = DQN(
        "MlpPolicy",
        env,
        learning_rate=lr,
        buffer_size=100_000,
        learning_starts=1_000,
        batch_size=64,
        gamma=0.99,
        target_update_interval=1_000,
        train_freq=4,
        exploration_fraction=0.1,
        exploration_final_eps=0.02,
        verbose=1,
        tensorboard_log=log_dir,
    )

    logger = configure(log_dir, ["stdout", "tensorboard"])
    model.set_logger(logger)

    model.logger.record("hyperparams/learning_rate", lr)
    model.logger.record("hyperparams/buffer_size", 100_000)
    model.logger.record("hyperparams/batch_size", 64)

    return model

# ---------------------------------------------------------
# Build an experiment tag (used in the saved model filename)
# ---------------------------------------------------------
def build_experiment_tag(hide_velocity, noise_std, delay_steps):
    tags = []

    if hide_velocity:
        tags.append("noVel")

    if noise_std > 0.0:
        tags.append(f"noise{noise_std}")

    if delay_steps > 0:
        tags.append(f"delay{delay_steps}")

    if not tags:
        return "MDP"

    return "_".join(tags)

# ---------------------------------------------------------
# Train DQN
# ---------------------------------------------------------
def train_dqn(lr, timesteps, seed, hide_velocity, noise_std, delay_steps):
    set_seed(seed)

    env = make_env(
        hide_velocity=hide_velocity,
        noise_std=noise_std,
        delay_steps=delay_steps,
    )
    exp_tag = build_experiment_tag(hide_velocity, noise_std, delay_steps)

    timestamp = time.strftime("%Y%m%d-%H%M%S")
    log_dir = f"logs/DQN_CartPole_{timestamp}"
    os.makedirs(log_dir, exist_ok=True)

    print("\nTraining DQN on CartPole-v1")
    print(f"LR: {lr} | Seed: {seed}")
    print(f"hide_velocity={hide_velocity}, noise_std={noise_std}, delay={delay_steps}")
    print(f"Logging to: {log_dir}\n")

    model = create_model(env, lr, log_dir)
    model.learn(total_timesteps=timesteps)

    model_path = os.path.join(log_dir, f"DQN_CartPole_{exp_tag}.zip")
    model.save(model_path)

    print(f"\nModel saved to: {model_path}\n")
    env.close()


# ---------------------------------------------------------
# Demo DQN Model
# ---------------------------------------------------------
def run_demo(model_path, episodes, hide_velocity, noise_std, delay_steps):
    if not os.path.exists(model_path):
        print(f"\nModel not found: {model_path}\n")
        return

    env = make_env(
        render=True,
        hide_velocity=hide_velocity,
        noise_std=noise_std,
        delay_steps=delay_steps,
    )

    model = DQN.load(model_path)

    for ep in range(episodes):
        obs, _ = env.reset()
        done = False
        total_reward = 0

        while not done:
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, terminated, truncated, _ = env.step(action)
            total_reward += reward
            done = terminated or truncated

        print(f"Episode {ep + 1}: Reward = {total_reward}")

    env.close()


# ---------------------------------------------------------
# CLI
# ---------------------------------------------------------
if __name__ == "__main__":

    parser = argparse.ArgumentParser()
    parser.add_argument("--train", action="store_true")
    parser.add_argument("--demo", action="store_true")

    parser.add_argument("--lr", type=float, default=3e-4)
    parser.add_argument("--timesteps", type=int, default=200_000)
    parser.add_argument("--seed", type=int, default=1)

    parser.add_argument("--model", type=str, default=None)
    parser.add_argument("--episodes", type=int, default=3)

    # POMDP flags
    parser.add_argument("--hide_velocity", action="store_true")
    parser.add_argument("--obs_noise_std", type=float, default=0.0)
    parser.add_argument("--obs_delay", type=int, default=0)

    args = parser.parse_args()

    if args.train:
        train_dqn(
            args.lr,
            args.timesteps,
            args.seed,
            args.hide_velocity,
            args.obs_noise_std,
            args.obs_delay,
        )

    elif args.demo:
        if args.model is None:
            print("\nERROR: missing --model path\n")
        else:
            run_demo(
                args.model,
                args.episodes,
                args.hide_velocity,
                args.obs_noise_std,
                args.obs_delay,
            )

    else:
        print("\nPlease specify --train or --demo.\n")

5.8 How to fix a POMDP

Reinforcement Learning works not because the world is Markovian, but because we make it Markovian enough.

In the previous sections we saw that reinforcement learning degrades, or even stalls, when:

  • important variables are missing,
  • observations are noisy,
  • information arrives late.

This is not an algorithm problem. It is an information problem. This section explains the three standard ways in which, in practice, a POMDP is transformed into a problem that is “sufficiently Markovian” for RL to work.

5.8.1 Frame Stacking – reconstructing the recent past

In a POMDP, a single observation is not enough. But several consecutive observations can tell a story.

For example:

  • you don’t see the velocity,
  • but if you see the current position and the position 1 step ago,
  • you can infer the direction and approximate velocity.

Frame stacking does exactly this.
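
Here is a minimal sketch of the idea in plain Python. It keeps the last two observations and recovers an approximate velocity by finite difference. The class name FrameStack, the observation values, and the time step DT = 0.02 s (CartPole's integration step, as far as we know) are assumptions for illustration.

```python
from collections import deque


class FrameStack:
    """Keep the last `k` observations and expose them as one flat tuple."""

    def __init__(self, k):
        self.frames = deque(maxlen=k)

    def reset(self, first_obs):
        # Repeat the first observation so the stack is full from step 0.
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(first_obs)
        return self.stacked()

    def push(self, obs):
        self.frames.append(obs)
        return self.stacked()

    def stacked(self):
        # Oldest frame first, newest frame last.
        return tuple(value for frame in self.frames for value in frame)


DT = 0.02  # assumed time step in seconds

stack = FrameStack(k=2)
stack.reset((0.00, 0.000))          # (position, angle)
state = stack.push((0.01, 0.002))

prev_x, prev_theta, x, theta = state
estimated_velocity = (x - prev_x) / DT
print(state)
print(estimated_velocity)
```

In practice the network is given the stacked tuple directly and learns the difference by itself. The explicit division is only there to show that the velocity information is really contained in two consecutive frames.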

When it works well

  • the observation lacks velocities or derivatives,
  • latency is low,
  • dynamics are relatively slow.

Limitations

  • increases state size
  • does not scale well for complex dynamics
  • does not “understand” the past, just concatenates it

Frame stacking is a simple and effective solution, but not a smart one.

5.8.2 Memory – learning an internal state

Instead of giving the agent the raw past, you let it remember what matters.

It is the difference between being given a log with the last 5 pages, and having a memory that retains the important things.

What it means technically

Memory architectures are used, such as:

  • LSTM
  • GRU
  • Recurrent Policies (DRQN, Recurrent PPO etc.)

The agent:

  • receives observations one by one,
  • updates an internal state,
  • uses this state for decision making.
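
The loop above can be illustrated with a toy recurrent update. This is not an LSTM or a GRU: it is a single hand-written recurrent unit with fixed, hypothetical weights, whereas real recurrent policies learn their weights during training. It only shows how an internal state lets the same observation mean different things depending on what came before.

```python
import math


class TinyRecurrentState:
    """One recurrent unit: h_new = tanh(w_hidden * h + w_obs * obs)."""

    def __init__(self, w_obs=1.0, w_hidden=0.5):
        # Fixed illustrative weights; a real network would learn them.
        self.w_obs = w_obs
        self.w_hidden = w_hidden
        self.h = 0.0

    def update(self, obs):
        self.h = math.tanh(self.w_hidden * self.h + self.w_obs * obs)
        return self.h


def final_state(observations):
    """Feed observations one by one and return the final internal state."""
    memory = TinyRecurrentState()
    for obs in observations:
        memory.update(obs)
    return memory.h


h_a = final_state([0.0, 0.1])  # angle was 0.0, is now 0.1
h_b = final_state([0.2, 0.1])  # angle was 0.2, is now 0.1
print(h_a, h_b)
```

Both histories end with the same observation (0.1), but the internal states differ, so a policy reading the internal state can tell "moving away from center" apart from "coming back to center".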

What problems it solves:

  • severe partial observability
  • high latency
  • persistent noise
  • completely hidden variables

Memory allows the agent to:

  • filter out noise,
  • integrate information over time,
  • reconstruct the internal dynamics of the environment.

Limitations

  • more unstable training
  • harder to debug
  • requires more data
  • more complex implementation

Memory is the most powerful solution, but also the most expensive.

5.8.3 Enriched observations – adding useful information

Sometimes, the problem is not that RL doesn’t work, but that the agent doesn’t get what it needs.

Instead of fixing the algorithm, fix what the agent observes.

What it means technically

Add to the observation:

  • Estimates (e.g. estimated velocity),
  • Filters (e.g. averages, derivatives),
  • Additional sensors,
  • Computed state information.

Examples:

  • IMU + encoder in robotics,
  • Kalman filters,
  • Derived signals,
  • Timestamps.
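
As one example of such pre-processing, here is a minimal scalar Kalman filter step. It is a sketch under simple assumptions: the true value drifts slowly (random-walk model), the measurement variance 4e-4 corresponds to a noise standard deviation of 0.02, and the process variance 1e-4 is an untuned illustrative value. The function name kalman_step is hypothetical.

```python
def kalman_step(estimate, variance, measurement,
                process_var=1e-4, measurement_var=4e-4):
    """One predict + update step of a scalar Kalman filter.

    Assumes a random-walk model; both variances are illustrative defaults.
    """
    # Predict: the value is assumed unchanged, but uncertainty grows.
    predicted_var = variance + process_var
    # Update: blend prediction and measurement by their reliability.
    gain = predicted_var / (predicted_var + measurement_var)
    new_estimate = estimate + gain * (measurement - estimate)
    new_variance = (1.0 - gain) * predicted_var
    return new_estimate, new_variance


estimate, variance = 0.0, 1.0           # vague initial guess
for z in [0.27, 0.23, 0.26, 0.24]:      # noisy readings around 0.25
    estimate, variance = kalman_step(estimate, variance, z)
    print(round(estimate, 4), round(variance, 6))
```

The filtered estimate can be appended to the observation, or can replace the raw reading, so the agent works with a cleaner signal than the sensor provides.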

When it is the best solution

  • You have control over the sensors,
  • You can pre-process the data,
  • You want stability and interpretability.

This is the dominant approach in real robotics.

Limitations

  • requires domain expertise,
  • may introduce bias,
  • is not end-to-end.

Enriched observations are the most pragmatic solution.

5.8.4 How to choose the right solution

Problem            Recommended Solution
-----------------  ----------------------
Missing speeds     Frame stacking
Noise + latency    Memory
Real sensors       Enriched observations
Simple system      Stacking
Complex system     Memory + observations

You don’t fix a POMDP by making the algorithm smarter. You fix it by giving the agent a better state.


6. Conclusion

Reinforcement Learning is not about smart algorithms. It is about what the agent sees, how correctly it sees, and how early it sees.

If the agent sees everything clearly, then we have an MDP and RL works well. If the agent sees only part of the picture, with noise and delay, then we have a POMDP and RL performance degrades sharply.

The algorithm is not “stupid”. The agent is not “weak”. The problem is the information.

The most important lesson is that you do not make RL better by choosing a more complicated algorithm. You make it better by giving the agent a better state.

That is why in practice we add memory, use frame stacking, enrich observations, filter and estimate the real state.

Not because the theory is wrong, but because the real world is not a perfect MDP.

In theory everything is an MDP. In practice almost everything is a POMDP.

About the author

About Dragos Calin

Dragos Calin is a robotics engineer and reinforcement learning practitioner focused on building real-world autonomous and remote-controlled robotics for agriculture, edge-AI robotics, and embedded platforms. His work joins simulation, machine learning, and hardware deployment, with a strong emphasis on practical, testable solutions that function outside the lab.

Areas of Expertise:

  • Reinforcement Learning for Robotics
  • Autonomous Agricultural Robots
  • Embedded Systems & Edge AI (Jetson, Raspberry Pi, Arduino)
  • Robotic Simulation & Sim2Real Workflow
  • Sensor Fusion & Control Systems
  • ROS-Based Robotics Development

© 2026 Reinforcement Learning Path
