In this tutorial, you will learn how Reinforcement Learning really works in practice, not just in theory.
By the end, you will understand:
- What a Markov Decision Process (MDP) really is. Not as a formula, but as the set of rules that make learning possible.
- Why Reinforcement Learning does not start with choosing an algorithm. And why starting with PPO, SAC, or DQN is one of the most common mistakes.
- How to correctly describe a real problem as an MDP. Including states, actions, rewards, time horizon, and termination.
- Why most real-world problems are not true MDPs, but POMDPs. And what partial observability actually means in practice.
- How missing information, noise, and latency break the Markov assumption. With concrete examples and visual learning curves.
- Why RL works perfectly in simulations but struggles in the real world. And why this is an information problem, not an algorithm problem.
- How to deliberately transform an MDP into a POMDP. By hiding variables, adding noise, and introducing delays.
- How these changes affect learning curves and agent behavior. With side-by-side comparisons and real training results.
- Why more training does not fix partial observability. And when learning fundamentally plateaus.
- How practitioners “repair” POMDPs in real systems. Using frame stacking, memory, and enriched observations.
TABLE OF CONTENTS
- Why Reinforcement Learning Starts with an MDP
- The MDP Formal Framework: The Skeleton of Any RL Problem
- Episodic vs. Continuous Tasks + Time Horizon
- MDP vs POMDP: What happens when you don’t see the whole state?
- How to map a real problem to a POMDP (example: CartPole)
- Conclusion
1. Why Reinforcement Learning Starts with an MDP

rules defined → learning possible.
If we want to teach a child to play a game, before telling them how to play better, we need to explain:
- where they are,
- what moves they are allowed to make,
- what happens after each move,
- when they win and when they lose.
The Markov Decision Process (MDP) is a sheet of rules for the game.
The goal of this chapter is to teach you how to write the rules of the environment correctly, before you choose an algorithm that will train the AI agent.
If the rules are poorly written, the AI agent cannot learn, no matter how smart the algorithm is.
After this chapter, the reader understands:
- that RL does not start with the algorithm,
- that RL starts with the correct definition of the problem,
- that MDP is the basis of all RL algorithms.
1.1 RL as a decision problem, not an algorithm problem
Suppose we want to apply RL to train an AI agent to solve a certain problem, but we have not yet specified:
- what situation the RL agent is in,
- what it is allowed to do,
- what it means to have solved the problem.
Even if there were a way to tell the RL agent what the best strategy is to solve the problem, it would not help, because the agent does not know what problem to solve.
In Reinforcement Learning, algorithms (PPO, SAC, DQN, etc.) are just learning strategies. But in RL, we do not start with the strategy.
In RL we start with:
- What problem is this?
- What decisions need to be made?
- What does a good or bad decision mean?
RL is, first and foremost, a problem of making decisions over time. It is not a problem of choosing an algorithm.
Choosing the algorithm before defining the problem is like buying a chess manual when we want to play football.
1.2 The difference between “choosing an algorithm” and “defining a problem”
Defining a problem means that we need to know:
- the state the agent is in,
- what options it has,
- what happens after each choice,
- what is considered “good” and “bad.”
Choosing the algorithm means determining how the agent learns to choose better within these rules.
If the problem is not clearly defined:
- the algorithm does not know what to optimize,
- it does not know which decision is better,
- it does not know what success means.
Therefore, the problem must be defined before choosing the algorithm, not the other way around.
1.3 What do the “rules of the game” mean in an RL problem
In any RL application there are always three things:
1.3.1 The agent
It is the “player.” The agent can be:
- a robot (physical, interacting with the real world),
- a software program (like a game bot),
- or even a neural network model (learning complex patterns, like in AlphaGo).
The agent is the one who is taught to make decisions.
1.3.2 Environment
It is the “game world.” It can be:
- a room,
- a map,
- a simulator,
- the real world.
The environment reacts to the agent’s decisions.
1.3.3 Interaction over time
The agent:
- is in a situation,
- makes a choice,
- the environment changes,
- the agent sees the result,
- the process repeats.
This is a story that happens step by step, not a single decision.
Reinforcement Learning is not about a single correct answer, but about a series of decisions that influence the future.
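The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: `CorridorEnv`, `run_episode`, and the two policies are hypothetical names, and the environment is a toy corridor, not a real benchmark. The `reset`/`step` shape loosely follows the Gymnasium convention of separate `terminated` and `truncated` flags, simplified here by leaving out the `info` dictionary.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: walk right along a corridor to reach the goal."""

    def __init__(self, length=5, max_steps=20):
        self.length = length
        self.max_steps = max_steps
        self.position = 0
        self.steps = 0

    def reset(self):
        self.position = 0
        self.steps = 0
        return self.position  # the first observation

    def step(self, action):
        # action: 0 = move left, 1 = move right
        move = 1 if action == 1 else -1
        self.position = max(0, min(self.length, self.position + move))
        self.steps += 1
        terminated = self.position == self.length  # goal reached
        truncated = self.steps >= self.max_steps   # time limit hit
        reward = 1.0 if terminated else 0.0
        return self.position, reward, terminated, truncated


def run_episode(env, policy):
    """Situation -> choice -> environment changes -> agent sees the result -> repeat."""
    observation = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = policy(observation)
        observation, reward, terminated, truncated = env.step(action)
        total_reward += reward
        done = terminated or truncated
    return total_reward, env.steps


rng = random.Random(0)


def always_right(observation):
    return 1


def random_policy(observation):
    return rng.choice([0, 1])


print(run_episode(CorridorEnv(), always_right))
print(run_episode(CorridorEnv(), random_policy))
```

Notice that the loop itself contains no learning algorithm. It only fixes who acts, what comes back, and when the story ends.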
1.4 Why an intelligent agent cannot compensate for a poorly defined problem
Suppose the agent is trained for a specific problem, but:
- sometimes it solves the problem,
- sometimes it doesn’t,
- and no one knows exactly why.
The agent may be intelligent and well-trained, but:
- it cannot learn a pattern,
- it cannot predict what comes next,
- it cannot get better.
In RL:
- if the state is poorly defined,
- if the reward is ambiguous,
- if the rules are inconsistent,
the agent cannot learn, even if the algorithm is very advanced. When an AI agent fails, most of the time the algorithm is not the problem. The problem is how the “game” (the problem) was defined.
1.5 What MDP promises and what it does not promise
MDP promises that:
- the problem can be described clearly,
- decisions can be modeled logically,
- the agent has a framework in which it can learn.
MDP says that this problem has rules that are clear enough for an agent to try to learn.
What MDP does not promise:
- MDP is not a solution. It does not tell you what the best decision is.
- MDP is not an algorithm. It does not learn anything on its own.
- MDP does not guarantee success. It only says that the problem is well-formulated, not that it is easy to solve.
In other words, MDP is the “chessboard” + the rules. MDP is not the winning strategy.
1.6 The correct order in RL engineering
In any RL project, the correct order is:
1. Define the problem: The definition assumes that we know what we want to happen.
2. Build the MDP: At this step we establish what the rules of the game are.
3. Define the reward: The definition assumes that we establish what “good” and “bad” mean.
4. Choose the algorithm: At this step we choose the algorithm depending on how and what we want the agent to learn.
This order is:
- used in research,
- used in robotics,
- used in industry.
2. The MDP Formal Framework: The Skeleton of Any RL Problem

If in Chapter 1 we learned that we need rules, in this chapter we learn what correctly and completely written rules look like.
Like the skeleton of the human body:
- you can’t see the muscles,
- you can’t see the skin,
- but without a skeleton, the body can’t stand.
MDP is the skeleton of any RL problem.
The goal of this chapter is to show you:
- what pieces the MDP is made of (states, actions, rewards),
- why all RL algorithms use the same basic rules,
- how algorithms “think” through Bellman equations.
After this chapter, the reader will understand:
- what RL actually optimizes,
- why all algorithms solve the same problem, just in different ways,
- how to read any RL algorithm and see the MDP behind it.
2.1 Formal definition of MDP (S, A, P, R, γ)
A problem is made up of “visible” elements, but it also has parts that are not immediately visible.
In order to see a problem correctly and completely, we need to know exactly:
- the state of the problem,
- what the agent is allowed to do,
- what happens after the agent does something,
- whether it was a good or bad decision,
- how much the agent should take into account what happens later.
An MDP is exactly this complete list of things, written down clearly.
- S (State) – the state of the environment at this moment
- A (Action) – what the agent can do
- P (Transition) – what happens after the agent makes a certain decision
- R (Reward) – how good or bad the action was
- γ (Gamma) – how much the future matters
In other words, the MDP is like a “game manual”:
- S = the positions on the board
- A = the allowed moves
- P = the rules for what each move does to the board
- R = the score
- γ = whether you play only for the next move or for the end of the game
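To make the five pieces concrete, here is a minimal sketch of an MDP written down as plain Python data. The `MDP` class, its `validate` method, and the two-state battery example are hypothetical and exist only for illustration. For simplicity, the reward here depends only on the state and the action.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MDP:
    states: tuple       # S: every situation the environment can be in
    actions: tuple      # A: what the agent is allowed to do
    transitions: dict   # P: (state, action) -> {next_state: probability}
    rewards: dict       # R: (state, action) -> immediate reward
    gamma: float        # how much the future matters, between 0 and 1

    def validate(self):
        """Check that the rules are complete and consistent."""
        for state in self.states:
            for action in self.actions:
                outcomes = self.transitions.get((state, action))
                if outcomes is None or (state, action) not in self.rewards:
                    raise ValueError(f"missing rule for {(state, action)}")
                if abs(sum(outcomes.values()) - 1.0) > 1e-9:
                    raise ValueError(f"probabilities for {(state, action)} must sum to 1")
                if any(next_state not in self.states for next_state in outcomes):
                    raise ValueError(f"unknown next state after {(state, action)}")
        if not 0.0 <= self.gamma <= 1.0:
            raise ValueError("gamma must be between 0 and 1")
        return True


# Hypothetical example: a robot that can work or recharge.
toy_mdp = MDP(
    states=("low_battery", "high_battery"),
    actions=("recharge", "work"),
    transitions={
        ("high_battery", "work"): {"high_battery": 0.7, "low_battery": 0.3},
        ("high_battery", "recharge"): {"high_battery": 1.0},
        ("low_battery", "work"): {"low_battery": 1.0},
        ("low_battery", "recharge"): {"high_battery": 1.0},
    },
    rewards={
        ("high_battery", "work"): 2.0,
        ("high_battery", "recharge"): 0.0,
        ("low_battery", "work"): -1.0,
        ("low_battery", "recharge"): 0.0,
    },
    gamma=0.95,
)
print(toy_mdp.validate())
```

If `validate` raises an error, the "game manual" has a missing or contradictory rule, and no algorithm can be expected to fix that.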
2.2 What does “Markov” mean in concrete terms
“Markov” is a complicated word for a very simple idea: if the agent knows exactly where it is in the environment, it no longer needs to know all the past.
For example, if you know the exact position of a car and its speed, it no longer matters where it was 10 seconds ago. You can make a decision about what to do next based on current information alone.
It’s like a game:
- if you see the board exactly as it is now,
- you don’t need to know every previous move,
- you can decide the next move correctly.
A good state is like an intelligent summary of the past. If the summary is correct, the past is no longer needed.
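The car example can be checked with a small sketch. The functions `step_car` and `rollout` and the simplified motion model are assumptions made for illustration. Two cars with different histories arrive at the same state (position and velocity), and from that point on their futures are identical.

```python
def step_car(position, velocity, acceleration, dt=1.0):
    """Next (position, velocity), computed from the current state and action only."""
    new_position = position + velocity * dt
    new_velocity = velocity + acceleration * dt
    return new_position, new_velocity


def rollout(state, accelerations):
    """Apply a sequence of actions, one step at a time."""
    for acceleration in accelerations:
        state = step_car(*state, acceleration)
    return state


# History A: cruising at constant speed for five steps.
state_a = rollout((0.0, 2.0), [0.0] * 5)
# History B: starting elsewhere, accelerating once, then cruising.
state_b = rollout((7.0, 1.0), [1.0, 0.0])
print(state_a, state_b)

# Same current state, same future actions -> same future, whatever the past was.
future_actions = [0.5, -0.5, 0.0]
print(rollout(state_a, future_actions) == rollout(state_b, future_actions))
```

If the state contained only the position, the two cars could look identical while moving at different speeds, and the summary would no longer be sufficient.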
2.3 Bellman as an idea, not a mathematical equation
The Bellman equation says that the value of the agent’s current position in the environment depends on:
- what you get right away,
- and how good the position you end up in afterward is.
In other words, if you make a good move now, but end up in a bad position, the move wasn’t that good.
It’s like saying:
- “It’s not just your test score that matters today,
- it also matters whether it helps you pass the class.”
This idea is the basis of RL. It is why RL is described as recursive optimization: the value of each decision depends on the decisions that follow it.
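Here is a minimal sketch of that idea on a hypothetical four-state problem. The states, actions, rewards, and the names `MODEL`, `q_value`, and `value_iteration` are all made up for illustration. The "greedy" move pays +1 immediately but leads to a bad position, while the "patient" move pays nothing now but leads to a good one.

```python
GAMMA = 0.9

# (state, action) -> (next_state, immediate_reward); deterministic for simplicity.
MODEL = {
    ("start", "greedy"): ("bad", 1.0),
    ("start", "patient"): ("good", 0.0),
    ("good", "go"): ("end", 5.0),
    ("bad", "go"): ("end", -5.0),
}
STATES = ("start", "good", "bad", "end")


def q_value(values, state, action):
    """Bellman idea: what you get now + discounted value of where you land."""
    next_state, reward = MODEL[(state, action)]
    return reward + GAMMA * values[next_state]


def value_iteration(sweeps=10):
    values = {state: 0.0 for state in STATES}
    for _ in range(sweeps):
        for state in STATES:
            options = [q_value(values, s, a) for (s, a) in MODEL if s == state]
            if options:  # "end" is terminal: no actions, value stays 0
                values[state] = max(options)
    return values


V = value_iteration()
print(q_value(V, "start", "greedy"), q_value(V, "start", "patient"))
```

The immediate reward favors "greedy", but once the value of the next position is included, "patient" comes out ahead.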
INFO: For more details about the Bellman Equation, you can visit the page: Bellman Equation
2.4 Value functions (V and Q) as tools, not mathematical formulas
The agent does not know from the start what is good or bad. It needs a “map” that tells it:
- how good a place is (V),
- how good an action is in a place (Q).
- V(s) says: “How good is it to be here?”
- Q(s, a) says: “How good is it to do this here?”
It is like a video game:
- V = how safe is an area of the map.
- Q = what move is best in that area.
These functions are not the end goal. They are tools that help the agent make better decisions.
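As a rough sketch, a Q "map" for a tiny game can be stored as a dictionary, and V can be read off from it. The places, moves, numbers, and helper names below are hypothetical. The relation V(s) = max over a of Q(s, a) used here assumes the agent always picks the best-rated action.

```python
# Hypothetical Q-table: Q[(place, move)] = estimated long-term score.
Q = {
    ("bridge", "run"): 4.0,
    ("bridge", "wait"): -2.0,
    ("cave", "run"): 1.0,
    ("cave", "wait"): 3.0,
}


def action_values(q_table, place):
    """All moves available in a place, with their Q-values."""
    return {move: value for (p, move), value in q_table.items() if p == place}


def state_value(q_table, place):
    """V(s) for an agent that always takes the best-rated move."""
    return max(action_values(q_table, place).values())


def best_action(q_table, place):
    values = action_values(q_table, place)
    return max(values, key=values.get)


for place in ("bridge", "cave"):
    print(place, state_value(Q, place), best_action(Q, place))
```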
2.5 Policy as the central object of RL
A policy is the answer to the question: “What do I do now, if I am here?”
The policy is the rule according to which the agent chooses actions. It is the final behavior of the agent.
A policy can be:
- deterministic – always do the same thing in the same state,
- stochastic – choose among actions with some probability, so other options are also tried.
As an analogy, a policy is like your “style of play”:
- you always play defensively,
- or sometimes take risks for a bigger win.
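The two styles can be sketched as two small functions. Epsilon-greedy is used here as one common example of a stochastic policy. It is an illustrative choice, and the action names and values are hypothetical.

```python
import random


def deterministic_policy(q_values):
    """Always pick the action with the highest estimated value."""
    return max(q_values, key=q_values.get)


def stochastic_policy(q_values, epsilon, rng):
    """Epsilon-greedy: with probability epsilon pick a random action, else the best one."""
    if rng.random() < epsilon:
        return rng.choice(sorted(q_values))
    return deterministic_policy(q_values)


q_here = {"defend": 1.0, "attack": 0.4}
rng = random.Random(42)
choices = [stochastic_policy(q_here, 0.3, rng) for _ in range(1000)]

print(deterministic_policy(q_here))
print(choices.count("attack") / len(choices))
```

The deterministic policy never attacks in this state. The stochastic one attacks occasionally, which is how it can discover that an underrated action is actually better.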
2.6 The connection between MDP and classes of algorithms
All RL algorithms play the “same game” (the same MDP), but the difference is that they use different strategies to learn.
There are three main families:
- Value-based – learn how good things are
- Policy-based – learn directly what to do
- Actor-Critic – combine both ideas
The difference between algorithms is not “what problem they solve,” but “how they try to solve the same problem.”
3. Episodic vs. Continuous Tasks + Time Horizon

An MDP is not completely defined by (S, A, P, R, γ) alone. One essential dimension is missing: how time flows.
Some applications we want to apply RL to:
- have a beginning and an end,
- or never end.
RL works differently in these two cases.
The goal of this chapter is to help you understand:
- whether the problem is a game with an ending or a game without one,
- how much the agent should care about the future,
- why sometimes the agent behaves well in the short term but badly in the long term.
After this chapter, the reader understands:
- what episodic vs. continuous means,
- how gamma influences the agent’s behavior,
- why choosing the wrong horizon can completely ruin learning.
3.1 The difference between “ending” and “never ending”
There are applications that:
- end (the agent wins or loses),
- never end (the agent is in control all the time).
In Reinforcement Learning:
- an application that ends is called episodic,
- an application that does not end is called continuing (this tutorial uses “continuous” in the same sense).
Examples
- CartPole: the game ends when the pole falls. This is an episodic task.
- A walking robot: there is no “game over”; it must keep walking well indefinitely. This is a continuous task.
You must decide from the start which kind of task you have, because the rules are different.
3.2 What is an episode and why does it exist
An episode is a complete round of control, from start to finish.
At the end of an episode the environment is reset, after which the agent starts again.
Why do we need episodes?
- to measure progress,
- to compare performance,
- to stop dangerous or unnecessary situations.
3.3 Terminal states
A terminal state is a point at which the episode stops. It’s like when you “lost” or “won.”
Some problems need these clear stops. Others don’t.
3.4 Time horizon as a design decision
Time horizon means “How far into the future do we look when making a decision?”
- Short horizon: only what happens immediately matters.
- Long horizon: what happens much later matters.
3.4.1 Why is the time horizon not arbitrary?
If chosen incorrectly:
- the agent can learn bad tricks,
- ignore important consequences,
- or become unstable.
The time horizon should match “the nature of the problem”, not your preferences.
3.5 Gamma as a way of thinking about the future
The choice of Discount Factor (Gamma | γ) is related to the answer to the question: “How much do I care about what happens later?”
3.5.1 Small gamma: the agent only thinks about “now.”
A small γ suits applications with a short-term focus, with gamma roughly between 0.1 and 0.5. These are reactive applications with a short horizon, where the distant future is uncertain or irrelevant. The agent “reacts quickly.” Applications such as:
- fast arcade games (Pong, Breakout): Reactions in seconds, focus on immediate score.
- industrial control (balancing small robots, like CartPole with short times): Instant adjustments, without complex plans.
- alert systems (real-time fraud detection): Immediate penalties for errors, not over long sessions.
3.5.2 Large gamma: the agent thinks about “later.”
A large γ suits applications with a long-term focus, with gamma roughly between 0.95 and 0.99. These are applications with a long horizon, where a bad decision today affects tomorrow. The agent “thinks strategically.”
Applications such as:
- strategy games (chess, Go): Plan moves over dozens of rounds.
- autonomous navigation (self-driving cars): Trajectories in minutes/hours, avoiding distant collisions.
- financial optimization (investment portfolios): Maximizing profit in years, not days.
Gamma completely changes the “personality” of the agent.
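A short sketch makes the effect of gamma visible. The reward sequence below is hypothetical: nothing for 49 steps, then a large payoff. The function names are chosen for illustration, and 1 / (1 - γ) is a common rule of thumb for how many steps ahead the agent effectively looks.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where a reward k steps ahead is weighted by gamma**k."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def effective_horizon(gamma):
    """Rule of thumb: rewards beyond about 1 / (1 - gamma) steps barely count."""
    return 1.0 / (1.0 - gamma)


# Hypothetical reward sequence: a big payoff that arrives only at step 50.
delayed_payoff = [0.0] * 49 + [100.0]

for gamma in (0.5, 0.95, 0.99):
    horizon = round(effective_horizon(gamma))
    value_now = discounted_return(delayed_payoff, gamma)
    print(f"gamma={gamma}: horizon ~{horizon} steps, payoff is worth {value_now:.4f} now")
```

With a small gamma the delayed payoff is practically invisible to the agent, so it has no reason to work toward it. With a large gamma the same payoff still carries most of its weight.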
INFO: To further understand the role of discount factor in RL, I recommend you to read this tutorial: Discount Factor (gamma) Explained With Q-Learning + CartPole
3.6 What happens when you choose wrong
If:
- the episode is too short,
- the horizon is poorly chosen,
- the gamma does not fit,
then problems arise such as:
- instability – the agent behaves chaotically,
- slow learning – it seems like it is not progressing,
- bad but stable policies – the agent always does the same wrong thing.
These problems do not come from the algorithm. They come from the way the MDP components are defined.
3.7 Comparative examples
CartPole (episodic):
- has a beginning and an end,
- gamma can be large,
- the goal is clear: don’t let the pole fall.
Robot that walks (continuous):
- has no “end”,
- must always be stable,
- gamma must be chosen carefully for control.
The same RL method behaves differently in the two cases.
4. MDP vs POMDP: What happens when you don’t see the whole state?

On the right you see only part –> you have to guess.
Imagine playing a game where:
- there is fog on the screen,
- we only see part of the map,
- some information is hidden or delayed.
Such a game is no longer fair and clear, even if the rules are the same.
This is what happens in real life:
- AI agents don’t see everything,
- sensors make mistakes,
- information comes with a delay.
This is where POMDP comes in.
The goal of this chapter is to show:
- why almost all real problems are POMDP,
- why the Markov hypothesis breaks down in practice,
- what problems arise when the agent doesn’t see everything.
After this chapter, the reader understands:
- why RL goes smoothly in perfect simulations,
- why it becomes difficult in the real world,
- why POMDP is the real model of the world.
First of all, I will explain some concepts as clearly as possible. These explanations will help us move more easily from theory to practice. From MDP to POMDP.
Environment = everything that is not the agent. It is not the algorithm, not the network, and not the observation.
Concrete example of what the environment is in the case of CartPole:
Environment = physical system:
- cart
- pole
- gravity
- equations of motion
State space = positions, velocities, angles.
Observation = what Gymnasium gives as input to the agent.
State space = all possible real states of the environment. A state is the minimal complete information about the world that makes the future independent of the past.
Observation space = the state information the agent receives. It is only what the agent sees in that state through its sensors, possibly incomplete or noisy.
Markov property = the actual state contains all the information needed to predict the future. A POMDP arises when the observation does NOT contain enough information to be Markov.
4.1 Why complete observability is a fragile assumption
MDP assumes that:
- the agent sees “everything that matters” about the situation they are in,
- nothing important is hidden,
- the information is correct and timely.
It’s like playing a game where:
- you see the whole map,
- you see all the pieces,
- there are no surprises.
In reality, this rarely happens. Why? Because the real world is:
- big,
- complex,
- imperfectly observable.
MDP is a convenient assumption, not a faithful description of reality.
4.2 What is partial observability, intuitively
In a POMDP, the agent does not see the complete real state; it sees only an observation.
4.2.1 Real state vs observation
- Real state: how the world really is.
- Observation: what the agent manages to see.
In other words, it is like driving a car at night in fog: you only see what the headlights illuminate, but the road continues beyond what you see.
The agent has to make decisions without seeing everything.
4.3 Real Sources of POMDP
Partial observability occurs for very concrete reasons:
4.3.1 Noise
- sensors are not perfect,
- measurements are approximate.
4.3.2 Latency
- information arrives later,
- decisions are based on old data.
4.3.3 Discrete Sampling
- the world is constantly changing,
- the agent only sees it from time to time.
4.3.4 Hidden Variables
- some things cannot be measured directly,
- for example: internal forces, intentions, hidden states.
All of these make the agent not see the complete state.
4.4 Consequences for RL
When the agent does not see everything, it can make good decisions at one time, but bad decisions at another, even from the same observation.
This leads to:
- unstable policies – behavior changes a lot,
- inconsistent learning – the agent seems to learn and then “forget,”
- need for memory – the agent must remember what it saw before.
If we tried to play chess but were shown the board for only a second after each move, we would most likely make bad decisions.
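The simplest form of memory is frame stacking: instead of only the latest observation, the agent receives the last few observations together. The sketch below is a minimal illustration. `FrameStack` and `estimate_velocities` are hypothetical names, not a library API. It shows that two consecutive position readings are enough to recover an approximate velocity that a single reading cannot provide.

```python
from collections import deque


class FrameStack:
    """Keep the last `num_frames` observations and hand them out together."""

    def __init__(self, num_frames):
        self.frames = deque(maxlen=num_frames)

    def reset(self, first_observation):
        # At the start of an episode there is no past, so repeat the first frame.
        self.frames.clear()
        self.frames.extend([first_observation] * self.frames.maxlen)
        return tuple(self.frames)

    def push(self, observation):
        # The oldest frame drops out automatically once the deque is full.
        self.frames.append(observation)
        return tuple(self.frames)


def estimate_velocities(stacked, dt):
    """Finite difference between the two most recent frames."""
    previous, current = stacked[-2], stacked[-1]
    return tuple((c - p) / dt for p, c in zip(previous, current))


stack = FrameStack(num_frames=2)
stack.reset((0.0, 0.0))            # (position, angle), no velocities observed
stacked = stack.push((0.5, 1.0))   # next reading, half a time unit later
print(stacked)
print(estimate_velocities(stacked, dt=0.5))
```

The agent does not have to compute this difference explicitly. Given the stacked frames as input, a neural network can learn to extract the same information.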
4.5 What is a POMDP?
A POMDP is a decision problem in which the agent does not see everything, but must guess the real state from observations.
POMDP is the correct way to think about the real world, because the real world is incompletely observable.
MDP is a useful simplification. POMDP is reality.
4.6 Why RL “works” in perfect simulations and “breaks” in reality
In simulations there is no noise by default. There is no latency. The agent sees the perfect state.
In reality the information is incomplete. Decisions are based on estimates, while errors accumulate.
This is why an agent trained in a perfect MDP can fail completely in the real world.
5. How to map a real problem to a POMDP (example: CartPole)
In this chapter, we use the same learning environment twice:
- In the first version, the environment is an MDP, meaning the agent sees everything.
- In the second version, the agent sees only part of the state (a POMDP).
Comparing the two is the best way to understand the difference.
The goal of this chapter is to teach you:
- what a perfect problem looks like (CartPole MDP),
- what the same problem looks like, but realistic (CartPole POMDP),
- how the agent’s behavior changes,
- how we can fix the lack of information through stacking, memory or better observations.
After this chapter, the reader will know how to apply RL in practice, not just in theory.
5.1 CartPole as an ideal MDP
CartPole is one of the most used environments in Reinforcement Learning not because it is realistic, but because it is almost a perfect MDP.
This makes it ideal for learning the fundamentals of RL.
5.2 What the agent sees in CartPole
In CartPole, the agent receives at each step the complete state of the system, in the form of a numerical vector.
In the classical form, the state is:
- the position of the cart on the X axis;
- the speed of the cart;
- the angle of the pole with respect to the vertical;
- the angular velocity of the pole;
These four values completely describe the physical situation of the system at that moment.
The agent:
- knows exactly where the cart is,
- knows exactly how fast it is moving,
- knows exactly how much the pole is tilted,
- knows whether the pole is falling faster or slower.
Nothing relevant is hidden.
5.2.1 Why CartPole is completely observable
A system is completely observable when the observed state, together with the action that will be applied, contains all the information needed to predict the future.
In CartPole:
- the dynamics are deterministic (or nearly deterministic),
- there are no hidden internal variables,
- there are no noisy sensors,
- there is no latency between measurement and action.
If the agent sees the same state and applies the same action, the future evolution will be the same (within numerical errors).
5.2.2 Connection with the Markov property
CartPole perfectly respects the Markov property, namely that the future depends only on the current state and the current action, not on the complete history.
Therefore CartPole is a valid MDP, not a POMDP.
5.3 Why the past doesn’t matter anymore in CartPole
A very important point for understanding RL is that in CartPole:
- you don’t need to know what happened 10 steps ago,
- you don’t need to memorize the history,
- you don’t need Long Short-Term Memory (LSTM) or other forms of memory.
This is because the position and velocity are a complete summary of the cart’s motion, and the angle and angular velocity are a complete summary of the pole’s motion.
It’s like seeing the exact position and velocity of a ball. It doesn’t matter how it was thrown anymore. You can calculate exactly where it will end up.
5.4 Why CartPole is an “easy” problem for RL
CartPole is considered “easy” for several fundamental reasons:
5.4.1 The state space is small
CartPole has only 4 dimensions with well-scaled values and no redundancy.
5.4.2 The action space is discrete and small
- only two actions: left / right,
- there is no fine, continuous control.
5.4.3 The reward is dense and clear
- the agent receives +1 for each step in which the pole does not fall,
- the goal is simple: maintain balance as long as possible.
5.4.4 There is no partial observability
- the agent sees everything that matters,
- it does not have to guess anything.
5.4.5 The episode is clearly defined
- it starts in a standard initial state,
- it ends when the pole falls or the cart goes out of bounds.
All this makes RL algorithms learn quickly and stably.
5.5 Interpretation of the Graph: CartPole in MDP Mode (fully observable) and Demo Results

The graph represents the evolution of rollout/ep_rew_mean over time, i.e.:
- the average reward per episode,
- measured over successive training windows,
- as a function of the total number of steps (timesteps).
In CartPole, the reward is +1 for each step in which the pole does not fall, so ep_rew_mean equals the average episode length in steps.
A score of:
- ~200 –> the agent maintains balance for ~200 steps,
- ~500 –> maximum-length episodes (CartPole-v1 cuts every episode off at 500 steps).
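As a rough sketch of how such a curve is produced, the class below keeps a rolling average of total episode reward. `EpisodeRewardTracker` is a hypothetical name, and the window of 100 episodes is an assumption made for illustration. Check your training library's settings for the actual value.

```python
from collections import deque


class EpisodeRewardTracker:
    """Rolling mean of total reward over the most recent episodes."""

    def __init__(self, window=100):
        self.recent_returns = deque(maxlen=window)

    def add_episode(self, step_rewards):
        # In CartPole every step gives +1, so the sum equals the episode length.
        self.recent_returns.append(sum(step_rewards))

    def mean(self):
        return sum(self.recent_returns) / len(self.recent_returns)


tracker = EpisodeRewardTracker(window=100)
for episode_length in (20, 35, 200, 500):
    tracker.add_episode([1.0] * episode_length)
print(tracker.mean())
```

Because the value is an average over a window, a single lucky or unlucky episode moves the curve only slightly. That is why the trend matters more than individual spikes.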
Graphic conclusion
This graph clearly demonstrates that:
- CartPole is an excellent MDP,
- the state is fully observable,
- no memory or history is needed.
DQN works well when the Markov assumption is respected. It learns stably, and converges to a near-optimal policy.
Oscillations do not indicate a modeling problem. They are normal in DQN and arise from exploration and off-policy updates.
This is the correct baseline. Any subsequent degradation (noise, delay) can be directly compared to this behavior.
Demo Results — CartPole as an Ideal MDP

In the evaluation phase, the trained agent achieves a reward of 500 in two of three demo episodes.
This means that:
- the pole remains balanced for the maximum possible duration,
- the policy behaves consistently across episodes,
- no instability or degradation appears during execution.
Because CartPole is a fully observable MDP, the agent does not rely on memory or past observations. Each decision is made solely based on the current state, which already contains all the information required for optimal control.
The perfect score across multiple episodes confirms that:
- the learning process has converged,
- the learned policy is stable,
- the environment formulation as an MDP is correct.
This result serves as a clean baseline for later comparison with partially observable versions of the same task.
5.6 Deliberately transforming an MDP into a POMDP
Up until now, our agent has been “playing a game” perfectly correctly:
- seeing everything that happens,
- seeing the correct information,
- seeing the information in time.
This is an ideal MDP. In the real world, the “game” is not like that.
To understand what happens in practice, we intentionally spoil the environment, step by step.
Not because we want to cheat, but because we want to demonstrate what the “real world looks like.” This “controlled alteration” is called transforming into a POMDP.
5.6.1 Removing some variables
At this step, the agent no longer sees everything.
It’s like playing a game where:
- you see where the ball is,
- but you don’t see how fast it’s moving.
You can guess what’s coming next, but you don’t know for sure.
This is called partial observability.
What we do specifically
In CartPole, before, the agent saw:
- position,
- speed,
- angle,
- angular velocity.
Now we hide some of this information from it, for example the velocity.
The agent:
- only sees where it is,
- no longer sees how it moves.
What changes
The same observation can come from:
- a slow movement,
- or a fast movement.
The agent no longer knows exactly what is coming.
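A minimal sketch of this masking step is shown below. It assumes that both velocity components are hidden and that the state is ordered as (cart position, cart velocity, pole angle, pole angular velocity). The function name and the example numbers are hypothetical.

```python
# Indices kept in the observation: cart position and pole angle.
VISIBLE_INDICES = (0, 2)


def hide_velocities(full_state):
    """Return only the entries of the state that the agent is allowed to see."""
    return tuple(full_state[i] for i in VISIBLE_INDICES)


# Two different real situations: the pole is tilted equally,
# but in one it is barely moving and in the other it is falling fast.
falling_slowly = (0.0, 0.0, 0.05, 0.1)
falling_fast = (0.0, 0.0, 0.05, 2.0)

print(hide_velocities(falling_slowly))
print(hide_velocities(falling_fast))
print(hide_velocities(falling_slowly) == hide_velocities(falling_fast))
```

The two situations call for different reactions, yet the agent receives exactly the same input for both. No amount of training can separate cases that look identical.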
Difference between the two graphs

Orange (MDP): “I know exactly what is happening now and what is coming next.”
Blue (POMDP): “I see something, but I don’t know if it is dangerous or not.”
Therefore:
- orange goes up,
- blue gets stuck at the bottom.
Why is the blue line not going up?
Very important:
- It is not the algorithm’s fault,
- It is not the fault of the number of training steps.
The problem is that:
- missing information cannot be invented,
- without velocity, the future cannot be predicted correctly.
It is like:
- driving a car,
- you see the road,
- but you don’t see the speed on the dashboard.
Lesson of this graph
If the agent doesn’t see everything that matters, learning stops early.
MDP (orange):
- fast learning,
- stable,
- almost perfect.
POMDP (blue):
- slow learning,
- limited,
- with a clear ceiling.
When velocity is hidden, the agent can no longer predict the future correctly. Even with long training, performance remains limited, not because the algorithm is weak, but because the information is incomplete.
Demo Results — CartPole with Missing Variables (Hidden Velocity)

In the demo phase, the agent achieves rewards of 44, 46, and 32 across three episodes.
This behavior is expected. Because velocity information is hidden, the agent no longer has access to the full state of the system.
Different real situations can look identical from the agent’s point of view, making precise control impossible.
As a result:
- the policy remains unstable,
- performance varies significantly between episodes,
- the agent can no longer consistently maintain balance.
This demonstrates that:
- the limitation does not come from the algorithm,
- but from missing information in the observation.
The agent can still act, but it must guess the future instead of predicting it.
5.6.2 Adding noise
The agent sees what is happening, but sees it wrong. To imagine what the agent sees, we can imagine that:
- you look through a dirty window,
- you see the position of things,
- but not very clearly.
Sometimes you see it right, sometimes you see it wrong.
What we do specifically in the Python program:
Instead of giving the agent:
position = 0.25
we give it:
position = 0.25 + noise
This noise:
- is small,
- random,
- different from step to step.
What changes
The agent:
- no longer receives the same observation for the same real state,
- cannot be sure what is really happening.
Even if the world is the same, the information is imperfect.
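The noise step can be sketched as follows. Gaussian noise and the standard deviation of 0.05 are assumptions chosen for illustration, and `add_sensor_noise` is a hypothetical name.

```python
import random


def add_sensor_noise(observation, sigma, rng):
    """Return a copy of the observation with independent Gaussian noise on each value."""
    return tuple(value + rng.gauss(0.0, sigma) for value in observation)


rng = random.Random(0)
true_state = (0.25, 0.0, 0.02, 0.0)

# The world has not changed between the two readings, but the agent's input has.
first_reading = add_sensor_noise(true_state, 0.05, rng)
second_reading = add_sensor_noise(true_state, 0.05, rng)
print(first_reading)
print(second_reading)
```

From the agent's side, a change between two readings may be real motion or just noise, and a single observation cannot tell the two apart.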
Difference between the two graphs
If we looked at the two graphs like two children trying to balance a pole, we would have:
The orange child (MDP) sees the pole clearly, without any problems.
The red child (POMDP with noise) sees the pole, but the image shakes a little all the time.
They both train equally hard.
What the ORANGE line does (MDP, without noise)
The orange agent:
- sees exactly where the pole is,
- learns better and better,
- gets to the very top at the end.
Everything is clear and stable.
What the RED line does (noise)
The red agent:
- sees the pole,
- but the position moves a little with each blink,
- doesn’t know if the pole is really moving or if it’s just the image.
MDP (without noise): “What I see is real.”
POMDP (with noise): “What I see might be a little wrong.”
That little mistake:
- adds up,
- produces wrong moves,
- breaks stability.
What does this graph teach us? Even a little noise makes RL much harder.
The agent:
- can still learn,
- but slower,
- more unstable,
- and with a clear ceiling.
The connection to the real world
In real life:
- sensors are not perfect,
- measurements are noisy,
- data is never “clean.”
This graph shows why RL works perfectly in simulations, but it requires a lot of care in reality.
Demo Results — CartPole with Noisy Observations

In the demo, the agent gets rewards of 185, 132, and 127.
This means:
- sometimes the agent manages to keep the pole balanced for a while,
- sometimes it fails much earlier,
- the result changes from one episode to another.
Why does this happen?
Because the agent sees the world with noise. What it observes is a little bit wrong every time, so it sometimes reacts correctly and sometimes reacts too late or too much.
The agent is not bad or broken. It is doing its best with imperfect information.
The noise is part of the environment, not part of the model. To evaluate the policy correctly, the same observation noise must be applied during both training and testing.
5.6.3 Introducing latency
At this step, the agent sees the past, not the present.
This is the case where the agent sees what happened a second ago but has to decide what to do now. It is like playing a game with lag.
What we do specifically
Instead of giving the agent the current state, we give it the state from 1-2 steps ago. The agent reacts to a world that no longer exists at that exact moment.
What changes
Even if the observation is correct, it is old. The agent has to guess what happened in the meantime.
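The effect of a fixed delay can be sketched without any RL library. The snippet below is a minimal illustration, not the wrapper used later in the script: the function name `delayed_stream` and the sample angles are made up for the example. It pre-fills the buffer with the first observation, so at the start of an episode the agent simply keeps seeing the initial state.

```python
from collections import deque


def delayed_stream(observations, delay_steps):
    """Yield, at every step, the observation from `delay_steps` steps earlier."""
    buffer = deque(maxlen=delay_steps + 1)
    for obs in observations:
        if not buffer:
            # No past exists yet: fill the whole buffer with the first observation.
            buffer.extend([obs] * (delay_steps + 1))
        else:
            # The oldest entry drops out automatically because of maxlen.
            buffer.append(obs)
        yield buffer[0]


# Hypothetical pole angles (radians) over five steps.
true_angles = [0.00, 0.01, 0.03, 0.06, 0.10]
seen_angles = list(delayed_stream(true_angles, delay_steps=2))

for t, (true, seen) in enumerate(zip(true_angles, seen_angles)):
    print(f"t={t}  true={true:.2f}  seen={seen:.2f}")
```

By the last step the pole is at 0.10 rad while the agent is still looking at 0.03 rad, which is the gap it has to guess across.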
Difference between the two graphs

Imagine trying to catch a ball, but you see where the ball was a second ago, not where it is now.
Orange line (MDP): sees everything on time -> reacts correctly -> learns well.
Blue line (latency): sees the past -> reacts too late -> makes wrong moves.
Therefore:
- the blue line rises a little,
- then falls,
- then oscillates,
- but never reaches the top.
When the agent sees the world too late, it reacts to the past instead of the present, so learning becomes unstable and limited.
Demo Results — CartPole with Latency

In the demo, the agent gets rewards of 12, 28, and 88.
This means:
- sometimes the agent fails almost immediately,
- sometimes it manages to balance the pole a little longer,
- but the result changes a lot from one episode to another.
Why does this happen?
Because the agent sees the world too late. It reacts to where the pole was, not where it is now.
So:
- sometimes it moves too late,
- sometimes it moves too much,
- sometimes it gets lucky for a few steps.
5.6.4 Fully Realistic POMDP

In this setting, the agent faces all the problems at the same time:
- it cannot see everything (missing variables),
- what it sees is a little wrong (noise),
- and what it sees arrives too late (latency).
It is like:
- playing a game,
- where part of the screen is hidden,
- the image is blurry,
- and the screen is delayed.
The agent is not bad or broken. It is simply trying to make decisions with very limited and imperfect information.
When information is missing, noisy, and delayed at the same time, learning almost stops, no matter how long the agent is trained.
Not because:
- the agent is stupid,
- the algorithm is wrong,
- it doesn’t have enough steps.
But because it no longer knows what’s really going on.
Why is this graph so important?
It shows why RL can work almost perfectly in simulation and still fail in reality, and why the problem is not the algorithm but the information.
Demo Results — Fully Realistic POMDP

In the demo, the agent gets rewards of 10, 12, and 17.
This means:
- the pole falls almost immediately,
- the agent barely has time to react,
- performance is very low and similar in all episodes.
Why does this happen?
Because the agent:
- does not see everything (missing velocity),
- sees wrong information (noise),
- sees it too late (latency).
It is like trying to balance a pole with one eye closed while the image is blurry and everything you see is already from the past.
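To get a feel for how the three problems compound, consider one plausible way the agent could recover the hidden angular velocity: differencing two consecutive noisy angle readings. The sketch below is a back-of-the-envelope calculation, not something the script computes. It assumes the noise level 0.02 used in the example commands, CartPole's 0.02 s simulation step, and independent noise samples; the helper name is illustrative.

```python
import math

TAU = 0.02        # CartPole's simulation step, in seconds
NOISE_STD = 0.02  # value passed to --obs_noise_std in the example commands


def finite_difference_noise_std(noise_std, dt):
    """Noise std of (theta_t - theta_{t-1}) / dt when each reading has std `noise_std`.

    The two noise samples are assumed independent, so their variances add:
    std of the difference = sqrt(2) * noise_std, then divide by dt.
    """
    return math.sqrt(2) * noise_std / dt


velocity_noise_std = finite_difference_noise_std(NOISE_STD, TAU)
print(f"noise on the estimated angular velocity: {velocity_noise_std:.2f} rad/s")
```

Under these assumptions the velocity estimate carries noise of roughly 1.4 rad/s, while the episode ends as soon as the pole angle leaves a range of about ±0.21 rad. A small error on the angle becomes a large error on the quantity the agent has to infer.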
This result is not a failure. It shows that:
- the algorithm is not the problem,
- more training would not fix it,
- the agent simply does not have enough reliable information to act correctly.
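The three demo runs reported above can be put side by side with a few lines of Python. This is only a summary of the numbers already quoted in the text. Three episodes per setting is a very small sample, so the means and standard deviations should be read as an illustration, not as a benchmark.

```python
from statistics import mean, stdev

# Episode rewards quoted in the three "Demo Results" sections above.
demo_rewards = {
    "noise": [185, 132, 127],
    "latency": [12, 28, 88],
    "hidden + noise + latency": [10, 12, 17],
}

for setting, rewards in demo_rewards.items():
    print(f"{setting:>26}: mean = {mean(rewards):6.1f}, std = {stdev(rewards):5.1f}")
```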
5.7 From MDP to POMDP: CartPole with Hidden State, Noise, and Latency
This script is designed to train and evaluate a DQN agent on CartPole while explicitly showing the difference between an ideal MDP and increasingly realistic POMDP settings.
Instead of changing the algorithm, the script keeps the same DQN setup and progressively changes what the agent can observe.
You can run CartPole in:
- a pure MDP setting, where the agent sees the full state,
- or several POMDP variants, where information is missing, noisy, or delayed.
The goal of this script is not to achieve the best possible score, but to demonstrate how partial observability alone affects learning and behavior.
By toggling simple command-line flags, you can reproduce:
- hidden state variables,
- noisy observations,
- observation latency,
- or all three combined.
This makes the script a controlled experiment that illustrates why Reinforcement Learning often works well in simulations but struggles in real-world systems.
"""
DQN Training and Demo Script for CartPole
-----------------------------------------------
Train or test a DQN agent using Stable Baselines3.
Supports:
- Pure MDP (baseline)
- Partial observability (POMDP) via:
1. Hidden variables
2. Observation noise
3. Observation latency
TRAINING EXAMPLES:
python dqn_cartpole.py --train
python dqn_cartpole.py --train --hide_velocity
python dqn_cartpole.py --train --obs_noise_std 0.02
python dqn_cartpole.py --train --obs_delay 2
python dqn_cartpole.py --train --hide_velocity --obs_noise_std 0.02 --obs_delay 2
DEMO EXAMPLES:
python dqn_cartpole.py --demo --model YOUR_MODEL_PATH.zip
python dqn_cartpole.py --demo --model YOUR_MODEL_PATH.zip --hide_velocity
python dqn_cartpole.py --demo --model YOUR_MODEL_PATH.zip --obs_noise_std 0.02
python dqn_cartpole.py --demo --model YOUR_MODEL_PATH.zip --hide_velocity --obs_noise_std 0.02 --obs_delay 2
Author: Calin Dragos George
"""
import argparse
import os
import time
from collections import deque
import gymnasium as gym
import numpy as np
import torch
from stable_baselines3 import DQN
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.logger import configure
# ---------------------------------------------------------
# Utility: Set seeds for reproducibility
# ---------------------------------------------------------
def set_seed(seed):
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
# ---------------------------------------------------------
# Observation Wrappers (POMDP transformations)
# ---------------------------------------------------------
class HideVelocityWrapper(gym.ObservationWrapper):
    """
    * removes velocity components from CartPole observation
    * keeps only position and angle
    """

    def __init__(self, env):
        super().__init__(env)
        self.observation_space = gym.spaces.Box(
            low=np.array([-4.8, -0.418], dtype=np.float32),
            high=np.array([4.8, 0.418], dtype=np.float32),
        )

    def observation(self, obs):
        # obs = [x, x_dot, theta, theta_dot]
        return np.array([obs[0], obs[2]], dtype=np.float32)
class AddNoiseWrapper(gym.ObservationWrapper):
    """
    * adds Gaussian noise to observations
    """

    def __init__(self, env, noise_std):
        super().__init__(env)
        self.noise_std = noise_std

    def observation(self, obs):
        noise = np.random.normal(0.0, self.noise_std, size=obs.shape)
        # Cast back to float32 so the result matches the declared observation space.
        return (obs + noise).astype(np.float32)
class ObservationDelayWrapper(gym.ObservationWrapper):
    """
    * returns observations from k steps in the past
    """

    def __init__(self, env, delay_steps):
        super().__init__(env)
        self.delay_steps = delay_steps
        self.buffer = deque(maxlen=delay_steps + 1)

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self.buffer.clear()
        for _ in range(self.delay_steps + 1):
            self.buffer.append(obs)
        return obs, info

    def observation(self, obs):
        self.buffer.append(obs)
        return self.buffer[0]
# ---------------------------------------------------------
# Create CartPole Environment
# ---------------------------------------------------------
def make_env(
    render=False,
    hide_velocity=False,
    noise_std=0.0,
    delay_steps=0,
):
    if render:
        env = gym.make("CartPole-v1", render_mode="human")
    else:
        env = gym.make("CartPole-v1")

    # Apply POMDP transformations
    if hide_velocity:
        env = HideVelocityWrapper(env)
    if noise_std > 0.0:
        env = AddNoiseWrapper(env, noise_std)
    if delay_steps > 0:
        env = ObservationDelayWrapper(env, delay_steps)

    env = Monitor(env)
    return env
# ---------------------------------------------------------
# Create DQN Model
# ---------------------------------------------------------
def create_model(env, lr, log_dir):
    model = DQN(
        "MlpPolicy",
        env,
        learning_rate=lr,
        buffer_size=100_000,
        learning_starts=1_000,
        batch_size=64,
        gamma=0.99,
        target_update_interval=1_000,
        train_freq=4,
        exploration_fraction=0.1,
        exploration_final_eps=0.02,
        verbose=1,
        tensorboard_log=log_dir,
    )

    logger = configure(log_dir, ["stdout", "tensorboard"])
    model.set_logger(logger)
    model.logger.record("hyperparams/learning_rate", lr)
    model.logger.record("hyperparams/buffer_size", 100_000)
    model.logger.record("hyperparams/batch_size", 64)
    return model
# ---------------------------------------------------------
# Build an experiment tag (used in the saved model's filename)
# ---------------------------------------------------------
def build_experiment_tag(hide_velocity, noise_std, delay_steps):
    tags = []
    if hide_velocity:
        tags.append("noVel")
    if noise_std > 0.0:
        tags.append(f"noise{noise_std}")
    if delay_steps > 0:
        tags.append(f"delay{delay_steps}")
    if not tags:
        return "MDP"
    return "_".join(tags)
# ---------------------------------------------------------
# Train DQN
# ---------------------------------------------------------
def train_dqn(lr, timesteps, seed, hide_velocity, noise_std, delay_steps):
    set_seed(seed)
    env = make_env(
        hide_velocity=hide_velocity,
        noise_std=noise_std,
        delay_steps=delay_steps,
    )
    exp_tag = build_experiment_tag(hide_velocity, noise_std, delay_steps)
    timestamp = time.strftime("%Y%m%d-%H%M%S")
    log_dir = f"logs/DQN_CartPole_{timestamp}"
    os.makedirs(log_dir, exist_ok=True)

    print("\nTraining DQN on CartPole-v1")
    print(f"LR: {lr} | Seed: {seed}")
    print(f"hide_velocity={hide_velocity}, noise_std={noise_std}, delay={delay_steps}")
    print(f"Logging to: {log_dir}\n")

    model = create_model(env, lr, log_dir)
    # Also seed the environment and the action space used for exploration;
    # set_seed() above only covers NumPy and PyTorch.
    model.set_random_seed(seed)
    model.learn(total_timesteps=timesteps)

    model_path = os.path.join(log_dir, f"DQN_CartPole_{exp_tag}.zip")
    model.save(model_path)
    print(f"\nModel saved to: {model_path}\n")
    env.close()
# ---------------------------------------------------------
# Demo DQN Model
# ---------------------------------------------------------
def run_demo(model_path, episodes, hide_velocity, noise_std, delay_steps):
    if not os.path.exists(model_path):
        print(f"\nModel not found: {model_path}\n")
        return

    env = make_env(
        render=True,
        hide_velocity=hide_velocity,
        noise_std=noise_std,
        delay_steps=delay_steps,
    )
    model = DQN.load(model_path)

    for ep in range(episodes):
        obs, _ = env.reset()
        done = False
        total_reward = 0
        while not done:
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, terminated, truncated, _ = env.step(action)
            total_reward += reward
            done = terminated or truncated
        print(f"Episode {ep + 1}: Reward = {total_reward}")

    env.close()
# ---------------------------------------------------------
# CLI
# ---------------------------------------------------------
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--train", action="store_true")
    parser.add_argument("--demo", action="store_true")
    parser.add_argument("--lr", type=float, default=3e-4)
    parser.add_argument("--timesteps", type=int, default=200_000)
    parser.add_argument("--seed", type=int, default=1)
    parser.add_argument("--model", type=str, default=None)
    parser.add_argument("--episodes", type=int, default=3)
    # POMDP flags
    parser.add_argument("--hide_velocity", action="store_true")
    parser.add_argument("--obs_noise_std", type=float, default=0.0)
    parser.add_argument("--obs_delay", type=int, default=0)
    args = parser.parse_args()

    if args.train:
        train_dqn(
            args.lr,
            args.timesteps,
            args.seed,
            args.hide_velocity,
            args.obs_noise_std,
            args.obs_delay,
        )
    elif args.demo:
        if args.model is None:
            print("\nERROR: missing --model path\n")
        else:
            run_demo(
                args.model,
                args.episodes,
                args.hide_velocity,
                args.obs_noise_std,
                args.obs_delay,
            )
    else:
        print("\nPlease specify --train or --demo.\n")
5.8 How to fix a POMDP
Reinforcement Learning works not because the world is Markovian, but because we make it Markovian enough.
In the previous sections we saw that reinforcement learning degrades or even stalls when:
- important variables are missing,
- observations are noisy,
- information arrives late.
This is not an algorithm problem. It is an information problem. This section explains the three standard ways in which, in practice, a POMDP is transformed into a problem that is “sufficiently Markovian” for RL to work.
5.8.1 Frame Stacking – reconstructing the recent past
In a POMDP, a single observation is not enough. But several consecutive observations can tell a story.
For example:
- you don’t see the velocity,
- but if you see the current position and the position 1 step ago,
- you can infer the direction and approximate velocity.
Frame stacking does exactly this.
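A minimal sketch of the idea, written in plain Python so it runs on its own. The class name `FrameStacker` and the sample values are illustrative and not part of the script above; in an actual Stable Baselines3 setup, the library's `VecFrameStack` wrapper plays this role. The observation here is `[x, theta]`, as produced when velocities are hidden.

```python
from collections import deque


class FrameStacker:
    """Keep the last `n_frames` observations and expose them as one flat list."""

    def __init__(self, n_frames):
        self.frames = deque(maxlen=n_frames)

    def reset(self, first_obs):
        # Repeat the first observation so the stacked size is constant from step 0.
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(list(first_obs))
        return self.stacked()

    def push(self, obs):
        # The oldest frame is dropped automatically once the deque is full.
        self.frames.append(list(obs))
        return self.stacked()

    def stacked(self):
        return [value for frame in self.frames for value in frame]


DT = 0.02  # CartPole's simulation step, in seconds

stacker = FrameStacker(n_frames=2)
stacker.reset([0.00, 0.010])            # [x, theta] at the first step
stacked = stacker.push([0.01, 0.016])   # [x, theta] one step later

# stacked = [x_prev, theta_prev, x_now, theta_now]
theta_dot_estimate = (stacked[3] - stacked[1]) / DT
print(stacked, theta_dot_estimate)
```

The network never computes this difference explicitly. Stacking only puts both positions in front of it, so that a velocity-like feature becomes learnable.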
When it works well
- the observation lacks velocities or other derivatives
- latency is low
- the dynamics are relatively slow
Limitations
- increases state size
- does not scale well for complex dynamics
- does not “understand” the past, just concatenates it
Frame stacking is simple and effective, but it is not smart.
5.8.2 Memory – learning an internal state
Instead of giving the agent the raw past, you let it remember what matters.
It is the difference between being given a log with the last 5 pages, and having a memory that retains the important things.
What it means technically
Memory architectures are used, such as:
- LSTM
- GRU
- Recurrent Policies (DRQN, Recurrent PPO etc.)
The agent:
- receives observations one by one,
- updates an internal state,
- uses this state for decision making.
What problems it solves:
- severe partial observability
- high latency
- persistent noise
- completely hidden variables
Memory allows the agent to:
- filter out noise,
- integrate information over time,
- reconstruct the internal dynamics of the environment.
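The loop described above (receive an observation, update an internal state, act on that state) can be sketched with a tiny hand-written recurrent cell. This is only an illustration: the weights `W_H` and `W_O` are hypothetical values picked by hand, whereas an LSTM or GRU learns its weights during training and uses a richer update rule.

```python
import math

# Hypothetical, hand-picked weights; a real recurrent policy learns these.
W_H = [[0.5, 0.0], [0.0, 0.5]]  # hidden state -> hidden state
W_O = [[1.0], [-1.0]]           # observation  -> hidden state


def rnn_step(hidden, obs):
    """One Elman-style update: h_t = tanh(W_H h_{t-1} + W_O o_t)."""
    new_hidden = []
    for row_h, row_o in zip(W_H, W_O):
        pre_activation = sum(w * h for w, h in zip(row_h, hidden))
        pre_activation += sum(w * o for w, o in zip(row_o, obs))
        new_hidden.append(math.tanh(pre_activation))
    return new_hidden


def run(observations):
    hidden = [0.0, 0.0]  # empty memory at the start of an episode
    for obs in observations:
        hidden = rnn_step(hidden, obs)
    return hidden


# Both histories end with the same pole angle (0.02 rad), but the angle
# was increasing in the first one and decreasing in the second one.
h_increasing = run([[0.00], [0.02]])
h_decreasing = run([[0.04], [0.02]])
print(h_increasing, h_decreasing)
```

The last observation is identical in both cases, yet the internal states differ. A policy that reads the internal state can therefore act differently in two situations that look the same from a single frame.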
Limitations
- more unstable training
- harder to debug
- requires more data
- more complex implementation
Memory is the most powerful solution, but also the most expensive.
5.8.3 Enriched observations – adding useful information
Sometimes, the problem is not that RL doesn’t work, but that the agent doesn’t get what it needs.
Instead of fixing the algorithm, fix what the agent observes.
What it means technically
Add to the observation:
- Estimates (e.g. estimated velocity),
- Filters (e.g. moving averages, derivatives),
- Additional sensors,
- Computed state information.
Examples:
- IMU + encoder in robotics,
- Kalman filters,
- Derived signals,
- Timestamps.
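As a rough illustration of an enriched observation, the sketch below appends a filtered angle and an estimated angular velocity to a `[x, theta]` observation. It uses an alpha-beta filter, a simple fixed-gain relative of the Kalman filter. The class name, the gains `alpha` and `beta`, and the helper `enrich` are illustrative choices, not part of the script in section 5.7.

```python
class AlphaBetaFilter:
    """Track a scalar signal and its rate of change with fixed gains."""

    def __init__(self, dt, alpha=0.5, beta=0.1):
        self.dt = dt
        self.alpha = alpha  # how strongly the position follows new measurements
        self.beta = beta    # how strongly the velocity follows new measurements
        self.position = None
        self.velocity = 0.0

    def update(self, measurement):
        if self.position is None:
            # First sample: nothing to predict from yet.
            self.position = measurement
            return self.position, self.velocity
        predicted = self.position + self.velocity * self.dt
        residual = measurement - predicted
        self.position = predicted + self.alpha * residual
        self.velocity = self.velocity + (self.beta / self.dt) * residual
        return self.position, self.velocity


def enrich(obs, angle_filter):
    """Turn [x, theta] into [x, theta, theta_filtered, theta_dot_estimate]."""
    x, theta = obs
    theta_filtered, theta_dot_estimate = angle_filter.update(theta)
    return [x, theta, theta_filtered, theta_dot_estimate]


DT = 0.02  # CartPole's simulation step, in seconds
angle_filter = AlphaBetaFilter(dt=DT)

# Noise-free test signal: the angle grows at a constant 0.5 rad/s.
for step in range(200):
    enriched = enrich([0.0, 0.5 * DT * step], angle_filter)

print(enriched)
```

On this clean ramp the velocity estimate settles at the true rate. With noisy measurements, lower gains smooth more but react more slowly, which is the kind of trade-off that calls for domain expertise.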
When it is the best solution
- You have control over the sensors,
- You can pre-process the data,
- You want stability and interpretability.
This is a very common approach in real robotics.
Limitations
- requires domain expertise,
- may introduce bias,
- is not end-to-end.
Enriched observations are the most pragmatic solution.
5.8.4 How to choose the right solution
| Problem | Recommended Solution |
|---|---|
| Missing velocities | Frame stacking |
| Noise + latency | Memory |
| Real sensors | Enriched observations |
| Simple system | Frame stacking |
| Complex system | Memory + enriched observations |
You don’t fix a POMDP by making the algorithm smarter. You fix it by giving the agent a better state.
6. Conclusion
Reinforcement Learning is not about smart algorithms. It is about what the agent sees, how correctly it sees, and how early it sees.
If the agent sees everything clearly, then we have an MDP and RL works well. If the agent sees only part of the state, with noise and delay, then we have a POMDP and RL degrades, sometimes to the point of not learning at all.
The algorithm is not “stupid”. The agent is not “weak”. The problem is the information.
The most important lesson is that you do not make RL better by choosing a more complicated algorithm. You make it better by giving the agent a better state.
That is why in practice we add memory, use frame stacking, enrich observations, and filter and estimate the real state.
Not because the theory is wrong, but because the real world is not a perfect MDP.
In theory everything is an MDP. In practice almost everything is a POMDP.





