AI Robotics: Tutorials, Practical Reinforcement Learning, and Real-World Control

Exploration vs Exploitation in RL Explained with FrozenLake and DQN

by Dragos Calin
in OpenAI Gymnasium, RL Fundamentals

Exploration vs Exploitation is one of the most well-known concepts in Reinforcement Learning. At the same time, it is one of the most misunderstood strategies. Most of the time, in demos or in practice, Exploration vs Exploitation is reduced to “epsilon” or a simple slider between “random” and “smart behavior”. In reality, it is not nearly that simple.

Exploration is not a parameter. Exploration is the strategy by which the RL agent decides what experience it will have. It determines what states it sees, what trajectories it discovers, what information ends up in the replay buffer and, implicitly, what truth about the world the agent comes to believe. If this strategy is wrong, the agent can seem stable, look good in graphs and, at the same time, be completely useless in reality. If done correctly, not only does the agent learn better, but it becomes more robust, smarter and more adaptable.

You will learn from this tutorial:

  • You will understand that Exploration vs Exploitation is not a button, it is not “epsilon”, but a real data collection strategy, which decides what the agent can learn and how good it can become.
  • You will see why the training reward can lie to you, why an agent without exploration can look “better” on the graph, but actually be weaker in reality.
  • You will learn where exploration actually occurs in a Markov Decision Process (MDP), not only in actions, but also in states and in the agent’s policy; and why this matters enormously.
  • You will understand what exploiting a wrong policy means, how lock-in occurs, why exploiting too early can destroy learning, and what this looks like in practice.
  • You will learn the different types of exploration in modern RL: epsilon, entropy, optimism, uncertainty, curiosity; and what each solves and where it falls short.
  • You will learn to interpret data correctly: when reward means something, when it doesn’t, and what entropy, action diversity, state distribution and seed sensitivity tell you.
  • You will see everything in practice, in a FrozenLake + DQN case study, with three types of exploration: no exploration, large exploration and controlled exploration; and you will understand what is really happening and why.

Table of Contents

  • 1. Exploration vs Exploitation in MDP: the real issue (not the definition)
  • 2. Where does exploration occur in a Markov Decision Process?
  • 3. What does exploitation mean in an MDP?
  • 4. Exploration methods in RL: purpose, mechanism, when to use them
  • 5. How to recognize if the agent is exploring or exploiting (from data)
  • 6. How to control the intensity of exploration (depending on the algorithm)
  • 7. Real-world Case Study: Exploration vs. Exploitation — What do the graphs show, what does the agent learn, and what is the truth?
  • 8. Common Mistakes That Destroy Exploration in RL
  • 9. Conclusion: Exploration vs Exploitation is not a parameter
  • 10. Practical checklist (for real projects)

1. Exploration vs Exploitation in MDP: the real issue (not the definition)

Exploration vs Exploitation in MDP

Exploration vs exploitation is not:

  • “random vs smart,”
  • “stupid vs intelligent,”
  • “beginning vs end.”

It is a problem of strategic decision-making over time, in a system where what you do now influences what you will be able to see, learn and do later.

Exploration in an MDP is an investment. You accept worse actions now in order to have better data in the future.

Exploitation is the monetization of existing knowledge.

1.1 Why exploration in RL means data collection, not just “random choice”

Exploration is not “the agent waves its hands chaotically.” Exploration is the process by which the agent builds the set of experiences from which it will learn the policy.

  • If the data is weak –> the policy is weak.
  • If the data is limited –> the agent “believes” that the world only looks like this.
  • If the data is biased –> the agent learns nonsense, but is very confident in it.

Exploration determines WHAT the agent sees. What the agent sees determines what the agent learns. What the agent learns determines how it acts. How it acts again determines what it sees. It is a causal loop, not a static process.

1.2 How an action changes the distribution of future states

In an MDP, the environment is not static. When you choose an action, you don’t just gain or lose reward. You change the path the agent will take through the world.

Technically, each action influences:

  • which state you will see next,
  • which states you will ever be able to see again,
  • how often you will visit certain states.

Exploration is not local (it doesn’t mean “try another button now”). Exploration is about:

  • which areas of the environment are accessible,
  • which areas remain hidden forever,
  • what types of situations the agent will learn to handle.

A single choice today can completely narrow the future.

1.3 Why an agent can explore a lot and yet learn nothing

Does it seem paradoxical? It is real. An agent can “explore a lot” and yet:

  • not reach relevant states,
  • see only unimportant situations,
  • collect noisy and irrelevant data,
  • not cover the important state space at all,
  • “see a lot, but understand little.”

Examples

  1. if the agent explores chaotically –> does not return to useful states enough –> does not learn stably,
  2. if the environment is large –> shallow exploration = thin data,
  3. if the reward is rare –> can “explore” for years without ever reaching what matters.

Quantitative exploration does not equal good exploration. What matters is what we explore, not just “that we explore”.

1.4 The fundamental difference from bandits

In RL, actions influence what you can observe. This is one of the most important truths in RL.

In Multi-Armed Bandit:

  • action affects only the reward,
  • the environment has no memory,
  • you do not influence what you will be able to see in the next step,
  • the world does not change depending on you.

You choose, you get a reward, you choose again. That’s it.

Example of a Multi-Armed Bandit: A/B testing on websites (you choose a button and immediately see the clicks).

In a Markov Decision Process (MDP) / Partially Observable Markov Decision Process (POMDP):

  • action changes the state,
  • the new state determines what options the agent will have,
  • the agent’s decision today changes its future.

In RL, if you choose poorly, you not only get a poor reward, you limit your access to future experience.

Example of an MDP/POMDP: an autonomous vehicle. If it takes a wrong turn, it not only loses speed but also enters a dangerous zone, which limits its future experiences.

Here’s the gist:

  • bandits = static exploration,
  • RL MDP/POMDP = dynamic, history-dependent exploration with consequences.

That’s why RL with MDP/POMDP is more complicated and deeper.


2. Where does exploration occur in a Markov Decision Process?

Exploration is not a point. It’s a structure.
Actions → change states → shape policy → change future actions

Exploration in an MDP is not a single button. It is not just epsilon-greedy. It is not just “random.”

Exploration can occur at three different levels and each has its role:

  1. actions,
  2. states,
  3. policy.

2.1 Exploration in the action space

What does exploration in the action space entail?

It means that the agent, at a given state, intentionally chooses an action that does not seem to be the best based on current knowledge.

That is:

  • the agent knows that “action A” seems better,
  • but sometimes chooses “action B” (worse at first glance),
  • to test whether B might actually be better than it thinks.

What is the goal?

The goal is for the agent to discover if there are better alternatives than what it currently ‘thinks.’

In other words:

  • check whether it is mistaken,
  • avoid getting stuck in a local optimum,
  • collect more diverse data for learning.

How is action space exploration done?

The most well-known methods are:

  • epsilon-greedy,
  • softmax / Boltzmann exploration,
  • noisy networks,
  • sampling from the actor-critic action distribution.

They all have the same principle. Sometimes we intentionally choose a suboptimal action.

Choosing suboptimal actions intentionally

That’s the key:

  • it’s not a mistake,
  • it’s planned,
  • it’s designed.

It’s like telling the agent “Do something a little worse now to see if you learn something better for later.”

Local and limited effect

Exploration in actions is “local,” that is, it:

  • affects only the next state,
  • is efficient in simple or small environments,
  • may be insufficient in large or sparse-reward environments.

In short, “exploring actions” ≠ “exploring the environment.”

2.2 State-space exploration

What does it entail?

Here the discussion gets deeper. State-space exploration means that the agent aims to reach new areas of the environment, not just vary actions around the same part of the environment.

In other words:

  • it doesn’t just swap the “left” action for the “right,”
  • it changes the agent’s life path in the environment.

What is the goal?

The goal is:

  • to increase the coverage of the environment,
  • to see rare situations,
  • to discover opportunities that it would not have encountered otherwise,
  • to collect relevant data in the long term.

In real RL (robotics, navigation, complex tasks), this is critical.

How is it done practically?

Simple epsilon-greedy is no longer enough.

This is where more advanced methods appear:

  • intrinsic motivation / curiosity,
  • exploration bonuses,
  • count-based exploration,
  • state visitation bonuses,
  • novelty search,
  • goal-conditioned exploration,
  • random network distillation (RND).

They all have the same goal, namely to “push” the agent towards new areas of the environment.

Why “exploring actions” ≠ “exploring the environment”

You can vary actions as much as you want and still stay in one corner of the environment.

Simple example:

  • the agent paces around one room, randomly stepping left and right,
  • explores many actions,
  • but does not explore the environment.

In other words:

  • action exploration = local exploration,
  • state exploration = global / structural exploration.

Distribution of visited states = true metric

In this case, we don’t ask “How much did the agent explore?”

We ask: “What distribution of states did the agent visit?”

If:

  • we have many different states –> good exploration,
  • we have only a few repeated states –> poor exploration.

That’s the real metric.

2.3 Policy-level exploration

This kind of exploration works at the structural level.

Policy-level exploration means that the agent’s policy is built in such a way that it “naturally” produces diverse behavior, not just the occasional random action.

You’re not just forcing random actions from the outside. Exploration is part of the decision architecture.

What’s the goal?

The goal is:

  • stability in exploration,
  • consistency,
  • diverse but intelligent behavior,
  • the ability to discover entire strategies, not just individual actions.

It is especially important in:

  • policy gradient,
  • PPO,
  • SAC,
  • in general in RL and robotics.

How is it done?

By policy design. The two fundamental types:

Deterministic policies

  • for the same state, we always have the same action,
  • zero structural exploration,
  • you have to “push” the exploration from the outside (epsilon, noise).

Stochastic policies

For the same state, the action is sampled from a distribution. The distribution can be:

  • Gaussian (continuous),
  • Categorical (discrete).

Exploration becomes natural and controllable. But this is also where the concept of entropy appears.

Entropy as a structural mechanism of exploration

Entropy measures “how widespread” the distribution of actions is.

High entropy means a diverse policy. That is, the agent explores a lot.

Low entropy means a near-deterministic policy. That is, the agent exploits.

In some algorithms (e.g. SAC), the agent maximizes both reward and entropy. In effect, the agent “learns to be good, but also curious enough.” This is an elegant and powerful form of exploration.
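To make “how widespread” concrete, here is a minimal sketch that computes the Shannon entropy of a discrete action distribution. The two example distributions are made up for illustration; they are not taken from a trained agent.

```python
import math

def policy_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

uniform = [0.25, 0.25, 0.25, 0.25]  # illustrative: every action equally likely
peaked = [0.97, 0.01, 0.01, 0.01]   # illustrative: one action dominates

print(policy_entropy(uniform))  # the maximum for 4 actions, log(4)
print(policy_entropy(peaked))   # much closer to 0
```

A uniform policy sits at the maximum entropy, and a policy that always picks one action has entropy 0.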


3. What does exploitation mean in an MDP?

Confidence vs Reality
Confidence vs Reality

Exploitation does not mean:

  • “take the action with the maximum reward,”
  • “do what seems best and that’s it.”

In an MDP, exploitation means acting according to the current policy, using what the agent thinks it knows about the world, instead of collecting new information.

This means that:

  • the agent stops questioning its knowledge,
  • reinforces its automatisms,
  • “freezes” its view of the environment.

And here the problems arise, namely:

3.1 Exploitation ≠ action with the maximum reward

Exploitation is not the “real value.” It is the “best decision according to what the agent thinks right now.”

If knowledge is incomplete:

  • the agent exploits something possibly wrong,
  • but it is very certain.

In this case RL does not exploit reality. It exploits the agent’s internal model of reality.

3.2 Exploiting a wrong policy

If the policy is weak, but you switch to exploitation too early:

  • the agent repeats a mediocre strategy,
  • stops looking for alternatives,
  • reinforces the mistake.

It’s like a child who learns the multiplication table incorrectly and then repeats it with conviction.

What we need to know in practice is that:

  • if the data was biased –> the policy is biased,
  • if the agent explored badly –> the policy “only sees part of the world,”
  • exploiting an incorrect policy = “stabilizing stupidity.”

3.3 Bootstrapping errors & policy lock-in

In many RL algorithms (Q-learning, DQN, etc.) the agent updates its Q estimates using its own past estimates. So the error can feed on itself.

If Q is wrong, the wrong value is used as the target of the next update, so it stays wrong for longer.

So the agent:

  • chooses actions based on the wrong Q,
  • collects data that confirms the wrong Q,
  • no longer sees any alternatives.

It is a vicious circle.
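The self-reference is visible in the standard tabular Q-learning update. The learning rate α and the discount factor γ are the usual textbook symbols, not values from this tutorial. The target on the right-hand side contains the agent's own estimate of the next state:

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
```

If the estimate Q(s', a') is too high or too low, that error flows directly into the new value of Q(s, a).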

Policy lock-in

It occurs when the agent becomes:

  • very confident in a strategy,
  • very stable,
  • but in reality suboptimal.

Lock-in = the policy remains “locked” on a mode of behavior.

Why does it occur?

  • too little exploration,
  • exploration stopped too early,
  • aggressive decay of epsilon,
  • a policy that becomes deterministic too fast,
  • lack of novelty incentives.

In this case the agent becomes “rigid”: it seems competent, but it is limited.

3.4 Why early exploitation can destroy learning

This is probably the most important thing.

If you start exploitation too early:

  • the agent has not yet learned the environment correctly,
  • the data is incomplete,
  • the policy is immature.

Yet by switching to exploitation at this point, we force the agent to:

  • stop exploring,
  • repeat the immature policy,
  • convince itself that it is right.

The result is that the agent:

  • learns quickly,
  • seems stable,
  • but its performance ceiling is low,
  • and stops evolving.

In short, early exploitation leads to the agent being locked into a weak but stable solution.


4. Exploration methods in RL: purpose, mechanism, when to use them

Different tools.
Different behaviors.
Different problems they solve.
Different tools.
Different behaviors.
Different problems they solve.

Exploration is not a single technique. There are several families of methods, each solving a different type of problem.

A practitioner needs to know:

  • what each one does,
  • when it is useful,
  • where it is insufficient,
  • what problem it solves and what it does not solve.

4.1 Exploration by actions

What does it do?

It modifies “the choice of actions” at each step.

The agent:

  • sometimes chooses the “best” action,
  • sometimes intentionally chooses something else.

It is local exploration, at the instant decision level.

4.1.1 ε-greedy

  • with probability ε –> random action,
  • with probability 1 − ε –> the “best” action.

Advantage:

  • simple,
  • efficient in small / classic environments (CartPole, FrozenLake).

Limitations:

  • does not direct exploration,
  • can visit the same areas over and over again,
  • poor in large or sparse reward environments.
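The rule above fits in a few lines. This is a minimal sketch, assuming a list of Q-values for the current state; the function name and the example values are hypothetical.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                     # explore
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploit

q = [0.1, 0.5, 0.2, 0.0]   # hypothetical Q-values for one state
rng = random.Random(0)     # fixed seed so the run is repeatable
picks = [epsilon_greedy(q, epsilon=0.1, rng=rng) for _ in range(1000)]
print(picks.count(1) / len(picks))  # share of steps that took the greedy action
```

Note that the random branch is uniform over all actions. It does not know which actions or states are under-explored, which is exactly the first limitation listed above.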

4.1.2 Action Noise (especially in continuous control)

Examples: the exploration noise in DDPG, Gaussian noise.

  • the agent wants an action,
  • you add noise to it,
  • explores “around” the current behavior.

This method works well for continuous robotic control and for stabilizing learning.

Limitations:

  • local exploration,
  • the agent “walks around” the same policy a bit.
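A minimal sketch of Gaussian action noise, assuming the action is a list of floats with the same bounds in every dimension. The function name, the bounds and the example action are hypothetical.

```python
import random

def noisy_action(action, sigma, low, high, rng=random):
    """Add zero-mean Gaussian noise to each action dimension, then clip to bounds."""
    return [min(high, max(low, a + rng.gauss(0.0, sigma))) for a in action]

rng = random.Random(0)
policy_action = [0.3, -0.8]  # hypothetical output of a deterministic actor
print(noisy_action(policy_action, sigma=0.1, low=-1.0, high=1.0, rng=rng))
```

The standard deviation sigma plays the role that ε plays in the discrete case: larger values mean wider exploration around the current behavior.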

4.1.3 Softmax / Boltzmann exploration

Instead of “best vs random”:

  • you sample each action with a probability that grows with its estimated value (proportional to exp(Q / temperature)),
  • the spread is controlled by the “temperature.”

It is smarter than uniform random choice because it tests promising actions more often. But it does not guarantee the discovery of new regions in the environment.
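A minimal sketch of Boltzmann action selection, assuming a list of Q-values for the current state. The function names and the example values are hypothetical.

```python
import math
import random

def boltzmann_probs(q_values, temperature):
    """Softmax over Q-values; subtracting the max keeps exp() numerically stable."""
    m = max(q_values)
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]

def boltzmann_action(q_values, temperature, rng=random):
    """Sample one action index from the softmax distribution."""
    probs = boltzmann_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]

q = [0.1, 0.5, 0.2, 0.0]                    # hypothetical Q-values for one state
print(boltzmann_probs(q, temperature=100))  # high temperature: almost uniform
print(boltzmann_probs(q, temperature=0.01)) # low temperature: almost greedy
```

The temperature is the exploration lever here: high values flatten the distribution, low values concentrate it on the best-looking action.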

When does it work and when is it insufficient?

It works well for:

  • small environments,
  • frequent rewards,
  • educational problems / demos,
  • situations where “local exploration” is sufficient.

It is insufficient for:

  • large environments,
  • partially observable environments,
  • environments with very rare rewards (hard-exploration Atari games, real navigation),
  • complex robotics.

4.2 Exploration through optimism and uncertainty

Now we are discussing exploration “with a plan,” not just random choices.

The fundamental idea is that the agent does not just try random things. The agent mainly tries things about which it is not sure.

4.2.1 Optimistic Initialization

In this method:

  • we set very large initial Q values,
  • the agent believes that the unknown = good potential,
  • so it tests unexplored areas.

It is a simple method that works well in small environments. But it is hard to apply in deep RL, where it can destabilize training.
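A minimal tabular sketch of the idea, assuming a 16-state, 4-action grid like FrozenLake 4x4, where no return can exceed 1.0. The transition used in the example is hypothetical, and the helper names are illustrative.

```python
N_STATES, N_ACTIONS = 16, 4   # sizes of a 4x4 FrozenLake grid
OPTIMISTIC_VALUE = 1.0        # assumption: no return can exceed 1.0

# Every untried (state, action) pair starts out looking as good as possible.
Q = [[OPTIMISTIC_VALUE] * N_ACTIONS for _ in range(N_STATES)]

def greedy(state):
    """Index of the highest-valued action (first one on ties)."""
    row = Q[state]
    return max(range(N_ACTIONS), key=row.__getitem__)

def update(state, action, reward, next_state, alpha=0.5, gamma=0.99):
    """One tabular Q-learning update."""
    target = reward + gamma * max(Q[next_state])
    Q[state][action] += alpha * (target - Q[state][action])

first = greedy(0)
update(0, first, reward=0.0, next_state=4)  # hypothetical unrewarded step
second = greedy(0)
print(first, second)  # the purely greedy agent moves on to an untried action
```

No random choice is involved: the tried action simply drops below the untried ones, so a greedy agent rotates through them by itself.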

4.2.2 UCB (Upper Confidence Bound) ideas adapted to RL

Inspired by bandits:

  • it takes into account the estimated value,
  • it adds a bonus for the uncertainty of that estimate,
  • so less visited actions become more attractive.

It directs exploration more intelligently than epsilon. But in full RL (MDP) it is much harder to apply correctly and can be computationally expensive.
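A minimal sketch of the bandit-style UCB1 rule applied to the actions of a single state. The counts, the Q-values and the exploration constant c are hypothetical, and a full MDP version would need per-state counts, which this sketch leaves out.

```python
import math

def ucb_action(q_values, counts, total_steps, c=2.0):
    """Pick argmax of Q + c * sqrt(ln(t) / N); untried actions are taken first."""
    for a, n in enumerate(counts):
        if n == 0:
            return a
    scores = [q + c * math.sqrt(math.log(total_steps) / n)
              for q, n in zip(q_values, counts)]
    return max(range(len(scores)), key=scores.__getitem__)

q = [0.50, 0.45]  # hypothetical value estimates for two actions
print(ucb_action(q, counts=[100, 2], total_steps=102))  # rarely tried action wins
print(ucb_action(q, counts=[51, 51], total_steps=102))  # equal counts: higher Q wins
```

The bonus shrinks as an action is tried more often, so exploration fades where the agent already has data and stays strong where it does not.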

4.2.3 Sampling based on uncertainty

Examples:

  • Bootstrapped DQN,
  • Bayesian RL,
  • Ensemble methods.

The agent maintains several models. Where the models “disagree,” its knowledge of the environment is uncertain, so that is where exploration is needed.

What problem does it solve and what does it NOT solve?

It helps with:

  • unexplored areas,
  • epistemic uncertainty,
  • stagnation in a falsely good policy.

It does not help with:

  • environments with extremely rare rewards,
  • situations where no signal reaches the agent,
  • cases where exploration must be strategically oriented in the long term.

4.3 Policy-level exploration (Entropy-based)

Here exploration is no longer a “patch” over decisions. Exploration becomes “part of the agent’s architecture.”

4.3.1 Stochastic Policies

The policy no longer maps a state to a single action. It maps a state to a distribution over actions.

The agent “draws” the action from a controlled distribution.

4.3.2 Entropy Bonus (PPO, A2C, etc.)

In addition to the reward, the agent receives a bonus for high entropy. That is, “Learn, but stay curious!”

It is a natural, stable and controllable way to keep the policy exploring.
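As a reference point, this is the combined objective from the PPO paper, written as a quantity to maximize. In practice the bonus is added to the training objective rather than to the environment reward. The coefficients c_1 and c_2 are hyperparameters, and c_2 is the entropy coefficient:

```latex
J(\theta) = \mathbb{E}_t \left[ L^{\mathrm{CLIP}}_t(\theta) - c_1 \, L^{\mathrm{VF}}_t(\theta) + c_2 \, \mathcal{H}\!\left( \pi_\theta(\cdot \mid s_t) \right) \right]
```

A larger c_2 rewards the policy for staying spread out; with c_2 = 0 the entropy term disappears.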

4.3.3 Maximum Entropy RL (SAC)

This goes one step further. Entropy is a core part of the objective, not just a bonus. This means that:

  • the agent looks for policies that are good,
  • but also sufficiently diverse.
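The maximum entropy objective, as defined in the SAC paper, adds the policy entropy to the reward at every step. Here α is the temperature that weights entropy against reward, and ρ_π is the state-action distribution induced by the policy:

```latex
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \left[ r(s_t, a_t) + \alpha \, \mathcal{H}\!\left( \pi(\cdot \mid s_t) \right) \right]
```

With α = 0 this reduces to the ordinary expected-return objective.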

SAC is a widely used modern choice for:

  • robotics,
  • continuous control,
  • serious applications.

Why does modern RL prefer this approach?

Because it produces:

  • better stability,
  • coherent exploration,
  • independence from hand-tuned epsilon schedules,
  • scalability in complex environments,
  • robustness of the policy.

Exploration becomes intelligent behavior, not “random pushing.”

4.4 State-space exploration (Intrinsic Motivation)

This is the category used when:

  • the reward is rare,
  • the agent never sees success,
  • local exploration is pointless.

In this case the agent receives “pseudo-reward” for:

  • seeing new things,
  • reaching unvisited areas,
  • learning about the environment.

Exploration becomes an end in itself.

4.4.1 Curiosity

The agent has an internal model and receives a surprise bonus. If the environment behaves differently than expected, the agent becomes curious and goes there.

4.4.2 Count-based bonuses

Simple in its idea:

  • if a state has rarely been visited –> it receives a bonus,
  • if it is very common –> it does not receive one.
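A minimal sketch of a count-based bonus for a small, discrete state space. The form beta / sqrt(N(s)) is one common choice, and the class name and the value of beta are hypothetical.

```python
import math
from collections import Counter

class CountBonus:
    """Intrinsic reward beta / sqrt(N(s)), where N(s) is the visit count of s."""

    def __init__(self, beta=0.1):
        self.beta = beta
        self.counts = Counter()

    def __call__(self, state):
        self.counts[state] += 1
        return self.beta / math.sqrt(self.counts[state])

bonus = CountBonus(beta=0.1)
first_visit = bonus(3)
fourth_visit = [bonus(3) for _ in range(3)][-1]
print(first_visit, fourth_visit)  # the bonus shrinks as the state becomes familiar
```

During training the bonus would be added to the environment reward, so rarely visited states look temporarily more valuable.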

In large environments, approximation is used.

Why is it necessary for sparse rewards?

In environments such as large games, real-world navigation, exploratory robotics, or long tasks, the agent can play for hours without ever reaching a real reward.

Without intrinsic motivation, the agent learns nothing. With intrinsic motivation, the agent discovers the structure of the environment on its own and eventually finds the right path.


5. How to recognize if the agent is exploring or exploiting (from data)

Reading the Truth Behind the Graph

Most practitioners make the mistake of looking only at the reward vs timesteps curve. From that single curve, it is very easy to reach the wrong conclusions.

A mature RL practitioner should know:

  • what the reward does not say,
  • what indicators are really useful,
  • what signs of good vs bad vs insufficient exploration look like.

Reward vs Timesteps: what the graph does not tell you

You cannot judge exploration from reward alone. The reasons are the following:

  • during training the agent is still exploring,
  • exploration = intentionally “weaker” actions,
  • the environment can be stochastic,
  • reward can decrease even if the policy improves,
  • an agent without exploration can show a nice reward and still be fragile.

The reward in training tells us “How well the agent plays while still learning.”

But it doesn’t say “How good is the learned policy.” That’s why other metrics are needed.

Real indicators of exploration

5.1 Entropy vs Timesteps

Entropy means how “spread out” the distribution of actions is.

  • high entropy –> varied behavior,
  • low entropy –> agent sure of actions –> exploitation.

What would we like to see in the graph?

  • at the beginning: high entropy,
  • then: gradually decreases,
  • does not decrease too early,
  • does not stay very high indefinitely.

If we see:

  • low entropy quickly –> agent exploits too early,
  • constantly high entropy –> agent does not learn, remains chaotic.

5.2 Diversity of actions

Here we measure the histogram of actions and the frequency of each action over time.

What do we want to see in the graph?

  • at the beginning –> all actions appear,
  • over time –> some become dominant, but not 100% instantly.

If we see:

  • one action = 90% very early –> lock-in,
  • all actions 25% forever –> exploration without learning.
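A minimal sketch of this check, assuming the actions of a training window were logged as a list of integers. The two logs below are made up for illustration.

```python
from collections import Counter

def action_frequencies(actions, n_actions):
    """Relative frequency of each action index in a logged window."""
    counts = Counter(actions)
    total = len(actions)
    return [counts[a] / total for a in range(n_actions)]

def dominant_share(actions, n_actions):
    """Share of the most frequent action; values near 1.0 early on suggest lock-in."""
    return max(action_frequencies(actions, n_actions))

early = [0, 1, 2, 3, 1, 0, 2, 3]          # made-up log: all actions appear
locked = [2, 2, 2, 2, 2, 2, 2, 1, 2, 2]   # made-up log: one action dominates
print(action_frequencies(early, 4))
print(dominant_share(locked, 4))
```

Computing these numbers per window (for example every few thousand steps) shows when one action starts to dominate.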

5.3 Distribution of visited states

This is the most important professional indicator.

What are we checking?

  • how many distinct states are visited,
  • how evenly,
  • how they evolve over time.

Interpretation:

  • few repeated states –> the agent “lives in a corner” –> weak exploration,
  • many states –> the agent explores the environment,
  • many states at the beginning, then stabilization –> healthy exploration.

The conclusion is that the truth about exploration lies in “what areas of the world the agent sees” and not just in the reward.
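A minimal sketch of two state-visitation metrics for a small discrete environment: coverage (how many states were seen) and the entropy of the visitation distribution (how evenly). The state logs are made up, and 16 states is the size of a 4x4 grid.

```python
import math
from collections import Counter

def state_coverage(visited_states, n_states):
    """Fraction of all states that were visited at least once."""
    return len(set(visited_states)) / n_states

def visitation_entropy(visited_states):
    """Entropy (nats) of the empirical state distribution; higher means more even."""
    counts = Counter(visited_states)
    total = len(visited_states)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

corner = [0, 1, 0, 4, 0, 1, 0, 4]          # made-up log: stuck near the start
wide = [0, 1, 2, 6, 10, 14, 15, 9, 8, 4]   # made-up log: crosses the grid
print(state_coverage(corner, 16), visitation_entropy(corner))
print(state_coverage(wide, 16), visitation_entropy(wide))
```

Tracking both numbers over time separates “sees few states” from “sees many states but keeps returning to a handful.”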

5.4 Seed Sensitivity

A very simple and very powerful test:

  • run the same agent 3-5 times,
  • change only the seed.

If the results differ massively, the agent is most likely luck-dependent and the exploration is weak or unstable. If the results are consistent, it means that the exploration or learning is robust.
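A minimal sketch of how the per-seed results can be summarized. The scores below are made-up evaluation success rates, not results from this tutorial.

```python
import statistics

def seed_report(final_scores):
    """Mean, sample standard deviation and range of per-seed evaluation scores."""
    return {
        "mean": statistics.mean(final_scores),
        "std": statistics.stdev(final_scores),
        "range": max(final_scores) - min(final_scores),
    }

robust = [0.71, 0.74, 0.69, 0.73, 0.72]    # made-up: consistent across seeds
fragile = [0.05, 0.78, 0.00, 0.74, 0.10]   # made-up: success depends on luck
print(seed_report(robust))
print(seed_report(fragile))
```

A large spread relative to the mean is the warning sign; the mean alone can hide it.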

It is one of the most used criteria in research.

Clear signs (diagnosis like an RL professional)

Insufficient exploration (a “lazy” agent)

Signs of insufficient exploration are when:

  • the reward grows quickly, then caps out early,
  • the entropy decreases very quickly,
  • the actions are massively dominated by one,
  • the distribution of states is very small or narrow,
  • the agent is very sensitive to the seed.

In other words, the agent found something that “works OK” and got stuck on it.

Chaotic exploration (“crazy” agent)

Signs of chaotic exploration are when:

  • reward is close to 0 for a long time,
  • entropy is constantly very high,
  • learning does not stabilize,
  • agent does random actions without direction,
  • agent visits many states but does not progress.

In other words, the agent walks around, but does not learn anything significant.

Premature exploitation

Signs of premature exploitation are when:

  • reward seems good initially but then does not increase at all,
  • entropy becomes close to 0 very early,
  • the agent follows the same trajectory every time,
  • agent seems “competent”, but capped.

In other words, the agent “seems smart, but is limited.” This is one of the most dangerous situations in RL.


6. How to control the intensity of exploration (depending on the algorithm)


Exploration is not controlled only with epsilon. The method depends on the algorithm family:

  • Value-based (DQN, Q-Learning),
  • Policy Gradient / Actor-Critic (PPO, A2C, A3C),
  • Continuous modern control (SAC).

It is important to remember that each family has its own exploration “dial.”

6.1 Value-Based RL (Discrete)

Algorithms: DQN (Deep Q-Learning), Q-Learning, SARSA, etc.

How do you control exploration here?

Exploration control is mainly done by:

  • initial ε,
  • how ε decreases,
  • how low ε goes.

Initial ε: how “curious” the agent is at the beginning

Training usually starts with a large epsilon (≈1.0) so that the agent explores massively. The goal is to discover the structure of the environment.

If we start with an ε that is too small, the agent thinks it knows the world and doesn’t see the good options.

Decay strategies – how to “calm down” exploration

The most used methods are:

  • linear decay,
  • exponential decay,
  • piecewise decay,
  • adaptive decay.

The goal is a gradual decrease, so that the agent does not slide into premature exploitation. If the decay is too slow, the agent keeps acting mostly at random and learning becomes difficult.
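A minimal sketch of the two most common schedules, linear and exponential decay. The start value, the floor of 0.05, the number of decay steps and the decay rate are illustrative defaults, not tuned recommendations.

```python
def linear_epsilon(step, eps_start=1.0, eps_end=0.05, decay_steps=10_000):
    """Decrease epsilon in a straight line, then hold it at eps_end."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def exponential_epsilon(step, eps_start=1.0, eps_end=0.05, decay_rate=0.999):
    """Decay quickly at first and approach eps_end without ever reaching it."""
    return eps_end + (eps_start - eps_end) * decay_rate ** step

for step in (0, 1_000, 5_000, 10_000, 20_000):
    print(step, linear_epsilon(step), exponential_epsilon(step))
```

Both schedules keep a non-zero floor, so the agent never becomes fully deterministic during training.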

Common mistakes in value-based RL

1. ε decreases too quickly

  • the agent enters premature exploitation,
  • “looks good early“, then caps out.

2. ε remains too high for too long

  • the reward stays small and the agent remains “confused.”

3. Final ε = 0

  • the agent becomes completely rigid,
  • zero diversity –> sensitive to chance.

4. Replay buffer too small

  • exploration without stable learning.

6.2 Policy Gradients / Actor-Critic

For algorithms like PPO, A2C, A3C, TRPO, etc.

Here exploration is not controlled with epsilon. Exploration is part of the structure. Policies are stochastic by design.

Main lever: entropy coefficient

  • high entropy coefficient → agent is more curious,
  • low entropy coefficient → agent becomes confident, exploits.

When do you increase it?

  • at the beginning of training,
  • in large environments,
  • when the agent gets stuck,
  • when you see entropy decreasing too early,
  • when the results depend a lot on the seed.

When do you decrease it?

  • late in training,
  • when the policy is already stable,
  • when the reward oscillates due to too much exploration,
  • when the agent “knows what to do” but continues to be too curious.

Common mistakes in Actor-Critic

1. Entropy coefficient too small

  • policy becomes almost deterministic,
  • premature lock-in.

2. Entropy coefficient too large

  • agent learns slowly,
  • “noisy” reward without clear progress.

3. Misunderstanding the idea

  • entropy should decrease over time, but in a controlled way – not forced instantly.

6.3 Modern Continuous Control (SAC)

The SAC algorithm is an important special case.

The key concept is that:

  • Exploration is not an add-on,
  • It is in the agent’s objective.

SAC maximizes:

  • reward,
  • entropy.
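Written out, the maximum entropy objective that SAC optimizes is usually stated as below. The notation is an assumption added here for illustration: r is the reward, α the temperature, H the entropy of the policy at state s_t, and ρ_π the state-action distribution induced by the policy π.

```latex
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
  \left[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right]
```

Setting α = 0 recovers the standard expected-return objective, so α is exactly the weight given to the entropy term.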

Temperature (α) / Entropy Target

Directly controls how important entropy is.

  • Large α → curious agent,
  • Small α → safer agent.

In modern implementations α is adjusted automatically. The agent learns by itself how curious it should be.
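The automatic adjustment can be sketched as one gradient step on a temperature loss. This is a simplified, framework-free illustration under stated assumptions: the loss form `-log_alpha * (log_prob + target_entropy)`, the heuristic target entropy of minus the action dimension, and all numeric values are chosen for the example rather than taken from a specific library.

```python
import math

def update_log_alpha(log_alpha, log_probs, target_entropy, lr=0.01):
    """One gradient-descent step on the temperature loss

        J(alpha) = mean( -log_alpha * (log_prob + target_entropy) )

    whose gradient w.r.t. log_alpha is -(mean(log_prob) + target_entropy).
    """
    mean_log_prob = sum(log_probs) / len(log_probs)
    grad = -(mean_log_prob + target_entropy)
    return log_alpha - lr * grad

# Assumed heuristic: target entropy = -(number of action dimensions).
action_dim = 2
target_entropy = -float(action_dim)
log_alpha = math.log(0.2)

# Policy too deterministic: entropy estimate -mean(log_prob) = -3.0 is below the target of -2.0.
too_confident = [3.0, 3.2, 2.8]
raised = update_log_alpha(log_alpha, too_confident, target_entropy)

# Policy too random: entropy estimate 1.0 is above the target of -2.0.
too_random = [-1.0, -1.1, -0.9]
lowered = update_log_alpha(log_alpha, too_random, target_entropy)

print(math.exp(raised) > math.exp(log_alpha))   # alpha goes up
print(math.exp(lowered) < math.exp(log_alpha))  # alpha goes down
```

Optimizing `log_alpha` instead of α directly keeps the temperature positive without any clipping.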

Why does exploration not disappear completely in SAC?

Because:

  • SAC does not want a rigid policy,
  • looks for robust policies,
  • maintains a healthy level of diversity.

The result is that:

  • SAC has stable policies,
  • SAC is robust to noise,
  • SAC is good for real robotics.

Common mistakes in SAC

1. Forcing entropy down manually breaks one of the advantages of SAC.

2. Tuning α without understanding what it does leads to unstable performance.

3. Treating SAC as “PPO + noise” misses the point: it follows a completely different philosophy.


7. Real-world Case Study: Exploration vs. Exploitation — What do the graphs show, what does the agent learn, and what is the truth?

Sometimes, in Reinforcement Learning, the graphs lie. FrozenLake shows us why “no exploration seems better,” even though we know it shouldn’t. Let’s see what really happens.

In theory, exploration vs. exploitation seems simple. More exploration means better learning. In practice, the reality is a little more complex and much more interesting.

In this chapter, you’ll see for yourself what happens when we put three identical DQN agents to work and change only the way they explore the environment.

Why FrozenLake?

I chose FrozenLake-v1 from Gymnasium (gymnasium.farama.org), with the 8×8 map and is_slippery=True, because:

  • it is small –> so accessible for beginners,
  • it is stochastic –> it sometimes “slips” even if the agent chooses correctly,
  • it has sparse rewards –> you only get 1 if you reach the goal.

It is perfect for understanding the RL intuition. But it is not the ideal environment for dramatic differences between exploration strategies. In larger and harder environments, the differences become huge.

What I did specifically (a clear, controlled experiment, as done in industry)

I trained 3 identical DQN agents, under the same conditions, with the same hyperparameters:

  • learning_rate = 1e-3
  • buffer_size = 50,000
  • learning_starts = 1,000
  • batch_size = 64
  • gamma = 0.99
  • target_update_interval = 1,000
  • train_freq = 4
  • total_timesteps = 500,000

The only thing that changed was exploration.

The three characters in the experiment

1. Agent without exploration

Practically pure exploitation:

  • eps_initial = 0.0,
  • eps_final = 0.0,
  • fraction = 0.0.

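The agent never explores through ε, so it “gets stuck” on whatever it learns early. (In Stable-Baselines3 the only random actions come from the warm-up phase before learning_starts.)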

2. Agent with very high exploration

Aggressive, almost permanent exploration:

  • eps_initial = 1.0,
  • eps_final = 0.9,
  • fraction = 1.0.

The agent is almost always very random and difficult to stabilize.

3. Agent with controlled exploration (the “normal” one)

Slowly decreasing exploration:

  • eps_initial = 1.0,
  • eps_final = 0.1,
  • fraction = 0.6.

The agent explores a lot at first and then gradually starts to exploit what it has learned.
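To see what these three settings mean over the 500,000 training steps, the sketch below evaluates a linear schedule at a few points. It assumes that fraction is the share of training over which ε is annealed, which is how Stable-Baselines3 documents exploration_fraction; the function `linear_schedule` is a hypothetical re-implementation for illustration, not the library's code.

```python
def linear_schedule(initial_eps, final_eps, fraction, step, total_steps):
    """Epsilon after `step` of `total_steps`, annealed linearly over the
    first `fraction` of training and held at final_eps afterwards."""
    progress = step / total_steps
    if fraction <= 0.0 or progress >= fraction:
        return final_eps
    return initial_eps + (progress / fraction) * (final_eps - initial_eps)

# (eps_initial, eps_final, fraction) for the three agents in the experiment.
AGENTS = {
    "no_exploration": (0.0, 0.0, 0.0),
    "high_exploration": (1.0, 0.9, 1.0),
    "controlled_exploration": (1.0, 0.1, 0.6),
}
TOTAL = 500_000

for name, (eps0, eps1, frac) in AGENTS.items():
    row = [
        round(linear_schedule(eps0, eps1, frac, t, TOTAL), 3)
        for t in (0, 150_000, 300_000, 500_000)
    ]
    print(f"{name:24s} {row}")
```

The controlled agent reaches its floor of 0.1 after 60% of training (300,000 steps), while the high-exploration agent is still acting randomly at least 90% of the time when training ends.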

What does the graph in TensorBoard show?

Figure: TensorBoard training-reward curves for the 3 identical DQN agents (Exploration vs Exploitation).

  • no exploration –> graph looks best,
  • controlled exploration –> looks weaker,
  • high exploration –> almost zero most of the time.

If you just look at the graph, you would think: “Wow, no exploration is the best agent!“

But that would be a wrong conclusion.

The important truth is that reward in training is different from actual performance.

During training:

  • the agent is still exploring,
  • and exploring means intentionally doing bad things sometimes,
  • so the reward goes down.

In contrast:

  • without exploration –> the agent plays “cleanly,”
  • doesn’t do random actions,
  • so the reward seems better.

But the graph only says: “How well the agent plays while still experimenting.”

It doesn’t say: “How good the learned policy is.”

The graph in TensorBoard doesn’t measure how good the policy is. It measures how well the agent plays while still exploring.

Why does the graph without exploration look “better”?

The simple explanation for the no exploration case:

  • the agent “fixes” itself on a single strategy,
  • it repeats itself,
  • it seems stable,
  • the reward looks “nice,”
  • but the policy is fragile and limited.

With exploration

  • the agent tries new things,
  • sometimes makes mistakes,
  • the reward “looks ugly,”
  • but the policy becomes many times more robust.

The correct verdict: deterministic demo

To find out the truth, we ran the agents without exploration, in a deterministic way:

  • without epsilon,
  • without random actions,
  • only the learned policy,
  • for 10 episodes.

Figure: Deterministic demo. Panels: Controlled exploration, No exploration, High exploration.

What does the image actually show?

Deterministic demo results:

  • Controlled exploration –> 8 / 10
  • High exploration –> 5 / 10
  • No exploration –> 2 / 10

Main conclusion

  • Controlled exploration produces the best policy.
  • Lack of exploration produces poor policy.
  • Too much exploration produces instability.

Exploration vs Exploitation is not “more vs less“. It’s how much, when and how.

1. No Exploration (2 / 10)

The agent:

  • quickly reaches a strategy,
  • appears stable,
  • but is weak and fragile,
  • has probably “learned something locally,”
  • only reproducible if the environment favors it.

Early policy lock-in + poor data = poor performance in reality. Nice reward in training ≠ good agent in reality.

2. High Exploration (5 / 10)

The agent:

  • had enough exploration to learn something useful,
  • but too aggressive exploration led to:
    • unstable learning,
    • weak policy consolidation,
    • high variation between episodes.

The agent explores a lot but does not stabilize the policy. It is a good illustration that a lot of exploration ≠ good exploration.

3. Controlled Exploration (8 / 10)

The schedule:

  • initial exploration –> collects good data,
  • gradually decreases –> stabilizes,
  • produces robust policy.

It is the right balance between exploration and exploitation. It illustrates:

  • why exploration schedule matters,
  • why exploration needs to be structured,
  • why modern RL does not just use “epsilon random forever.”

These results should not be read as exact numbers:

  • FrozenLake is stochastic,
  • 10 episodes per agent are a small sample, not rigorous statistics, although the differences were consistent,
  • so the scores mainly reinforce the conceptual idea.
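How little 10 episodes can pin down is easy to check with a confidence interval. The sketch below applies the standard Wilson score interval to the three demo scores; the choice of this interval and of z = 1.96 (roughly 95%) is an assumption made for illustration, not part of the original experiment.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a success probability."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Scores from the deterministic demo: successes out of 10 episodes.
for label, wins in [("controlled", 8), ("high", 5), ("none", 2)]:
    low, high = wilson_interval(wins, 10)
    print(f"{label:10s} {wins}/10 -> [{low:.2f}, {high:.2f}]")
```

With only 10 episodes the intervals are wide and those of neighbouring agents overlap, so running more episodes and more seeds is the way to turn the trend into a firm ranking.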

The results are not absolute, but the trend is clear and repeatable:

  • no exploration –> weak,
  • chaotic exploration –> unstable,
  • controlled exploration –> most robust policy.

Before running the deterministic demo and trying to reproduce the results from this chapter, make sure your environment is correctly set up. You need a working RL stack with Gymnasium, Stable-Baselines3, PyTorch, and TensorBoard, all installed properly and compatible with each other.
If you do not already have this configuration ready, please first follow the dedicated setup guide: Tutorial: How to Install Stable-Baselines3 the Right Way (Windows & Linux) — PyTorch + Gymnasium

Only after completing that tutorial and confirming that SB3 + Gymnasium + TensorBoard work correctly on your machine should you proceed with the script below and run the training and deterministic demo experiments.

"""
DQN Training and Demo Script for FrozenLake-v1
-----------------------------------------------

Train or test a DQN agent using Stable Baselines3 on the FrozenLake-v1 environment.

This script is specifically designed to illustrate the
exploration vs. exploitation trade-off by changing the DQN
exploration parameters.

SCENARIOS (exploration_mode):
    1) no_exploration
       - The agent always exploits (epsilon = 0).
       - It almost never discovers the goal state unless the initialization is lucky.

    2) high_exploration
       - The agent keeps a very high epsilon for a long time.
       - It explores a lot but may struggle to converge to a stable policy.

    3) controlled_exploration
       - Start with high epsilon, slowly reduce it to a small value.
       - Usually leads to the best balance in practice.

TRAINING EXAMPLES:
    python dqn_frozenlake.py --train --exploration_mode no_exploration
    python dqn_frozenlake.py --train --exploration_mode high_exploration
    python dqn_frozenlake.py --train --exploration_mode controlled_exploration

DEMO EXAMPLES:
    python dqn_frozenlake.py --demo --model PATH_TO_MODEL.zip

Author: Calin Dragos George 
Available: reinforcementlearningpath.com
"""

import argparse
import os
import time

import gymnasium as gym
import numpy as np
import torch

from stable_baselines3 import DQN
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.logger import configure


# ---------------------------------------------------------
# Utility: Set seeds for reproducibility
# ---------------------------------------------------------
def set_seed(seed: int) -> None:
    """
    Set random seeds for numpy and torch to make experiments
    as reproducible as possible.
    """
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


# ---------------------------------------------------------
# Observation Wrapper: One-hot encode discrete states
# ---------------------------------------------------------
class OneHotWrapper(gym.ObservationWrapper):
    """
    Convert discrete state index into a one-hot encoded vector.

    FrozenLake-v1 has a discrete observation space (0..N-1).
    Stable Baselines3 can one-hot encode Discrete observations internally;
    this wrapper makes the encoding explicit by exposing a Box observation
    in which each integer state becomes a 1D one-hot vector.
    """

    def __init__(self, env: gym.Env):
        super().__init__(env)
        assert isinstance(
            env.observation_space, gym.spaces.Discrete
        ), "OneHotWrapper only works with Discrete observation spaces."

        self.n = env.observation_space.n
        self.observation_space = gym.spaces.Box(
            low=0.0,
            high=1.0,
            shape=(self.n,),
            dtype=np.float32,
        )

    def observation(self, obs):
        one_hot = np.zeros(self.n, dtype=np.float32)
        one_hot[obs] = 1.0
        return one_hot


# ---------------------------------------------------------
# Create FrozenLake Environment
# ---------------------------------------------------------
def make_env(render: bool = False, is_slippery: bool = True) -> gym.Env:
    """
    Create a FrozenLake-v1 environment and wrap it for:
    - optional rendering
    - one-hot observations for DQN compatibility
    - monitoring episode returns
    """
    # render_mode=None (the default) disables rendering during training
    env = gym.make(
        "FrozenLake-v1",
        map_name="8x8",
        is_slippery=is_slippery,
        render_mode="human" if render else None,
    )

    # One-hot encode the discrete state
    env = OneHotWrapper(env)

    # Monitor to record episode statistics
    env = Monitor(env)

    return env


# ---------------------------------------------------------
# Map exploration modes to DQN epsilon schedule
# ---------------------------------------------------------
def get_exploration_params(exploration_mode: str):
    """
    Return a dictionary with exploration parameters for DQN based
    on the chosen exploration_mode.

    The main parameters are:
        - exploration_initial_eps
        - exploration_final_eps
        - exploration_fraction

    These control how epsilon decays over time.
    """
    if exploration_mode == "no_exploration":
        # No epsilon-greedy exploration: epsilon = 0 always
        # (SB3 still samples random actions during the first learning_starts steps)
        return dict(
            exploration_initial_eps=0.0,
            exploration_final_eps=0.0,
            exploration_fraction=0.0,
        )

    elif exploration_mode == "high_exploration":
        # Very high epsilon almost all the time:
        # the agent is very random, hard to converge.
        return dict(
            exploration_initial_eps=1.0,
            exploration_final_eps=0.9,
            exploration_fraction=1.0,
        )

    elif exploration_mode == "controlled_exploration":
        # Standard schedule:
        # start with high epsilon and gradually reduce to a small value.
        return dict(
            exploration_initial_eps=1.0,
            exploration_final_eps=0.1,
            exploration_fraction=0.6,
        )

    else:
        raise ValueError(
            f"Unknown exploration_mode '{exploration_mode}'. "
            f"Valid options: no_exploration, high_exploration, controlled_exploration."
        )


# ---------------------------------------------------------
# Create DQN Model
# ---------------------------------------------------------
def create_model(env: gym.Env, lr: float, log_dir: str, exploration_mode: str) -> DQN:
    """
    Create a DQN model with a specific exploration schedule determined by 'exploration_mode'.
    """
    exploration_params = get_exploration_params(exploration_mode)

    model = DQN(
        "MlpPolicy",
        env,
        learning_rate=lr,
        buffer_size=50_000,
        learning_starts=1_000,
        batch_size=64,
        gamma=0.99,
        target_update_interval=1_000,
        train_freq=4,
        verbose=1,
        tensorboard_log=log_dir,
        **exploration_params,
    )

    logger = configure(log_dir, ["stdout", "tensorboard"])
    model.set_logger(logger)

    # Log hyperparameters for reproducibility and analysis
    model.logger.record("hyperparams/learning_rate", lr)
    model.logger.record("hyperparams/buffer_size", 50_000)
    model.logger.record("hyperparams/batch_size", 64)
    model.logger.record("hyperparams/exploration_mode", exploration_mode)

    for k, v in exploration_params.items():
        model.logger.record(f"hyperparams/{k}", v)

    return model


# ---------------------------------------------------------
# Build experiment tag for filenames
# ---------------------------------------------------------
def build_experiment_tag(exploration_mode: str, is_slippery: bool) -> str:
    """
    Build a short experiment tag for saving models and logs.
    """
    slip_tag = "slippery" if is_slippery else "nonSlippery"
    return f"{exploration_mode}_{slip_tag}"


# ---------------------------------------------------------
# Train DQN
# ---------------------------------------------------------
def train_dqn(
    lr: float,
    timesteps: int,
    seed: int,
    exploration_mode: str,
    is_slippery: bool,
) -> None:
    """
    Train a DQN agent on FrozenLake-v1 with a given exploration mode.
    """
    set_seed(seed)

    env = make_env(is_slippery=is_slippery)

    exp_tag = build_experiment_tag(exploration_mode, is_slippery)
    timestamp = time.strftime("%Y%m%d-%H%M%S")
    log_dir = f"logs/DQN_FrozenLake_{exp_tag}_{timestamp}"
    os.makedirs(log_dir, exist_ok=True)

    print("\nTraining DQN on FrozenLake-v1")
    print(f"LR: {lr} | Seed: {seed}")
    print(f"Exploration mode: {exploration_mode}")
    print(f"is_slippery={is_slippery}")
    print(f"Logging to: {log_dir}\n")

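    model = create_model(env, lr, log_dir, exploration_mode)
    # set_seed() covers numpy and torch only; this also seeds the
    # action-space sampler and the environment through SB3.
    model.set_random_seed(seed)
    model.learn(total_timesteps=timesteps)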

    model_path = os.path.join(log_dir, f"DQN_FrozenLake_{exp_tag}.zip")
    model.save(model_path)

    print(f"\nModel saved to: {model_path}\n")
    env.close()


# ---------------------------------------------------------
# Demo DQN Model
# ---------------------------------------------------------
def run_demo(model_path: str, episodes: int, is_slippery: bool) -> None:
    """
    Run a trained DQN model on FrozenLake-v1 and render the environment.
    """
    if not os.path.exists(model_path):
        print(f"\nModel not found: {model_path}\n")
        return

    env = make_env(render=True, is_slippery=is_slippery)
    model = DQN.load(model_path)

    for ep in range(episodes):
        obs, _ = env.reset()
        done = False
        total_reward = 0.0

        while not done:
            # deterministic=True for exploitation in demo
            action, _ = model.predict(obs, deterministic=True)
            action = int(np.asarray(action).item())
            obs, reward, terminated, truncated, _ = env.step(action)
            total_reward += reward
            done = terminated or truncated

        print(f"Episode {ep + 1}: Reward = {total_reward}")

    env.close()


# ---------------------------------------------------------
# CLI
# ---------------------------------------------------------
if __name__ == "__main__":
    parser = argparse.ArgumentParser()

    parser.add_argument("--train", action="store_true")
    parser.add_argument("--demo", action="store_true")

    parser.add_argument("--lr", type=float, default=1e-3)
    parser.add_argument("--timesteps", type=int, default=500_000)
    parser.add_argument("--seed", type=int, default=1)

    parser.add_argument("--model", type=str, default=None)
    parser.add_argument("--episodes", type=int, default=10)

    parser.add_argument(
        "--exploration_mode",
        type=str,
        default="controlled_exploration",
        choices=["no_exploration", "high_exploration", "controlled_exploration"],
        help="Exploration schedule for DQN.",
    )

    parser.add_argument(
        "--non_slippery",
        action="store_true",
        help="If set, use FrozenLake with is_slippery=False (easier task).",
    )

    args = parser.parse_args()

    is_slippery = not args.non_slippery

    if args.train:
        train_dqn(
            lr=args.lr,
            timesteps=args.timesteps,
            seed=args.seed,
            exploration_mode=args.exploration_mode,
            is_slippery=is_slippery,
        )

    elif args.demo:
        if args.model is None:
            print("\nERROR: missing --model path\n")
        else:
            run_demo(
                model_path=args.model,
                episodes=args.episodes,
                is_slippery=is_slippery,
            )

    else:
        print("\nPlease specify --train or --demo.\n")


8. Common Mistakes That Destroy Exploration in RL

8.1 “The agent explores… but gets nowhere”

What does it actually mean? The agent:

  • moves around a lot,
  • tries different things,
  • the graph shows activity,

but:

  • does not reach relevant states,
  • does not progress,
  • does not learn useful policies.

Exploration exists, but it is:

  • shallow,
  • local,
  • and undirected.

Why does it happen?

  • random actions ≠ intelligent exploration,
  • the state space is large,
  • the reward is rare,
  • there is no motivation to reach new meaningful areas,
  • policies do not guide exploration.

How do you recognize this behavior?

  • constantly low reward,
  • constantly high entropy,
  • the agent seems “active” but makes no progress,
  • the distribution of visited states is chaotic and unstructured,
  • the seeds give completely different results.

What does the practitioner need to know?

“A lot of exploration” does not equal “good exploration”. Exploration must:

  • reach relevant states,
  • have a purpose,
  • be structured.

That is why techniques such as these exist:

  • curiosity,
  • intrinsic reward,
  • exploration bonuses.
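One of the simplest exploration bonuses is count-based: the agent receives a small extra reward that shrinks as a state is revisited. The sketch below assumes hashable (for example integer) states and the common form β / sqrt(N(s)); the class name `CountBonus` and the value of β are illustrative choices.

```python
from collections import defaultdict
import math

class CountBonus:
    """Intrinsic reward beta / sqrt(N(s)): large for rarely visited states."""

    def __init__(self, beta=0.1):
        self.beta = beta
        self.counts = defaultdict(int)

    def __call__(self, state):
        self.counts[state] += 1
        return self.beta / math.sqrt(self.counts[state])

bonus = CountBonus(beta=0.1)
extrinsic = 0.0  # sparse environment reward: usually zero

# State 5 is visited repeatedly, state 42 is new.
for s in (5, 5, 5, 5, 42):
    total = extrinsic + bonus(s)
    print(s, round(total, 3))
```

The bonus for state 5 shrinks with every visit while the unseen state 42 gets the full bonus, which nudges the agent toward new areas even when the environment reward is silent.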

8.2 “Exploration declines too quickly”

What does it mean?

  • aggressive epsilon decay,
  • small entropy coefficient too early,
  • deterministic agent very quickly.

The agent seems to be doing well at first… but then:

  • flattens,
  • becomes rigid,
  • stops learning.

Why is it a problem?

Because:

  • the agent does not yet know the environment,
  • believes that its current solution is the “truth,”
  • enters policy lock-in,
  • stops discovering alternatives.

How do you recognize this behavior in practice?

  • reward increases rapidly,
  • then becomes flat,
  • entropy decreases steeply,
  • very repetitive actions,
  • extremely sensitive to seed.

What do you need to know?

Exploration is not just for the start of training. It must:

  • stay long enough,
  • gradually decrease,
  • allow for refinement, not freeze instantly.

8.3 “I confuse noise with exploration”

What conceptual mistake is occurring?

The mistaken belief is: “If we have noise on actions, we have good exploration.” This is wrong.

Noise:

  • only disrupts the decision,
  • is local,
  • does not direct the agent,
  • on its own, can destabilize learning.

In real exploration:

  • new states are discovered,
  • trajectories are changed,
  • experience is restructured,
  • useful data is collected.

How do you recognize this case?

  • noisy reward but no increasing trend,
  • high entropy without progress,
  • agent seems “crazy”,
  • chaotic, non-strategic behavior.

What should we know?

Noise is aimless movement. Exploration is informed search.

That’s why modern RL uses:

  • stochastic policies,
  • controlled entropy,
  • maximum entropy objectives,
  • not just “noise injection.”

8.4 “Tune the reward instead of exploration”

The real problem

When RL goes wrong, a common reflex is to:

  • change reward shaping,
  • add penalties,
  • modify values,
  • complicate the reward.

But this does not address the real cause: the agent does not see the good states and does not get where it needs to be. A good reward design cannot compensate for poor exploration.

What should we understand from this case?

RL only learns from the data it sees. If the agent:

  • doesn’t reach the place where the reward exists,
  • or reaches it very rarely,

then not even the most perfect reward helps.

How can such a mistake be recognized?

  • you make 20 reward variants,
  • nothing changes significantly,
  • the policy remains superficial,
  • poor progress regardless of shaping.

The real solution in many cases:

  • the reward is not the problem; the problem is the exploration.

In this case, you need:

  • curiosity,
  • bonuses,
  • structured exploration,
  • efficiently controlled entropy.

9. Conclusion: Exploration vs Exploitation is not a parameter

Exploration vs Exploitation is not:

  • “epsilon,”
  • “a config setting,”
  • “a button you make bigger or smaller.”

Exploration vs Exploitation is a data collection strategy that completely shapes what the agent can learn.

Exploration decides:

  • what the agent sees,
  • what experiences it records,
  • what relationships it can discover,
  • how robust the policy becomes,
  • whether it will ever reach the important states.

If the data is limited, biased, or bad, any RL algorithm, no matter how advanced, will fail.

Exploration differs depending on the problem.

9.1 Discrete vs Continuous

In discrete RL (DQN, FrozenLake, CartPole), exploration is often about actions and policies. In continuous RL / robotics, exploration becomes more about:

  • diversity of trajectories,
  • robustness,
  • stability,
  • well-controlled stochastic policies.

This is where “epsilon” is no longer enough.

9.2 Sparse vs Dense Rewards

When we have dense rewards, simple exploration can work.

When we use sparse rewards without curiosity, intrinsic motivation, or bonuses, the agent has nothing to learn from. In other words, the agent receives no signal.

Exploration becomes the only way the agent creates the learning context.

9.3 Value-based vs Policy-based RL

For value-based algorithms (DQN, Q-learning), exploration is usually “glued” on top of the policy (epsilon, noise). In the case of policy-based algorithms (PPO, A2C, SAC), exploration is “in the agent’s DNA”:

  • stochastic policies,
  • entropy,
  • maximum entropy RL.

Therefore, a practitioner should not think of exploration in the same way for all types of RL.

Why are most failures in RL actually failures of exploration?

The most common real causes of “RL not working” are when:

  • the agent does not see enough relevant states,
  • the agent gets where it matters too rarely,
  • exploration decreases too quickly,
  • exploration is chaotic and unproductive,
  • exploration is structurally limited by policy,
  • the agent learns from biased data.

In these cases, it seems that “the algorithm is weak” or “the reward is bad.” But the truth is that the agent did not have the necessary data to be able to learn.

Interpreting graphs

An RL practitioner should know:

  • a beautiful graph can hide poor exploration,
  • an ugly graph can hide good and useful exploration,
  • the training reward is contaminated by exploration.

The training reward shows “how well the agent plays while still exploring.” But it does not show how good the learned policy is.

They are two different things and it is critical not to confuse them.

The importance of the deterministic demo

For real evaluation, the agent is run without exploration, deterministically on multiple episodes and on multiple seeds if possible.

In this way we see:

  • what it learned,
  • how stable it is,
  • how robust it is,
  • whether the result is luck or real learning.
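A multi-episode, multi-seed evaluation loop can be sketched without any RL library. Everything below is a stand-in chosen for illustration: `SlipperyCorridor` is a hypothetical toy environment (not FrozenLake) that only imitates the Gymnasium reset/step return convention, and `always_right` plays the role of a trained model queried deterministically.

```python
import random

class SlipperyCorridor:
    """Toy stochastic environment (hypothetical stand-in, not FrozenLake).

    States 0..4; reaching state 4 gives reward 1. Each move is reversed with
    probability `slip`. Episodes are cut off after `max_steps` steps.
    """

    def __init__(self, slip=0.2, max_steps=12):
        self.slip, self.max_steps = slip, max_steps
        self.rng = random.Random()

    def reset(self, seed=None):
        if seed is not None:
            self.rng.seed(seed)
        self.pos, self.t = 0, 0
        return self.pos, {}

    def step(self, action):  # action: 0 = left, 1 = right
        move = 1 if action == 1 else -1
        if self.rng.random() < self.slip:
            move = -move
        self.pos = max(0, self.pos + move)
        self.t += 1
        terminated = self.pos == 4
        truncated = self.t >= self.max_steps and not terminated
        return self.pos, float(terminated), terminated, truncated, {}

def evaluate(policy, env, episodes, seed):
    """Success rate of a deterministic policy (no epsilon, no sampling)."""
    wins = 0
    for ep in range(episodes):
        # Seed only the first reset so the whole run is reproducible.
        obs, _ = env.reset(seed=seed if ep == 0 else None)
        done = False
        while not done:
            obs, reward, terminated, truncated, _ = env.step(policy(obs))
            wins += int(reward > 0)
            done = terminated or truncated
    return wins / episodes

def always_right(obs):
    """Stands in for a trained model's deterministic action choice."""
    return 1

rates = [evaluate(always_right, SlipperyCorridor(), episodes=200, seed=s) for s in range(5)]
print([round(r, 2) for r in rates])
print("mean:", round(sum(rates) / len(rates), 3), "spread:", round(max(rates) - min(rates), 3))
```

Even a fixed policy shows some spread across seeds in a stochastic environment, which is why a single short demo cannot separate luck from real learning.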

The importance of the environment in interpreting the results

Exploration is not universal. There is no “best exploration.” Exploration depends on:

  • type of environment,
  • size of state space,
  • type of actions,
  • whether the reward is sparse or dense,
  • whether the environment is stochastic,
  • whether there are traps and “local optima,”
  • how long the episodes are.

An agent that seems “weak” in a simple demo may be excellent in a complex environment.

An agent that seems “stable” in a simple test may completely fail in a real one.

The environment dictates strategy.


10. Practical checklist (for real projects)

10.1 What to check when the agent doesn’t learn

Instead of “trying things,” you should follow a logical order. Most of the time, the problem is not the algorithm, but the process.

Step 1 – Check if the agent sees something useful

Critical questions:

  • Do the observations have the necessary information?
  • Are the sensors / features relevant?
  • Is there a large delay or insufficient information?

If the agent doesn’t see what it needs, it has nothing to learn from.

Step 2 – Check if it reaches relevant states

This often happens:

  • the agent walks around, but doesn’t get where it matters,
  • episodes end before it sees a useful reward.

In this case, you need to check:

  • the distribution of states visited,
  • whether the episodes reach interesting areas.

If not, exploration is the problem.
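Checking the distribution of visited states can be as simple as counting them. The sketch below assumes that episodes were logged as lists of integer state indices; the function name `visitation_report`, the 64-state grid, and the sample trajectories are hypothetical.

```python
from collections import Counter

def visitation_report(trajectories, n_states, goal_state):
    """Summarise which states a batch of episodes actually reached."""
    counts = Counter(s for episode in trajectories for s in episode)
    return {
        # share of all states seen at least once
        "coverage": len(counts) / n_states,
        # how often the rewarding state was reached
        "goal_visits": counts.get(goal_state, 0),
        # where the agent spends most of its time
        "top_states": counts.most_common(3),
    }

# Hypothetical logged episodes on a 64-state grid (state indices 0..63).
episodes = [
    [0, 1, 0, 8, 0, 1, 2, 1],
    [0, 8, 9, 8, 0, 1, 0, 0],
    [0, 1, 2, 10, 2, 1, 0, 8],
]
report = visitation_report(episodes, n_states=64, goal_state=63)
print(report)
```

Low coverage combined with zero goal visits is the signature of an agent that “walks around” near the start and never reaches the area where the reward lives.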

Step 3 – Check if the reward reaches the agent

Questions:

  • is it getting rewarded frequently enough?
  • is the reward too small? too large?
  • can you see that the reward is actually being applied in the code?
  • are there sign bugs (do you reward when you should penalize)?

If the reward is not transmitted clearly, the agent has no learning signal.

Step 4 – Check the stability of the training

  • is the loss exploding?
  • are the Q-values growing unrealistically?
  • is the entropy decreasing too quickly?
  • is the performance varying enormously between seeds?

If so, the problem is stability / parameterization, not exploration.

10.2 What parameters not to change at the beginning

When RL doesn’t work, people reflexively do exactly what they shouldn’t, namely:

Don’t start with hyperparameter tuning

Don’t change immediately:

  • learning rate,
  • gamma,
  • batch size,
  • network architecture,
  • target update,
  • replay buffer size.

These rarely solve fundamental problems.

Don’t start by reinventing the reward

When RL doesn’t learn, the usual reflex is that:

  • new terms are added,
  • weights are changed,
  • reward shaping is complicated.

Often the reward is not the problem, but the fact that the agent doesn’t get to see it.

Don’t change the algorithm too early

“DQN doesn’t work → I try PPO → I try SAC → I try something else“

This is a trap. If two different algorithms fail the same way, it’s not the algorithm that’s the problem. The problem is the environment, the exploration, or the settings.

10.3 How to quickly test if the problem is EXPLORATION or REWARD

Test 1 – Run deterministic demo after some training

If the reward in training is “ugly,” but the deterministic demo goes well, then exploration is aggressive (normal), the policy is OK, so it’s not a real bug.

If training seems good, but the deterministic demo is weak, then the agent exploits in training without learning stably, which means that exploration is insufficient.

Test 2 – Increase exploration intentionally

Examples:

  • make epsilon decay more slowly,
  • increase the entropy coefficient,
  • in SAC, raise the temperature (α).

If the real performance increases, the problem is exploration. The problem is not the reward or the algorithm.

Test 3 – Force the agent to reach the reward at least once

Methods:

  • start the agent closer to the goal,
  • shorten episodes,
  • implement minimal shaping.

If after that it learns, the original problem was the lack of exposure to the reward. This is the case of insufficient exploration or sparse rewards.

If the agent still does not learn, the problem is the reward design or faulty logic.

Test 4 – Run multiple seeds

If:

  • seed 1 = good,
  • seed 2 = disaster,
  • seed 3 = average.

Then the agent depends on luck in exploration, which means that exploration is not robust.

But if the results are weak for all seeds, something is fundamentally wrong.
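The seed comparison is easier to judge with a summary than by eye. In the sketch below, the success rates, the function name `seed_summary`, and the rule of thumb “standard deviation above half the mean” are all hypothetical values picked for illustration, not an established threshold.

```python
from statistics import mean, stdev

def seed_summary(success_by_seed):
    """Mean and spread of deterministic-evaluation success rates across seeds."""
    rates = list(success_by_seed.values())
    return {"mean": mean(rates), "std": stdev(rates), "min": min(rates), "max": max(rates)}

# Hypothetical success rates from three training runs that differ only in seed.
results = {1: 0.80, 2: 0.05, 3: 0.45}
summary = seed_summary(results)
print(summary)

# Illustrative threshold, not a standard: flag runs whose spread rivals their mean.
if summary["std"] > 0.5 * summary["mean"]:
    print("High variance across seeds: exploration is probably not robust.")
elif summary["max"] < 0.2:
    print("All seeds weak: look for a more fundamental problem.")
```

Three seeds are the bare minimum for a standard deviation to mean anything; more seeds give a more trustworthy picture.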

Tags: Exploitation, Exploration, Markov Decision Process, MDP (Markov Decision Process), POMDP
About the author

About Dragos Calin

Dragos Calin is a robotics engineer and reinforcement learning practitioner focused on building real-world autonomous and remote-controlled robotics for agriculture, edge-AI robotics, and embedded platforms. His work joins simulation, machine learning, and hardware deployment, with a strong emphasis on practical, testable solutions that function outside the lab.

Areas of Expertise:

  • # Reinforcement Learning for Robotics
  • # Autonomous Agricultural Robots
  • # Embedded Systems & Edge AI (Jetson, Raspberry Pi, Arduino)
  • # Robotic Simulation & Sim2Real Workflow
  • # Sensor Fusion & Control Systems
  • # ROS-Based Robotics Development

© 2026 Reinforcement Learning Path
