Most PPO tutorials show you what to run. This one shows you how PPO actually works – and how to make it stable, reliable, and predictable.
In a few clear sections, you will walk through the full PPO workflow in Stable-Baselines3, step by step. You will understand what happens during rollouts, how GAE is computed, why clipping stabilizes learning, and how KL divergence protects the policy.
You will also learn the seven hyperparameters that control PPO’s performance. Each is explained with practical rules and intuitive analogies, so you know how to tune them with confidence.
A complete CartPole example is included, with reproducible code, recommended settings, and TensorBoard logging.
You will also learn how to read three essential training curves – ep_rew_mean, ep_len_mean, and approx_kl – and how to detect stability, collapse, or incorrect learning.
The tutorial ends with a brief look at PPO in robotics and real-world control tasks, so you can connect theory with practical applications.
TABLE OF CONTENTS
In this tutorial, I’ll cover the following subjects:
- BEFORE YOU BEGIN
- WORKFLOW PPO WITH Stable-Baselines3
- ESSENTIAL PPO HYPERPARAMETERS IN SB3
- PRACTICAL EXAMPLE: PPO ON CartPole-v1 WITH Gymnasium AND SB3
- THE THREE METRICS TO KNOW WHETHER PPO IS LEARNING
- PPO IN REAL PROBLEMS AND IN ROBOTICS
- SUMMING UP
- FAQ — PPO WITH Stable-Baselines3
BEFORE YOU BEGIN
Before proceeding with this tutorial, you should have Gymnasium, PyTorch, Stable-Baselines3, and TensorBoard installed on your computer. The following tutorials show how to install Stable-Baselines3, PyTorch, Gymnasium, and TensorBoard.
- How to Install OpenAI Gymnasium in Windows and Launch Your First Python RL Environment
- Tutorial: How to Install Stable-Baselines3 the Right Way (Windows & Linux): PyTorch + Gymnasium
WORKFLOW PPO WITH Stable-Baselines3

The workflow below is the structured pipeline for implementing and training a reinforcement learning agent with the Proximal Policy Optimization (PPO) algorithm from the Stable-Baselines3 (SB3) library.
The workflow
1. Initialize Dependencies and Create the Environment
- Install SB3 + Gymnasium,
- Create the Gymnasium environment (for our example – CartPole-v1),
- Add the Monitor wrapper for episodic logs,
- Optional: create vectorized environments (VecEnv) for speed and stability.
2. Reset the Environment and Prepare the Agent
PPO needs a policy network (actor) and a value network (critic). SB3 creates them automatically when you call:
model = PPO("MlpPolicy", env, ...)
The actor makes the decisions, the critic judges how good they are. SB3 hides all the complicated stuff. It gives you a “fully prepared brain.”
3. Collect Rollouts (The Experience Collection Phase)
- The agent plays in the environment for n_steps × n_envs timesteps,
- Each transition (observation, action, reward, plus the critic’s value estimate and the action’s log-probability) is stored in the rollout buffer,
- PPO does not learn while playing. It only learns after collecting a whole batch of experiences.
This is the part that most people don’t understand:
- PPO doesn’t update anything during the game,
- It just collects experiences, then stops and learns from them.
4. Compute Advantages Using GAE (Generalized Advantage Estimation)
- After the rollout is complete, SB3 uses the critic values recorded during the rollout to calculate:
- advantages,
- returns (the training targets for the critic).
- PPO uses GAE for a trade-off between bias and variance.
Without understanding GAE, you don’t understand PPO. It’s the advanced way the algorithm says, “How much better was this action than I expected?”
Info: I’m preparing a tutorial on GAE and how it works, with examples and insights.
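To make the GAE step concrete, here is a minimal pure-Python sketch of the backward recursion. It is an illustration under simple assumptions (one environment, plain lists), not SB3's actual implementation; the function name compute_gae and its arguments are hypothetical.

```python
def compute_gae(rewards, values, dones, last_value, gamma=0.99, gae_lambda=0.95):
    """Return (advantages, returns) for one rollout from a single environment.

    rewards[t]  : reward received after the action at step t
    values[t]   : critic estimate V(s_t) recorded during the rollout
    dones[t]    : True if the episode ended after step t
    last_value  : critic estimate for the state that follows the last step
    """
    n = len(rewards)
    advantages = [0.0] * n
    gae = 0.0
    # Walk backwards so each step can reuse the advantage of the step after it
    for t in reversed(range(n)):
        next_value = last_value if t == n - 1 else values[t + 1]
        not_done = 0.0 if dones[t] else 1.0
        # TD error: "how much better was this step than the critic expected?"
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # Exponentially weighted sum of future TD errors, reset at episode ends
        gae = delta + gamma * gae_lambda * not_done * gae
        advantages[t] = gae
    # The critic is trained to predict advantage + value
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


adv, ret = compute_gae(
    rewards=[1.0, 1.0, 1.0],
    values=[0.5, 0.5, 0.5],
    dones=[False, False, True],
    last_value=0.0,
)
print("advantages:", adv)
print("returns:", ret)
```

The rollout is processed backwards because each advantage depends on the advantages of the steps that follow it.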
5. Normalize Advantages
- SB3 automatically normalizes the advantages:
- eliminates different scales,
- makes learning stable.
- It is a critical step for PPO.
Without normalization, PPO can become unstable and difficult to control.
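As a rough illustration of what normalization does, the sketch below rescales a list of advantages to zero mean and unit standard deviation. The helper name normalize_advantages is hypothetical, and SB3's own code may differ in details (for example, which standard-deviation estimator it uses and the fact that it works on mini-batches).

```python
import statistics


def normalize_advantages(advantages, eps=1e-8):
    """Shift to zero mean and scale to unit standard deviation."""
    mean = statistics.fmean(advantages)
    std = statistics.pstdev(advantages)
    # eps avoids division by zero when all advantages are identical
    return [(a - mean) / (std + eps) for a in advantages]


raw = [10.0, 12.0, 8.0, 30.0]
norm = normalize_advantages(raw)
print([round(x, 3) for x in norm])
```

After normalization, the size of the policy update no longer depends on the reward scale of the environment.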
6. Update the Policy (Optimization Phase)
- PPO makes multiple passes (epochs) over the same batch of data.
- At each pass:
- update the actor using clipped policy gradient,
- update the critic using value loss,
- regularize the policy using entropy.
This is where PPO becomes unique: it uses the same batch of data multiple times, without a replay buffer.
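The three terms listed above are combined into one loss that is minimized on every mini-batch. The sketch below shows the usual way they are weighted; vf_coef and ent_coef are the same arguments that appear in the CartPole script of this tutorial, while the function name ppo_total_loss is hypothetical.

```python
def ppo_total_loss(policy_loss, value_loss, entropy, vf_coef=0.5, ent_coef=0.0):
    """Combine the actor, critic, and entropy terms into a single scalar loss."""
    # Entropy enters with a minus sign: more entropy (more exploration) lowers the loss
    entropy_loss = -entropy
    return policy_loss + vf_coef * value_loss + ent_coef * entropy_loss


# With ent_coef = 0.0 the entropy term is switched off
print(ppo_total_loss(policy_loss=0.1, value_loss=0.4, entropy=0.6))
# A small ent_coef adds a bonus for keeping the policy exploratory
print(ppo_total_loss(policy_loss=0.1, value_loss=0.4, entropy=0.6, ent_coef=0.01))
```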
7. Check KL Divergence to Ensure Stability
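- SB3 measures how much the policy has changed during each update and logs it as approx_kl,
- If the change is too large –> the update is cut short (SB3 does this only when you set target_kl; clipping limits the change in any case),
- If it is too small –> learning is too slow.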
KL divergence is the “temperature” of the agent.
- Too high = fever (instability).
- Too low = numb (not learning).
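The KL check can be sketched in a few lines. The estimator below, mean of (ratio - 1) - log(ratio), is a common low-variance approximation and, to my knowledge, the one behind SB3's approx_kl. The early-stop rule with a 1.5× margin mirrors what SB3 does when target_kl is set, but treat the exact factor as an assumption. The function names are hypothetical.

```python
import math


def approx_kl(old_log_probs, new_log_probs):
    """Estimate KL(old || new) from log-probabilities of the sampled actions."""
    total = 0.0
    for old_lp, new_lp in zip(old_log_probs, new_log_probs):
        log_ratio = new_lp - old_lp
        # (ratio - 1) - log(ratio) is always >= 0 and equals 0 when nothing changed
        total += (math.exp(log_ratio) - 1.0) - log_ratio
    return total / len(old_log_probs)


def should_stop_early(kl, target_kl=None, factor=1.5):
    """Stop the remaining epochs if the policy moved too far (only if target_kl is set)."""
    return target_kl is not None and kl > factor * target_kl


old = [-0.70, -0.65, -0.72]
new = [-0.60, -0.75, -0.70]
kl = approx_kl(old, new)
print("approx_kl:", kl, "stop early:", should_stop_early(kl, target_kl=0.01))
```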
8. Repeat the Loop (Rollout → Update → Rollout → Update)
PPO always alternates between:
- Collection period (plays n_steps in each environment, whether or not the episodes have finished)
- Learning period (optimizes networks)
It’s a continuous loop. PPO never learns “while playing,” but only after a full round of experiences.
ANALOGY
Imagine you want to get better at a video game.
But instead of correcting yourself while you play:
- You play for 10 minutes without stopping,
- When you’re done, you stop and look at everything you’ve done,
- You write down your mistakes and the good things,
- You learn from them,
- You go back to the game and repeat.
That’s exactly what PPO does.
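The collect-then-learn loop also explains how many updates a training run contains. The sketch below counts them, assuming (as SB3 does, to my knowledge) that training always proceeds in whole rollouts until the timestep budget is reached. The function name rollout_schedule is hypothetical; the numbers are the ones used in the CartPole script of this tutorial.

```python
def rollout_schedule(total_timesteps, n_steps, n_envs=1):
    """Count collection/learning rounds needed to reach the timestep budget."""
    rollout_size = n_steps * n_envs
    collected = 0
    n_rollouts = 0
    while collected < total_timesteps:
        collected += rollout_size  # collection phase: play, no learning
        n_rollouts += 1            # learning phase: one policy update per rollout
    return n_rollouts, collected


n_rollouts, collected = rollout_schedule(300_000, n_steps=1024)
print(f"{n_rollouts} rollouts, {collected} timesteps collected")
```

Because the last rollout is always completed, the number of collected timesteps can slightly exceed the requested budget.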
9. Saving and Loading the Model
After training:
model.save("ppo_cartpole")
For demo:
model = PPO.load("ppo_cartpole", env=env)
SB3 only saves the model parameters, not the environment. Therefore, the environment must be created manually at load.
ESSENTIAL PPO HYPERPARAMETERS IN SB3

Essential hyperparameters are those key configurable parameters that control how the algorithm learns and adapts. They influence the stability of training, the speed of convergence, and the overall performance of the reinforcement learning agent.
1. n_steps
n_steps is the number of timesteps that each environment in the vectorized environment collects before PPO makes a policy update.
PPO is an on-policy algorithm, so it cannot reuse experience collected by older versions of the policy.
Instead, before each update, it collects an entire “buffer” of experiences.
n_steps controls:
- The size of the rollout per environment
- How often a PPO update is made
- How much information the algorithm has before changing the policy
- The stability of the advantage estimate (GAE)
- The ratio of bias to variance in the estimates
- How long each update iteration takes
- The quality of the gradient (more data = better gradient)
if n_steps is small
It’s like the child plays for 2 seconds, you stop him and explain: “You made a mistake here, here, here.“
The child:
- doesn’t understand much,
- gets confused,
- does both good and bad,
- has no time to learn a pattern.
if n_steps is large
It’s like letting him play for 2 minutes, then telling him: “Look, in the first 20 seconds you did well, then you did wrong, then you corrected it.”
The child:
- sees the big picture,
- understands what he/she has repeated well,
- makes fewer mistakes,
- makes visible progress.
The more rounds you let it play before correcting it, the better it learns.
The recommended values for n_steps depend 100% on the environment, and their choice is NOT intuitive for most people.
How do I choose n_steps in practice depending on the environment (rule of thumb)
We don’t choose n_steps directly.
We choose rollout_size = n_steps × n_envs. n_steps is only half of the equation.
- rollout_size ≥ 2048 → stable
- rollout_size ≥ 4096 → very stable
- rollout_size ≥ 8192 → excellent for hard problems
- If you have 1 environment → n_steps must be large
- If you have 8 environments → n_steps can be smaller
- If you have 16 environments → n_steps can be very small (64–128)
The equation
n_steps = rollout_size_target / n_envs
Where:
- rollout_size_target = 2048 (minimum)
- rollout_size_target = 4096 (ideal)
- rollout_size_target = 8192 (for continuous control, robotics)
Example 1 – you have only 1 environment
rollout_size = 2048
n_steps = 2048 / 1 = 2048
Example 2 – you have 4 environments in parallel (DummyVecEnv)
rollout_size = 4096
n_steps = 4096 / 4 = 1024
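The same arithmetic as a small helper, in case you want to compute n_steps for several setups at once. The function name n_steps_for is hypothetical; the targets are the ones listed above.

```python
def n_steps_for(rollout_size_target, n_envs):
    """n_steps = rollout_size_target / n_envs, rejecting uneven splits."""
    if rollout_size_target % n_envs != 0:
        raise ValueError("rollout_size_target must be divisible by n_envs")
    return rollout_size_target // n_envs


for target in (2048, 4096, 8192):
    for n_envs in (1, 4, 8, 16):
        print(f"target={target:5d}  n_envs={n_envs:2d}  n_steps={n_steps_for(target, n_envs)}")
```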
Why did I use n_steps = 1024 in the script for CartPole?
For this tutorial, I run PPO on the CartPole environment. In this practical example, I use a single environment. If I were to follow the rule above for selecting n_steps, I should use:
n_steps = 2048 (minimum) / 1 (environment) = 2048
But I use n_steps = 1024. The explanation is that CartPole has very short episodes, so:
- 1024 is enough,
- it is faster in practice,
- it is a reasonable value for small, fast, discrete environments,
- and it is a good compromise between stability and training time.
2. batch_size
In PPO, batch_size is the number of samples (transitions) used simultaneously in an optimization step of the actor and critic networks.
Imagine that PPO has a notebook with 1024 notes (the rollout).
batch_size = how many notes he studies at once.
- If you give him 4 notes at once → he studies chaotically, because there are too few,
- If you give him 1024 at once → it’s too much and he gets confused.
batch_size = the perfect size of pages he studies at each lesson.
PPO does:
- data collection → rollout buffer,
- splits data into mini-batches,
- optimizes actor and critic on each mini-batch,
- repeats over multiple epochs (n_epochs).
batch_size = “chunks into which we divide the experience for learning”.
What influences batch_size in PPO?
Batch size influences network updates (not data collection).
1. Gradient stability
- Small batch → noisy gradient → unstable updates
- Large batch → stable gradient → slow but reliable updates
2. Training speed
- Small batch → many updates → slower training
- Large batch → fewer updates → faster training
3. Generalization capacity
- Large batch = smoother, more robust
- Small batch = can easily overfit on few transitions
4. Effect on policy (actor)
- Small batch → actor jumps “suddenly” in wrong directions
- Large batch → actor moves “smoothly”
5. Effect on critic (value network)
- Small batch → unstable value loss
- Large batch → stable value loss → better calculated advantages
The critical link between batch_size and rollout_size
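Important rule: batch_size should be a divisor of rollout_size. Otherwise the last mini-batch of each epoch is smaller than the others (SB3 prints a warning in this case).
If rollout_size = 2048:
- batch_size = 32 → ok
- batch_size = 64 → ok
- batch_size = 128 → ok
- batch_size = 100 → not good (2048 is not divisible by 100)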
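A quick way to test a candidate batch_size before training is to check how the rollout splits. This is a plain-Python sketch; the helper name check_batch_size is hypothetical.

```python
def check_batch_size(rollout_size, batch_size):
    """Report how a rollout splits into mini-batches of the given size."""
    full, remainder = divmod(rollout_size, batch_size)
    return {
        "full_minibatches": full,        # mini-batches with exactly batch_size samples
        "leftover_samples": remainder,   # samples that end up in a smaller last mini-batch
        "divides_evenly": remainder == 0,
    }


for bs in (32, 64, 100, 128):
    print(bs, check_batch_size(2048, bs))
```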
How to choose batch_size in practice?
Rule 1
batch_size = between 32 and 256
For PPO, almost anything in this range works well.
Rule 2
batch_size ≈ rollout_size / 4
This is the favorite formula in SB3 Zoo.
Rule 3
- small batch_size (32–64) → the rate at which the agent learns from each piece of experience is fast, but unstable
- large batch_size (128–256) → slow, but very stable pace
Rule 4
Prefer a power of 2 (32, 64, 128, 256), which suits GPU vectorization and splits the rollout evenly.
Example 1: CartPole-v1
- rollout_size = 1024
- batch_size = 64
This results in 16 mini-batch updates per epoch (1024 / 64) → stable and fast learning.
Example 2: MuJoCo Hopper-v4
- rollout_size = 4096
- batch_size = 256
The result is that PPO is very stable in continuous control.
3. n_epochs
n_epochs controls how many times PPO traverses the entire rollout buffer during a single policy update.
Imagine you have a notebook with 1024 notes about how you played a game.
n_epochs means how many times you read that notebook to learn better.
- If you read it only once → you will forget most of it.
- If you read it 5 times → you will learn well.
Specifically:
- PPO collects a rollout of rollout_size transitions.
- Divides the rollout into mini-batches (batch_size).
- Iterates through all mini-batches n_epochs times.
- Finally, only then does it move on to the next rollout.
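The interaction between rollout_size, batch_size, and n_epochs can be sketched as a generator that yields mini-batch indices. It is a simplified illustration, not SB3's code; the name iterate_minibatches and the seed argument are hypothetical. The values match the CartPole settings used in this tutorial.

```python
import random


def iterate_minibatches(rollout_size, batch_size, n_epochs, seed=0):
    """Yield (epoch, indices) for every gradient step of one PPO update."""
    rng = random.Random(seed)
    for epoch in range(n_epochs):
        indices = list(range(rollout_size))
        rng.shuffle(indices)  # new random order in every epoch
        for start in range(0, rollout_size, batch_size):
            yield epoch, indices[start:start + batch_size]


batches = list(iterate_minibatches(rollout_size=1024, batch_size=64, n_epochs=10))
print("gradient steps per PPO update:", len(batches))
```

Each epoch sees every transition exactly once, so the number of gradient steps per update is n_epochs × (rollout_size / batch_size).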
What influences n_epochs?
n_epochs only affects the update (learning) phase, not the data collection phase.
1. Learning stability
- too few epochs → the agent learns too little from the rollout
- too many epochs → the agent “exaggerates” the updates, risk of policy collapse
2. Gradient quality
- more epochs = better gradient = “clearer” actions
- too many epochs = gradient overfits to the same rollout
3. KL-divergence
- large n_epochs → KL grows too fast → the policy drifts further from the old policy than clip_range is meant to allow
4. Training time
- large n_epochs = slower training
- small n_epochs = fast but weak training
How do I choose n_epochs in practice?
Basic rule: n_epochs between 3 and 10
Why? Because PPO is designed for multiple moderate passes, not extreme ones.
Official SB3 Zoo Recommendations
- Simple discrete environments → 10 epochs
- Complex environments → 5–8 epochs
- Continuous control → 3–5 epochs
A general formula for choosing
If rollout_size is small → you need more epochs. Because you have little data and you need to “squeeze” the information out of it.
If rollout_size is large → you need fewer epochs. Because you have a lot of data anyway.
Example 1 – CartPole-v1 (rollout_size = 1024)
n_epochs: recommended value is 10
Reason:
- simple data
- small rollout
- quick learning
- critic is easy to train
Example 2 – MuJoCo Hopper-v4 (rollout_size = 4096 or 8192)
n_epochs: recommended value is 5
Reason:
- continuous control
- unstable dynamics
- too many updates ruin the policy
4. learning_rate
learning_rate controls the size of the update steps the algorithm takes during actor and critic optimization.
At each update, PPO changes the neural network parameters by a step proportional to learning_rate.
- High LR → fast, aggressive changes
- Low LR → slow, smooth, precise changes
Imagine that PPO paints a drawing on a sheet of paper.
- If it draws too large lines (high learning rate) → the drawing becomes ugly and broken.
- If it draws very small lines (low learning rate) → it advances with the drawing, but very slowly.
The perfect line size must be found:
- not too thick,
- not too thin.
Between 1e-4 and 3e-4 is the “perfect” size for a simple drawing like CartPole.
What influences learning_rate in PPO?
Learning rate influences the most critical aspects of PPO:
1. Training stability
- LR too high → agent jumps from one strategy to another → reward oscillates → explosive KL
- LR too low → agent learns, but very slowly
2. clip_range
PPO has clip_range to limit policy changes.
- If LR is too high → the policy keeps pushing past the clip limits → many samples stop contributing to the update → chaotic learning.
- If LR is too low → changes are too small → agent stagnates.
3. KL-divergence
Learning rate is the engine that pushes KL up.
- High LR → high KL → “policy collapse”
- Low LR → low KL → safe but slow learning
4. Quality of advantage updates (actor & critic)
High learning rates can make:
- the actor unstable
- the critic inconsistent → GAE becomes noisy
5. Sensitivity to environment complexity
- Simple envs → can tolerate higher LR
- Complex envs → need lower LR for stability
How do I choose learning_rate in practice?
PPO is very sensitive to LR. To fully understand how to choose the correct learning rate in PPO and RL in general, how to test the optimal LR quickly, and practical examples – I wrote a dedicated tutorial: The Complete Guide to Learning Rate in Reinforcement Learning.
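One option worth knowing about, although the script in this tutorial uses a constant value, is a learning rate that decays during training. SB3 accepts a function of the remaining progress (from 1.0 at the start to 0.0 at the end) in place of a fixed number. The sketch below is a minimal linear decay; the helper name linear_schedule is hypothetical.

```python
def linear_schedule(initial_lr):
    """Return a function mapping remaining progress (1.0 -> 0.0) to a learning rate."""
    def schedule(progress_remaining):
        # Full learning rate at the start, shrinking linearly to 0 at the end
        return progress_remaining * initial_lr
    return schedule


lr_fn = linear_schedule(3e-4)
for progress in (1.0, 0.5, 0.1):
    print(f"progress_remaining={progress:.1f}  lr={lr_fn(progress):.6f}")

# Possible usage with SB3 (an assumption, not part of this tutorial's script):
# model = PPO("MlpPolicy", env, learning_rate=linear_schedule(3e-4))
```

Starting high and ending low combines fast early progress with small, careful updates near the end of training.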
5. gamma
gamma is the discount factor used in calculating returns and advantages in RL. The main technical role for gamma is to determine how much the agent takes into account future rewards compared to immediate ones.
Imagine playing a game where you get points:
- 10 points now
- 10 points in 5 seconds
- 10 points in 10 seconds
gamma tells you how much you care about future points.
What influences gamma in PPO?
1. The agent’s horizon
gamma determines how far the agent looks into the future.
- High gamma (0.99–0.995) → looks far into the future
- Low gamma (0.90–0.95) → looks only at immediate rewards
2. Sensitivity of the critic
The critic in PPO is responsible for estimating the values.
- High gamma → difficult to optimize critic
- Low gamma → stable critic, but loses performance
3. Quality of Advantages (GAE)
GAE combines gamma and lambda.
- High gamma → advantages tend to be noisy
- Low gamma → advantages become stable, but long-term performance suffers
4. Final Performance
gamma is one of the 3 parameters that can completely “destroy” learning:
- gamma
- learning_rate
- clip_range
5. Exploration vs. policy stability
- High gamma → agent seeks better long-term strategies
- Low gamma → agent wins quickly but does not become optimal
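To put numbers on the "points now vs. points later" picture, the sketch below discounts a reward of 10 received after a given number of steps. It also prints the common rule-of-thumb horizon 1 / (1 - gamma), which is not from this tutorial but is a standard approximation. Function names are hypothetical, and "seconds" from the analogy are treated as environment steps.

```python
def discounted_value(reward, delay_steps, gamma):
    """Present value of a reward received delay_steps in the future."""
    return reward * gamma ** delay_steps


def effective_horizon(gamma):
    """Rough number of future steps that still matter to the agent."""
    return 1.0 / (1.0 - gamma)


for gamma in (0.90, 0.99):
    values = [round(discounted_value(10, k, gamma), 2) for k in (0, 5, 10)]
    print(f"gamma={gamma}: 10 points now / in 5 steps / in 10 steps -> {values}, "
          f"horizon ~ {effective_horizon(gamma):.0f} steps")
```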
To fully understand the role of gamma and how it influences Q-Learning, PPO and other algorithms, I wrote a complete tutorial: Discount Factor Explained – Why Gamma (γ) Makes or Breaks Learning (Q-Learning + CartPole Case Study).
6. gae_lambda
gae_lambda (GAE-λ) is the parameter that controls how long the credit assignment horizon is when calculating advantages. PPO does not use plain one-step or Monte Carlo advantages, but an estimator called Generalized Advantage Estimation (GAE), which blends the two.
By changing λ, you can go from:
– short, stable, more biased estimates
to
– long, accurate, but noisy estimates
Imagine PPO remembering everything he did in the game to learn. gae_lambda says how much he has to remember.
if λ = 1 –> It’s like a child who remembers every move of the last 10 minutes. He is very smart, but starts to get tired and makes mistakes (high noise).
if λ = 0 –> It’s like he only remembers what he did in the last 2 seconds. He’s not tired, but he has no idea what the long-term game is.
if λ = 0.95 –> It’s the perfect balance: “I remember enough things, but I don’t overload myself unnecessarily!”
The GAE formula combines gamma (γ) and lambda (λ) in a way that controls the trade-off between:
- variance (noise in the estimates)
- bias (how far it is from the true value)
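For reference, the standard GAE definition (from the GAE paper by Schulman et al., not spelled out in this tutorial) can be written as follows, where V is the critic's value estimate and r_t is the reward at step t:

```latex
\begin{aligned}
\delta_t &= r_t + \gamma V(s_{t+1}) - V(s_t) \\
\hat{A}_t &= \sum_{l=0}^{\infty} (\gamma\lambda)^{l} \, \delta_{t+l}
\end{aligned}
```

With λ = 0 the sum keeps only δ_t, the one-step estimate (low variance, more bias). With λ = 1 it becomes the discounted sum of all future rewards minus V(s_t) (low bias, high variance).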
What influences gae_lambda?
1. How far into the future does the advantage look
- λ = 1 → looks very far (long horizon)
- λ = 0 → looks very little into the future (short horizon)
2. Estimation noise (variance)
- Large λ → large variance (larger noise)
- Small λ → small variance
3. Estimation bias
- Large λ → small bias → more precise estimates
- Small λ → large bias → approximate but stable estimates
4. Stability of the critic
- GAE is sensitive to the dynamics of the critic:
- large λ = sensitive critic, sometimes unstable
- small λ = stable critic, but suboptimal
5. Quality of the policy gradient
- GAE produces the advantages used in the actor update.
- λ correct → stable gradient
- λ wrong → noisy or poorly calibrated gradient
How do I choose gae_lambda in practice?
Rule 1
gae_lambda = 0.95
This is the most widely used combination:
- gamma = 0.99
- lambda = 0.95
This pair is used in:
- the PPO paper (MuJoCo experiments)
- OpenAI Baselines (ppo2 defaults)
- Stable-Baselines3 (PPO defaults)
We can say that it is the de facto standard value for PPO.
Rule 2 – if the environment is very noisy
lambda = 0.90
- smoother advantages
- very stable critic
- but agent loses long-term vision
Rule 3 – If the environment is continuous, long, robotic
lambda = 0.97–0.99
- longer horizon advantages
- more robust policy
- but requires good models and large n_steps
Example 1 — CartPole-v1
gae_lambda = 0.95
Reason:
- simple environment
- stable critic
- short episodes
- clear reward signal
- SB3 standard
Example 2 — Hopper-v4 (MuJoCo)
gae_lambda = 0.97
Reason:
- continuous control
- dense reward
- long episodes
- long-term benefits are important
- sophisticated critic
Why did I choose gae_lambda = 0.95 for CartPole?
1. It is the recommended value in the original PPO (Schulman 2017)
Combined with γ=0.99, it works perfectly.
2. CartPole has short episodes (max 500 timesteps)
- large λ (above 0.97) does not bring benefits
- small λ (below 0.90) cuts too much from the future
3. The critic in CartPole is easy to train
λ=0.95 does not cause instability
7. clip_range
clip_range is the parameter that defines how much the policy is allowed to change between two updates.
PPO uses a “clipped policy objective”, meaning it limits the ratio between the new policy and the old policy:
ratio = π_new(a|s) / π_old(a|s)
PPO rewards improvements only within the range:
1 - clip_range ≤ ratio ≤ 1 + clip_range
If the ratio moves outside this interval in the direction the advantage is pushing it, the objective goes flat for that sample and its gradient is zero, so the update is effectively blocked.
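The clipped objective for a single sample fits in three lines. The sketch below follows the formula from the PPO paper, min(ratio × A, clip(ratio) × A); the function name clipped_surrogate is hypothetical.

```python
def clipped_surrogate(ratio, advantage, clip_range=0.2):
    """PPO objective for one sample (to be maximized)."""
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    # Taking the minimum makes the objective pessimistic: no extra credit for
    # pushing the ratio beyond the allowed interval
    return min(ratio * advantage, clipped_ratio * advantage)


# Good action (advantage > 0): the gain is capped once the ratio exceeds 1 + clip_range
print(clipped_surrogate(ratio=1.5, advantage=1.0))
# Bad action (advantage < 0): the gain is capped once the ratio drops below 1 - clip_range
print(clipped_surrogate(ratio=0.5, advantage=-1.0))
# A move in the wrong direction is not clipped, so it is fully penalized
print(clipped_surrogate(ratio=1.5, advantage=-1.0))
```

Wherever the cap is active, the objective no longer depends on the ratio, which is why those samples stop contributing to the gradient.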
Imagine that PPO is a child trying to learn to balance on a stick.
On each attempt, someone tells him:
“You are only allowed to change your movements A LITTLE from how you did them before.”
clip_range tells how much he is allowed to change the movement.
What influences clip_range?
1. Policy stability
- clip_range = “how aggressive you are allowed to be“,
- small range → small, controlled changes,
- large range → large, aggressive changes → risk of instability.
2. KL divergence
clip_range directly and strongly influences KL:
- small clip_range → KL is small and stable
- large clip_range → KL explodes easily
3. Long-term learning
If clip_range is too small:
- the agent learns very slowly
- updates are blocked frequently
If clip_range is large:
- the agent can learn quickly,
- but you risk “policy collapse“
4. Sensitivity to the values of advantages
If the advantages are large, a large clip_range produces huge changes in the actor’s policy. A small clip_range keeps the actor “under control“.
5. Learning_rate compatibility
Learning_rate and clip_range work as a pair:
- High LR + low clip_range = stable
- Low LR + high clip_range = too slow
- High LR + high clip_range = unstable → policy collapse
How do I choose clip_range in practice?
In PPO, clip_range is a very sensitive parameter, but at the same time easy to choose.
The standard is: clip_range = 0.2.
It is the value used in the original PPO paper (Schulman et al., 2017).
Other values used in practice
Simple discrete environments: clip_range = 0.1 – 0.2. Good for CartPole, MountainCar, Acrobot, LunarLander.
Continuous and unstable environments: clip_range = 0.1. Used in MuJoCo, PyBullet, robotics for stability.
Fast and low-noise environments: clip_range = 0.2 – 0.3. Atari, games with large discrete actions.
General rule of thumb
- if reward oscillates → decrease clip_range
- if policy learns too slowly → increase clip_range
- if KL diverges → decrease clip_range immediately
Example 1 — CartPole-v1
clip_range = 0.2
Reason:
- low noise
- discrete actions
- clear reward
- simple transitions
- policy learns quickly and stably
Example 2 — Hopper-v4 (MuJoCo)
clip_range = 0.1
Reason:
- continuous control
- unstable dynamics
- dense reward
- small deviations lead to chaotic behavior
- small range = high stability
PRACTICAL EXAMPLE: PPO ON CartPole-v1 WITH Gymnasium AND SB3
CartPole is a classic control environment. It is an environment where a cart moves left and right and has an unstable pole above it.
The goal is to train an agent to keep the pole standing as long as possible.
Reward:
- +1 for each timestep in which the pole does not fall
- The episode ends when the pole falls or the maximum time runs out
Why it is a standard benchmark:
- Short episode,
- Simple dynamics,
- Fast learning,
- Ideal for understanding the main mechanisms of PPO.
Why Use PPO on CartPole?
PPO is stable, general-purpose, and easy to use. CartPole is simple enough that PPO converges quickly.
It is an excellent environment for:
- understanding the PPO workflow,
- interpreting graphs,
- testing hyperparameters,
- introductory tutorials in RL with SB3.
In short, it is the perfect example to properly understand PPO before more complex environments.
Training Script (Ready to Run)
The Python script below trains and demos PPO on CartPole. The logs are saved in a logs/ folder (relative to the directory you run the script from; PPO/logs/ in my setup). The saved model can be run in the demo.
"""
PPO Training and Demo Script for CartPole
-----------------------------------------------
Train or test a PPO agent using Stable Baselines3.

TRAINING EXAMPLES:
    python ppo_cartpole.py --train

DEMO EXAMPLE:
    python ppo_cartpole.py --demo --model "YOUR_PATH/PPO_CartPole_lr_0.0003_seed_1_20251117-113106/PPO_CartPole_lr-0.0003_seed-1_20251117-113106_300000steps.zip"

Author: Calin Dragos George
Updated: 17 November 2025
"""
import argparse
import os
import time

import gymnasium as gym
import numpy as np
import torch
from stable_baselines3 import PPO
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.logger import configure


# ---------------------------------------------------------
# Utility: Set seeds for reproducibility
# ---------------------------------------------------------
def set_seed(seed):
    """Ensure reproducible results across NumPy, PyTorch and CUDA."""
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


# ---------------------------------------------------------
# Create CartPole Environment
# ---------------------------------------------------------
def make_env(render=False):
    """Create CartPole environment. Render only in demo mode."""
    if render:
        env = gym.make("CartPole-v1", render_mode="human")
    else:
        env = gym.make("CartPole-v1")
    # Monitor records episode reward and length for ep_rew_mean / ep_len_mean
    env = Monitor(env)
    return env


# ---------------------------------------------------------
# Create PPO Model (customized for CartPole)
# ---------------------------------------------------------
def create_model(env, lr, log_dir, seed=None):
    """
    Create the PPO agent with a stable and recommended set of hyperparameters
    for CartPole-v1.
    """
    model = PPO(
        "MlpPolicy",
        env,
        learning_rate=lr,
        n_steps=1024,
        batch_size=64,
        gamma=0.99,
        gae_lambda=0.95,
        ent_coef=0.0,
        clip_range=0.2,
        vf_coef=0.5,
        max_grad_norm=0.5,
        n_epochs=10,
        verbose=1,
        tensorboard_log=log_dir,
        seed=seed,  # lets SB3 seed its own RNGs and the environment
    )
    # Configure TensorBoard logger
    logger = configure(log_dir, ["tensorboard"])
    model.set_logger(logger)
    # Log hyperparameters for later inspection
    model.logger.record("hyperparams/learning_rate", lr)
    model.logger.record("hyperparams/n_steps", 1024)
    model.logger.record("hyperparams/ent_coef", 0.0)
    return model


# ---------------------------------------------------------
# Train PPO
# ---------------------------------------------------------
def train_ppo(lr, timesteps, seed):
    set_seed(seed)
    env = make_env()
    timestamp = time.strftime("%Y%m%d-%H%M%S")
    log_dir = f"logs/PPO_CartPole_lr_{lr}_seed_{seed}_{timestamp}"
    os.makedirs(log_dir, exist_ok=True)
    print("\n Training PPO on CartPole-v1")
    print(f"→ Learning Rate: {lr}")
    print(f"→ Seed: {seed}")
    print(f"→ Logging to: {log_dir}\n")
    model = create_model(env, lr, log_dir, seed)
    # Train PPO model
    model.learn(total_timesteps=timesteps, progress_bar=True)
    # Save model
    model_filename = f"PPO_CartPole_lr-{lr}_seed-{seed}_{timestamp}_{timesteps}steps.zip"
    model_path = os.path.join(log_dir, model_filename)
    model.save(model_path)
    print(f"\n Model saved to: {model_path}\n")
    env.close()


# ---------------------------------------------------------
# Demo PPO Model
# ---------------------------------------------------------
def run_demo(model_path, episodes=1):
    """Load a trained model and run visual evaluation episodes."""
    if not os.path.exists(model_path):
        print(f"\n Model not found: {model_path}\n")
        return
    print(f"\n Running Demo: {model_path}\n")
    env = make_env(render=True)
    model = PPO.load(model_path)
    for ep in range(episodes):
        obs, _ = env.reset()
        done = False
        total_reward = 0
        while not done:
            # Use deterministic actions for evaluation stability
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, terminated, truncated, _ = env.step(action)
            total_reward += reward
            done = terminated or truncated
        print(f"Episode {ep+1} Reward = {total_reward}")
    env.close()


# ---------------------------------------------------------
# CLI with automatic multi-seed training
# ---------------------------------------------------------
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--train", action="store_true")
    parser.add_argument("--demo", action="store_true")
    parser.add_argument("--lr", type=float, default=3e-4)
    parser.add_argument("--timesteps", type=int, default=300_000)
    # If None => automatic multi-seed training
    parser.add_argument("--seed", type=int, default=None)
    parser.add_argument("--model", type=str, default=None)
    parser.add_argument("--episodes", type=int, default=3)
    args = parser.parse_args()

    # ---------------------------------------------------------
    # MULTI-SEED LOGIC
    # ---------------------------------------------------------
    if args.train:
        if args.seed is None:
            # Run multiple seeds automatically
            seeds = [1, 2, 3]
            print(f"\n Running PPO with automatic multi-seed training: {seeds}\n")
            for s in seeds:
                print(f"\n========== TRAINING SEED {s} ==========\n")
                train_ppo(args.lr, args.timesteps, s)
        else:
            # Single-seed run
            train_ppo(args.lr, args.timesteps, args.seed)
    elif args.demo:
        if args.model is None:
            print("\n ERROR: missing --model path\n")
        else:
            run_demo(args.model, args.episodes)
    else:
        print("\n Please specify --train or --demo.\n")
What You Should Expect to See
- The reward increases to 450–500,
- Episodes are getting longer and longer,
- KL divergence is small and stable.
In the demo, the pole stays in balance almost all the time.

THE THREE METRICS TO KNOW WHETHER PPO IS LEARNING
In PPO there are three graphs that tell you whether the agent is learning correctly or not:
- ep_rew_mean – how well the agent plays on average,
- ep_len_mean – how long it manages to keep the episode “alive”,
- approx_kl – how stable the PPO updates are.
If these three graphs look good, you can confidently say that the PPO training is healthy.

1. ep_rew_mean – The Learning Curve
What it shows:
- Average reward per episode,
- It is the most important graph in PPO.
How do you interpret it:
- Rising curve = agent is learning,
- Flattening curve = agent has reached its limit,
- Falling curve = policy becomes unstable.
In the above chart, the line rises nicely from ~50 to ~420+, just as it should. This means that the PPO model learns correctly and consistently.
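It helps to know that ep_rew_mean and ep_len_mean are rolling averages over the most recent episodes, not single-episode values. The sketch below reproduces that idea with a fixed-size window. The window of 100 episodes matches SB3's default as far as I know, but treat it as an assumption; the class name RollingEpisodeStats is hypothetical.

```python
from collections import deque


class RollingEpisodeStats:
    """Keep the last `window` episodes and report their mean reward and length."""

    def __init__(self, window=100):
        # deque with maxlen drops the oldest episode automatically
        self.rewards = deque(maxlen=window)
        self.lengths = deque(maxlen=window)

    def add_episode(self, total_reward, length):
        self.rewards.append(total_reward)
        self.lengths.append(length)

    @property
    def ep_rew_mean(self):
        return sum(self.rewards) / len(self.rewards) if self.rewards else float("nan")

    @property
    def ep_len_mean(self):
        return sum(self.lengths) / len(self.lengths) if self.lengths else float("nan")


stats = RollingEpisodeStats(window=2)
# In CartPole the reward is +1 per timestep, so reward and length are equal
for episode_length in (10, 20, 30):
    stats.add_episode(total_reward=float(episode_length), length=episode_length)
print(stats.ep_rew_mean, stats.ep_len_mean)
```

Because CartPole gives +1 per timestep, the two curves have the same shape in this environment; in other environments they can diverge.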
2. ep_len_mean – Episode Length Curve
ep_len_mean shows how long an episode lasts before the pole falls in CartPole.
How do you interpret it:
- Increases → the agent manages to balance the pole longer,
- Stabilizes → the agent reaches maximum performance,
- Decreases → the agent destabilizes.
In the graph we see:
- Increasing curve up to ~420 timesteps,
- Very good for CartPole (the maximum is 500).
The agent learns to balance the pole longer and longer, a clear sign that the policy is improving.
3. approx_kl – The PPO Stability Signal
approx_kl shows how much the policy changes between updates. PPO is sensitive to changes that are too large (that’s why clip_range exists).
KL divergence must:
- be small,
- be relatively stable,
- not explode.
How to interpret it:
- Small and constant KL → PPO is stable,
- Exploding KL → PPO ruins the policy,
- Almost zero KL → PPO does not learn (blockage).
What we see in the graph:
- The KL is small (between 3e-3 and 7e-3),
- slightly decreasing,
- no big jumps or unstable spikes.
Exactly the ideal behavior for PPO.
What would the graphs look like if PPO didn’t learn?
- ep_rew_mean: flat or zigzag,
- ep_len_mean: increases and then decreases ⇒ unstable,
- approx_kl: explodes ⇒ policy changes too suddenly,
- approx_kl too small ⇒ PPO makes no changes, is “frozen”.
PPO IN REAL PROBLEMS AND IN ROBOTICS
1. PPO is used in controlling a 3–6 DOF robotic arm
Tasks such as reaching, picking, placing, tracking.
Popular simulators:
- MuJoCo
- PyBullet
- Isaac Gym
Why PPO works well:
- Arm movements are continuous control, PPO handles continuous actions well,
- It is stable and does not need a replay buffer,
- GAE (advantage estimation) helps it learn fine movements.
2. Legged Robots & Locomotion
PPO is used in training legged robots:
- ANYmal (quadruped)
- Cassie (biped)
- Simulated humanoids in MuJoCo / Isaac Gym
Why PPO works well:
- PPO can learn complex policies with many actions in parallel,
- Excellent stability when you have many vectorized environments,
- Scalable: you can run 1024–4096 envs in parallel in GPU simulators.
3. Mobile Robots & Navigation
PPO is used for autonomous navigation with:
- LiDAR
- cameras
- encoders
- simple sensors
Speed and direction control in continuous action spaces.
PPO works well because:
- Policies are continuous and require fine-tuning,
- The algorithm is fairly robust to noisy rewards (sparse rewards usually still need some reward shaping),
- It can learn in a large state space (map-based or sensor-based).
4. Games & Continuous Control Benchmarks
PPO is used in:
- Atari (in combination with CNN)
- MuJoCo suite
- Gymnasium continuous tasks (Pendulum, MountainCarContinuous)
SUMMING UP
Proximal Policy Optimization (PPO) is one of the most stable and practical algorithms in the entire RL family.
Stable-Baselines3 makes it extremely easy to use: you create the environment, define the model, and start training. The rest is handled internally by SB3 (GAE, clipping, rollouts, batch update).
What did you learn in this tutorial
1. Complete PPO Workflow in SB3
- Create a Gymnasium environment and apply Monitor.
- Instantiate PPO with “MlpPolicy”.
- Agent collects experiences in rollouts (n_steps × n_envs).
- Calculate advantages with GAE.
- Normalize advantages.
- Optimize actor and critic with PPO clipping.
- Monitor stability with KL divergence.
- Save and load model for demo.
2. Which hyperparameters are really important?
- n_steps, batch_size, n_epochs → determine the pace and stability of learning,
- learning_rate, gamma, gae_lambda, clip_range → control the fine-tuning of the agent’s behavior,
- The configuration used is standard and recommended for CartPole.
3. How do you know if PPO is learning
The three essential graphs:
- ep_rew_mean increases → the policy gets better.
- ep_len_mean increases → the agent keeps the pole balanced longer.
- approx_kl small and stable → healthy and controlled updates.
4. Practical examples in robotics and real-world applications
- Control of robotic arms (3–6 DOF).
- Locomotion on legged robots.
- Mobile navigation with LiDAR and cameras.
- PPO is used extensively in simulators such as MuJoCo, Isaac Gym, PyBullet.
PPO + SB3 = the easiest, most stable, and most reproducible way to learn RL professionally.
CartPole is just the beginning. The same code structure scales directly to robotics, continuous control, and GPU simulations with thousands of environments in parallel.
If you understand workflow, hyperparameters, and the three key graphs, then you understand 80% of what really matters in PPO.
FAQ — PPO with Stable-Baselines3
1. Why is PPO more stable than other RL algorithms?
PPO uses a clipped policy objective, which limits how much the policy can change with each update, and SB3 reports the approximate KL divergence so you can monitor it.
This mechanism prevents the “big jumps” that destabilize learning and makes PPO much more robust than vanilla policy gradient methods such as REINFORCE or classic actor-critic.
2. What values should I use for hyperparameters in PPO?
Beginners should start from the SB3 defaults, which are a solid, stable baseline:
- n_steps = 2048 (this tutorial uses 1024 for CartPole)
- batch_size = 64
- learning_rate = 3e-4
- gamma = 0.99
- gae_lambda = 0.95
- clip_range = 0.2
These defaults match the MuJoCo settings reported in the original PPO paper; for other environments, the tuned configurations in the SB3 Zoo are a good reference.
3. How can I check if PPO is learning correctly?
Look at 3 graphs:
- ep_rew_mean: should increase
- ep_len_mean: should increase (for CartPole)
- approx_kl: should remain small and stable
If all three look good, then PPO is stable and learning. If one of them looks bad, something is wrong with the environment, hyperparameters, or reward.
4. Why does PPO work well in simulation but not in the real world (robotics)?
For robotics, the gap between simulation and reality (sim2real gap) is large.
For successful transfer you need:
- domain randomization
- noise injection into observations
- friction / mass randomization
- SDE-based exploration (use_sde=True) for continuous control
- observation normalization
Without these techniques, the PPO policy trained in simulation does not generalize to reality.





