If the observations are not normalized, your agent will learn with the brakes on – whether you use PPO, SAC or DQN. In this tutorial, you will see exactly why, how to normalize correctly and how to stabilize your training.
Who is this tutorial for?
This tutorial is for:
- Those who are new to RL/ML and are encountering feature scaling for the first time,
- Practitioners who want a stable pipeline for PPO/DQN/SAC,
- Those who have models that do not converge or converge very slowly,
- Those who work in robotics and want consistent observations for LiDAR/IMU/encoders,
- Those who suspect that the instability comes from faulty scaling.

TABLE OF CONTENTS
In this tutorial, I’ll cover the following subjects:
- INTRODUCTION
- REAL-WORLD FAILURE SYMPTOMS IN ML & RL
- THE MIN-MAX NORMALIZATION FORMULA EXPLAINED
- SCALING TO CUSTOM RANGES: THE GENERAL [a, b] MAPPING
- HOW TO CHOOSE xmin AND xmax (The Only Correct Way in RL)
- WHEN MIN-MAX HELPS AND WHEN IT BREAKS YOUR MODEL
- PRACTICAL CHECKLIST FOR CHOOSING THE RIGHT NORMALIZATION
- PRACTICAL EXPERIMENTS: TRAINING WITH vs. WITHOUT NORMALIZATION
- COMMON QUESTIONS DEVELOPERS ASK ABOUT MIN-MAX SCALING
INTRODUCTION
Scaling is a hidden “hyperparameter”. The choice between min-max, z-score, robust scaling, etc. can influence the model more than a small tuning of learning rate or number of neurons.
Why is it important in RL?
The agent sees the world in numbers. If one number is much larger than the others, the agent thinks it is the most important – even if it is not.
What min-max does is take the smallest number and the largest number in a list and make all the numbers fit between 0 and 1.
A small error can break the whole pipeline: if you fit the scaler on the entire dataset, information from the validation data leaks into training, which gives “magical” results in validation and catastrophic ones in production.
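To make the leakage point concrete, here is a minimal sketch in plain Python. The split names and numbers are made up for illustration, and `fit_min_max` / `min_max` are hypothetical helpers, not part of any library.

```python
def fit_min_max(values):
    """Return (x_min, x_max) computed from the given values only."""
    return min(values), max(values)


def min_max(x, x_min, x_max):
    """Scale x with bounds that were fitted elsewhere."""
    return (x - x_min) / (x_max - x_min)


train = [2.0, 4.0, 6.0, 8.0]   # hypothetical training values
validation = [5.0, 10.0]       # hypothetical validation values

# Correct: the bounds come from the training split only.
x_min, x_max = fit_min_max(train)
val_scaled = [min_max(x, x_min, x_max) for x in validation]

# Leaky: the bounds are fitted on train + validation together,
# so validation statistics shape the scaler before evaluation.
leaky_min, leaky_max = fit_min_max(train + validation)

print(val_scaled)             # the second value lands above 1
print(leaky_min, leaky_max)   # the leaky maximum comes from validation data
```

With the correct fit, a validation value outside the training range shows up as a number above 1, which is an honest signal. The leaky fit hides that signal.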

Simple analogy
Think of your agent as a student who gets grades from 1 to 10. If he suddenly gets a ‘score’ of 5000, he will think that 5000 is the most important, even if it is just a different type of measurement. Normalization transforms all the grades into the same system.
REAL-WORLD FAILURE SYMPTOMS IN ML & RL
Without normalization, Reinforcement Learning (RL) models can behave erratically and train unstably. Here’s what it looks like in the real world: jumping rewards, conflicting policies, exploding gradients, and zero convergence.
In robotics, the effect is even clearer: a sensor with a large range completely dominates the rest of the observations – the agent learns unbalanced behavior.
1. Unstable rewards
What happens when the reward becomes unstable?
- the reward increases, decreases, explodes for no reason,
- two consecutive episodes are not similar.
These two symptoms occur when the scaling is wrong: some observations have much larger ranges than others and act as “hot spots” that dominate the gradient.
2. Oscillatory policies
An oscillatory policy is another symptom of incorrect normalization.
This leads the agent to:
- learn a behavior → lose it → learn another → lose it,
- appear erratic in its actions.
These behaviors are closely related to normalization. If the distribution of observations shifts suddenly (because the ranges differ so much), the policy is pushed to a different place at each batch, and it oscillates.
3. Gradient explosion
What does gradient explosion mean?
- loss becomes NaN,
- updates become huge,
- neural network collapses.
The reason is again related to normalization. Neural networks are extremely sensitive to input scale: large observation values produce enormous gradients, and PPO/DQN/SAC training breaks down.
4. Lack of convergence
Lack of convergence happens when:
- reward stagnates,
- policy loss/critic loss do not decrease,
- the agent does not learn even though it has a correct model.
The reason is again related to normalization. Gradient descent cannot take “balanced” steps when the features have different scales. For example, a feature with range 0–500 completely dominates a feature with range 0–1.
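The 0–500 versus 0–1 example can be checked numerically. The sketch below uses a plain linear model with a mean-squared-error loss as a stand-in for a neural network layer. The target function, the sample size and the helper `mse_gradient` are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
big = rng.uniform(0.0, 500.0, size=n)   # feature with range 0-500
small = rng.uniform(0.0, 1.0, size=n)   # feature with range 0-1
X = np.column_stack([big, small])
y = 0.01 * big + 2.0 * small            # hypothetical target: both features matter


def mse_gradient(X, y, w):
    """Gradient of the mean squared error for the linear model y_hat = X @ w."""
    residual = X @ w - y
    return 2.0 * X.T @ residual / len(y)


w0 = np.zeros(2)

# Raw features: compare the gradient size per weight.
grad_raw = mse_gradient(X, y, w0)
ratio_raw = abs(grad_raw[0] / grad_raw[1])

# Min-max scaled features (x_min = 0 for both, so this is a division by x_max).
X_scaled = X / np.array([500.0, 1.0])
grad_scaled = mse_gradient(X_scaled, y, w0)
ratio_scaled = abs(grad_scaled[0] / grad_scaled[1])

print(f"gradient ratio, raw features:    {ratio_raw:.1f}")
print(f"gradient ratio, scaled features: {ratio_scaled:.2f}")
```

On the raw features the gradient for the large-range weight is hundreds of times bigger, so a single learning rate is either too large for one weight or too small for the other. After scaling, the two gradients have comparable size.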
Real case: mobile robot

In a mobile robot, the distance sensor (0–5m) completely dominates the speed sensor (0–1 m/s). The agent ignores speed.
Observations:
- obstacle distance: 0–5 m,
- speed: 0–1 m/s.
If you DO NOT normalize:
- 5 → appears as a huge number,
- 1 → appears as a small, “unimportant” number.
What does the agent think?
- “Distance is EVERYTHING.”
- “Speed? A small, insignificant noise.”
What happens?
- the agent only learns reactive behaviors over distance,
- controls speed poorly,
- reward fluctuates,
- trajectory becomes unstable,
- training takes unnecessarily long.
This is exactly the perfect example of why min-max normalization matters enormously in RL.
THE MIN-MAX NORMALIZATION FORMULA EXPLAINED
The formula is:
\[ x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \]
Where:
- x – the number we want to convert,
- xmin – the smallest value. It tells us where the scale starts,
- xmax – the largest value. It tells us how far the scale goes,
- x’ – the normalized value, which lies between 0 and 1 whenever x lies between xmin and xmax.
This formula helps neural networks, RL agents, and any algorithm treat all values proportionally, without one dominating.
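As a small worked example of the formula, here is a sketch in plain Python. The function name `min_max` and the sample values are illustrative choices, not taken from a library.

```python
def min_max(x, x_min, x_max):
    """Map x from [x_min, x_max] to [0, 1]."""
    if x_max == x_min:
        raise ValueError("x_max must differ from x_min")
    return (x - x_min) / (x_max - x_min)


heights_cm = [120.0, 135.0, 150.0]   # hypothetical values
x_min, x_max = min(heights_cm), max(heights_cm)
normalized = [min_max(h, x_min, x_max) for h in heights_cm]

print(normalized)   # the smallest maps to 0, the largest to 1, the middle to 0.5
```

The guard against `x_max == x_min` matters in practice: a constant feature would otherwise cause a division by zero.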
SCALING TO CUSTOM RANGES: THE GENERAL [a, b] MAPPING
The formula is:
\[ x' = a + \frac{(x - x_{\min})(b - a)}{x_{\max} - x_{\min}} \]
Where:
- x – is the number we want to convert,
- xmin – the smallest value. It helps us know “where the scale starts“,
- xmax – the highest value. It shows us how far the scale goes,
- a – the start of the new scale,
- b – the end of the new scale,
- (x – xmin) – how far x is above the minimum,
- (b – a) – the length of the new scale,
- (x−xmin)(b−a) – the distance above the minimum multiplied by the new length, before dividing by the original length,
- xmax−xmin – the length of the original scale, from minimum to maximum.
The formula takes a number and moves it to a new ruler, which can start anywhere and end anywhere.
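The general mapping is a one-liner in code. In this sketch `rescale` is a hypothetical helper name, and the LiDAR range of 0 to 12 m is only an example value.

```python
def rescale(x, x_min, x_max, a, b):
    """Map x from [x_min, x_max] to the custom range [a, b]."""
    return a + (x - x_min) * (b - a) / (x_max - x_min)


# A 3 m reading on a 0-12 m scale, moved to the range [-1, 1].
lidar_scaled = rescale(3.0, 0.0, 12.0, -1.0, 1.0)

# With a = 0 and b = 1 the general formula reduces to plain min-max.
same_as_min_max = rescale(3.0, 0.0, 12.0, 0.0, 1.0)

print(lidar_scaled, same_as_min_max)
```

A quarter of the way along the original scale becomes a quarter of the way along the new one, whatever the new endpoints are.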
HOW TO CHOOSE xmin AND xmax (The Only Correct Way in RL)
In Reinforcement Learning, the choice of xmin and xmax is one of the most important decisions when applying min-max normalization.
The two values (xmin and xmax) are NOT taken “on the fly“. In RL, they are chosen in a controlled manner, before training, from the sensor spec or from a stable collection of values, not from a random episode.
If you want to know who is the smallest and the tallest in the class, you don’t look at a different class every day. You look at your class once, see the shortest and tallest child, and from then on all the children are compared to them. That’s how you do it in RL. You pick a single ‘smallest‘ and a single ‘tallest‘ and keep those values throughout the learning process.

SOURCE #1 – Sensor / Robot Specification (the most accurate in robotics)
When do you use it?
Real robot or physical simulation where you know the maximum/minimum possible values of each observation.
Examples:
- LiDAR 0 –> 12 m,
- Speed encoder 0 –> 1.2 m/s,
- Heading error −π –> +π,
- Position on axis (known range).
Advantages:
- Maximum stability: all episodes use the same scale –> more stable learning,
- There is no “range shift” between episodes,
- We avoid catastrophe: if x_min and x_max change during training, the agent sees a different world in each episode.
Disadvantages:
- Requires knowledge of the sensor specifications (in real robotics this is not a problem).
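A spec-based normalizer can be as simple as two constant arrays. The sketch below reuses the example ranges listed above (LiDAR, speed encoder, heading error). The observation layout and the names `SPEC_LOW`, `SPEC_HIGH` and `normalize_from_spec` are assumptions for illustration.

```python
import numpy as np

# Fixed ranges from the sensor spec, in the order: LiDAR [m], speed [m/s], heading error [rad].
SPEC_LOW = np.array([0.0, 0.0, -np.pi])
SPEC_HIGH = np.array([12.0, 1.2, np.pi])


def normalize_from_spec(obs):
    """Scale each observation component to [0, 1] with constant, spec-defined bounds."""
    obs = np.asarray(obs, dtype=np.float64)
    return (obs - SPEC_LOW) / (SPEC_HIGH - SPEC_LOW)


obs = [6.0, 0.6, 0.0]   # mid-range reading on every sensor
normalized = normalize_from_spec(obs)
print(normalized)
```

Because the bounds are constants, every episode is scaled in exactly the same way, which is the stability property described above.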
SOURCE #2 – Min/max collected from warmup episodes before training
How is it practiced in RL?
- STEP 1: run 10–20 episodes with a random policy,
- STEP 2: Collect all raw observations,
- STEP 3: Calculate x_min and x_max only once, before the agent starts learning.
Advantages:
- You get a realistic scale for environments where you don’t have technical specs (e.g. Gymnasium),
- The values do not change during training –> stability.
Disadvantage:
- If during training the agent reaches states that the warmup did not see, values outside the range may appear –> need to clip.
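The three steps, plus the clipping mentioned in the disadvantage, might look like the sketch below. To keep it self-contained, the warmup observations are simulated with random numbers. In a real setup they would come from rolling out a random policy in your environment. `fit_bounds` and `normalize_clipped` are hypothetical helper names.

```python
import numpy as np


def fit_bounds(warmup_obs):
    """Compute per-feature x_min and x_max once, from the warmup data."""
    warmup_obs = np.asarray(warmup_obs, dtype=np.float64)
    return warmup_obs.min(axis=0), warmup_obs.max(axis=0)


def normalize_clipped(obs, x_min, x_max):
    """Min-max scale with frozen bounds, then clip anything outside [0, 1]."""
    scaled = (np.asarray(obs, dtype=np.float64) - x_min) / (x_max - x_min)
    return np.clip(scaled, 0.0, 1.0)


rng = np.random.default_rng(0)
# Stand-in for the raw observations of 10-20 random-policy episodes (2 features).
warmup_obs = rng.uniform(low=[-1.0, 0.0], high=[1.0, 5.0], size=(2000, 2))

x_min, x_max = fit_bounds(warmup_obs)   # computed once, never updated afterwards

unseen = np.array([1.5, 6.0])           # a state the warmup never reached
clipped = normalize_clipped(unseen, x_min, x_max)
print(clipped)
```

Clipping keeps the inputs inside the range the network was trained on, at the cost of making all out-of-range states look the same. If a feature never changes during warmup, `x_max - x_min` is zero and that feature needs special handling.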
SOURCE #3 – Global min/max for the whole environment (ideal for Gymnasium)
Gym environments almost always have an “official range” for the observation space:
Example CartPole:
- cart position = [−4.8, 4.8],
- speed = [−∞, ∞] (clip to ±3 in practice),
- angle = [−0.418, 0.418].
Example MountainCar:
- position = [−1.2, 0.6],
- speed = [−0.07, 0.07].
These values are directly exposed by the environment:
- env.observation_space.low
- env.observation_space.high
Advantages:
- Perfectly stable, official range,
- No drift,
- Perfect for PPO/DQN/SAC.
Disadvantages:
- Some envs have range = ±inf –> you have to choose a clip manually.
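One way to handle infinite bounds is to replace only those entries with a manual clip and keep the official values everywhere else. The sketch below hard-codes the three CartPole bounds listed above instead of creating the environment, so it runs without Gymnasium. With a real environment the `low` and `high` arrays would come from `env.observation_space`. The ±3 clip is the manual choice mentioned above.

```python
import numpy as np

# Bounds as listed above, in the order: cart position, speed, angle.
low = np.array([-4.8, -np.inf, -0.418])
high = np.array([4.8, np.inf, 0.418])

# Manual bounds, used only where the official bound is infinite (assumption: +/-3).
manual_low = np.array([-4.8, -3.0, -0.418])
manual_high = np.array([4.8, 3.0, 0.418])

safe_low = np.where(np.isfinite(low), low, manual_low)
safe_high = np.where(np.isfinite(high), high, manual_high)


def normalize(obs):
    """Clip to the finite bounds, then min-max scale to [0, 1]."""
    obs = np.clip(obs, safe_low, safe_high)
    return (obs - safe_low) / (safe_high - safe_low)


normalized = normalize(np.array([0.0, 7.5, 0.209]))
print(normalized)   # the speed of 7.5 is clipped to 3.0 before scaling
```

If an environment reports a very large finite number instead of infinity, the `np.isfinite` test will not catch it, and a magnitude threshold would be needed instead.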
WHEN MIN-MAX HELPS AND WHEN IT BREAKS YOUR MODEL
1. When Min–Max Works Perfectly
Min-max is effective when you know exactly how small and large your values can be.
In general RL:
- Works well when observations have clear boundaries.
Examples: position in CartPole, position/velocity in MountainCar, normalized inputs for PPO/DQN.
In robotics:
- Sensors with fixed physical boundaries (LiDAR 0–30 m, IMU -1/+1, encoders with known range).
If you know the boundaries, min-max produces a stable and safe scale for the agent.
2. When Min–Max Fails (and Why)
Min-max becomes dangerous when the values have no clear boundaries or change over time.
In general RL:
- If observations can become much larger or smaller along the way, min–max breaks down,
- If the agent explores new states beyond the initial range, values >1 or <0 appear –> destabilization.
In robotics:
- If the robot gets into unexpected situations (e.g. vibrations, a faulty sensor, rooms with extreme brightness), the real range changes –> min–max produces incorrect values, resulting in chaotic behavior.
3. Outliers: The Silent Model Killers
A single very abnormal number can destroy normalization.
In general RL:
- If xmax is taken from data that contains a spike (e.g. sudden noise from the sensor), all ordinary values are squeezed toward the same number –> the agent does not learn subtle differences.
In robotics:
- LiDAR suddenly sees 999 m due to a reflection –> if that value sets the range, min-max “flattens” all the ordinary readings,
- An IMU value becomes corrupted –> all other values become almost 0 after normalization.
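The 999 m case is easy to reproduce. In this sketch the readings are invented, and the 0–12 m spec range is borrowed from the LiDAR example earlier in the tutorial. It compares bounds taken from the data with fixed spec bounds plus clipping.

```python
def min_max(x, lo, hi):
    """Plain min-max scaling to [0, 1]."""
    return (x - lo) / (hi - lo)


readings = [2.0, 2.5, 3.0, 999.0]   # metres; 999.0 is a spurious reflection

# Bounds taken from the data: the outlier defines x_max.
data_lo, data_hi = min(readings), max(readings)
flattened = [min_max(r, data_lo, data_hi) for r in readings]

# Bounds taken from the sensor spec, with clipping before scaling.
SPEC_LO, SPEC_HI = 0.0, 12.0
robust = [min_max(min(max(r, SPEC_LO), SPEC_HI), SPEC_LO, SPEC_HI) for r in readings]

print(flattened)   # the three real readings are all squeezed close to 0
print(robust)      # the real readings keep their spacing; the spike saturates at 1
```

With data-derived bounds, a single bad reading decides the scale for everything else. With spec bounds and clipping, the damage stays limited to that one reading.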
4. Static Normalization vs Dynamic RL Environments
Min–max assumes that the world does NOT change. RL assumes that the world IS always changing.
In general RL:
- If the range of observations changes over time, min-max starts to give values that do not match what the agent has seen before,
- The agent “learns and then forgets“.
In robotics:
- In the real world, the range of sensors seems constant, but in practice it changes (light, surface, vibrations, unexpected obstacles),
- Static min-max can become “wrong min–max“.
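A cheap way to notice that a static range has become the “wrong” range is to track how often raw observations fall outside it. The helper below is a hypothetical monitoring sketch, not something provided by Gymnasium or Stable-Baselines3, and the batch values are invented.

```python
import numpy as np


def out_of_range_fraction(batch, low, high):
    """Per-feature fraction of observations that fall outside [low, high]."""
    batch = np.asarray(batch, dtype=np.float64)
    outside = (batch < low) | (batch > high)
    return outside.mean(axis=0)


low = np.array([0.0, 0.0])
high = np.array([1.0, 5.0])

# Hypothetical batch of raw observations (4 steps, 2 features).
batch = [[0.5, 1.0],
         [1.5, 2.0],
         [0.2, 6.0],
         [0.9, 3.0]]

frac = out_of_range_fraction(batch, low, high)
print(frac)   # one of four samples is out of range for each feature
```

If this fraction grows during training, the fixed bounds no longer describe what the agent sees, and it may be time to revisit them or switch to another normalization scheme.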
5. Min–Max for Actions vs Observations
Observations and actions have very different roles.
States: Min–max is good if the range is stable.
Actions: It’s like having a car where the steering wheel can only turn between -1 and 1. If you normalize poorly, the car steers wrong.
In general RL:
- Most algorithms require normalized actions between -1 and 1 (SAC, DDPG, TD3),
- Min–max is appropriate here.
In robotics:
- Motors, servos, torques have real physical limits,
- Min–max can elegantly bring them into a uniform range for the agent.
PRACTICAL CHECKLIST FOR CHOOSING THE RIGHT NORMALIZATION
This checklist is short and useful:
- Are my values fixed in real life? If yes –> min–max is good.
- Are there outliers or weird spikes? If yes –> min–max is risky.
- Will the agent visit new states later? If yes –> min–max can give values >1 or <0.
- Do I need stable training? RL does –> use consistent normalization.
- Do actions have physical limits? If yes –> normalize them correctly with min–max.
- Does the environment drift? If yes –> min–max becomes fragile.
- Do I use PPO/SAC? Then normalization is very important for gradient stability.
PRACTICAL EXPERIMENTS: TRAINING WITH vs. WITHOUT NORMALIZATION
LunarLanderContinuous is an excellent environment for demonstrating how normalization can help – but also how it can break a reinforcement learning agent when used incorrectly. It is part of Gymnasium (the maintained successor to OpenAI Gym) and very easy to run.
Note: If you don’t have Gymnasium, SB3 and PyTorch installed, follow this tutorial: Tutorial: How to Install Stable-Baselines3 the Right Way (Windows & Linux): PyTorch + Gymnasium
The LunarLanderContinuous-v3 environment
LunarLanderContinuous has observations with extremely different scales:
- coordinates (−1…+1)
- velocities (−∞…+∞)
- angles (−π…+π)
- large angular velocities
- binary contact signals
Because these signals live on different scales, you might think that Min-Max normalization is always helpful. Yes, it is helpful, but NOT in all cases.
Why Min-Max Normalization Can BREAK PPO on LunarLanderContinuous
In RL, exploration depends heavily on the natural differences between states.
When you apply Min-Max:
- big differences between states get compressed into the range [0,1],
- small signals become almost invisible,
- the policy sees “flat”, overly-similar observations,
- PPO receives a weaker learning signal,
- exploration collapses,
- the agent can get stuck in a bad policy early in training.
As a result:
- Without normalization –> the agent learns very well,
- With Min-Max normalization –> the agent can fail completely.
Exactly as shown in the training curves below:
- reward increases smoothly without normalization,
- while Min-Max normalization keeps the agent stuck at low reward.
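The compression effect described above can be illustrated with a few numbers. The sketch assumes a velocity component whose declared bound is ±10 while the values seen in practice stay within ±0.5. Both numbers are illustrative, so check `env.observation_space` for the real bounds of your environment.

```python
BOUND = 10.0                             # assumed declared bound: [-10, +10]
typical = [-0.5, -0.1, 0.0, 0.1, 0.5]    # hypothetical values seen during training

# Min-max with the declared bounds: (v - low) / (high - low).
normalized = [(v + BOUND) / (2 * BOUND) for v in typical]

spread_raw = max(typical) - min(typical)
spread_norm = max(normalized) - min(normalized)

print(normalized)                # everything lands in a narrow band around 0.5
print(spread_raw, spread_norm)   # the usable spread shrinks by a factor of 20
```

When the declared bounds are much wider than the values the agent actually visits, the normalized inputs barely move, which is one plausible reason why the policy sees “flat” observations.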

PPO + LunarLanderContinuous-v3 — Complete script with/without MinMax Normalization
This Python script (ppo_lunarlander.py) uses Stable Baselines3 for PPO and includes the optional Min-Max normalization wrapper.
"""
PPO Training and Demo Script for LunarLanderContinuous-v3
---------------------------------------------------------
Train or test a PPO agent using Stable Baselines3.
Supports training WITH or WITHOUT Min-Max Normalization.
RUN EXAMPLES:
# Without normalization (raw observations)
python ppo_lunarlander.py --train --lr 3e-4 --timesteps 300000
# With Min-Max normalization
python ppo_lunarlander.py --train --normalize --lr 3e-4 --timesteps 300000
# Demo
python ppo_lunarlander.py --demo --model <path>
Author: Calin Dragos George
Updated: 22 November 2025
"""
import argparse
import os
import time
import gymnasium as gym
import numpy as np
import torch
from stable_baselines3 import PPO
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.logger import configure
# ---------------------------------------------------------
# Utility: Set seeds for reproducibility
# ---------------------------------------------------------
def set_seed(seed):
    """Set random seeds for reproducibility."""
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
# ---------------------------------------------------------
# Min-Max Observation Normalization Wrapper
# ---------------------------------------------------------
class MinMaxObservationWrapper(gym.ObservationWrapper):
    """
    Normalize observations to the [0,1] range using:
    (x - min) / (max - min)
    based on the environment's observation_space bounds.
    """
    def __init__(self, env):
        super().__init__(env)
        self.low = env.observation_space.low
        self.high = env.observation_space.high
        # New normalized observation space is [0,1]
        self.observation_space = gym.spaces.Box(
            low=np.zeros_like(self.low),
            high=np.ones_like(self.high),
            dtype=np.float32,
        )
    def observation(self, obs):
        # Perform Min-Max normalization safely
        return (obs - self.low) / (self.high - self.low + 1e-8)
# ---------------------------------------------------------
# Create LunarLanderContinuous Environment (with or without normalization)
# ---------------------------------------------------------
def make_env(normalize=False):
    """
    Create the LunarLanderContinuous-v3 environment.
    If normalize=True, apply Min-Max Observation Normalization.
    """
    env = gym.make("LunarLanderContinuous-v3")
    if normalize:
        env = MinMaxObservationWrapper(env)
    env = Monitor(env)
    return env
# ---------------------------------------------------------
# Create PPO Model
# ---------------------------------------------------------
def create_model(env, lr, log_dir):
    model = PPO(
        "MlpPolicy",
        env,
        learning_rate=lr,
        n_steps=2048,
        batch_size=64,
        gamma=0.99,
        gae_lambda=0.95,
        ent_coef=0.01,
        clip_range=0.2,
        vf_coef=0.5,
        max_grad_norm=0.5,
        verbose=1,
        tensorboard_log=log_dir,
    )
    logger = configure(log_dir, ["tensorboard"])
    model.set_logger(logger)
    model.logger.record("hyperparams/learning_rate", lr)
    model.logger.record("hyperparams/n_steps", 2048)
    model.logger.record("hyperparams/ent_coef", 0.01)
    return model
# ---------------------------------------------------------
# Train PPO
# ---------------------------------------------------------
def train_ppo(lr, timesteps, seed, normalize):
    set_seed(seed)
    env = make_env(normalize=normalize)
    timestamp = time.strftime("%Y%m%d-%H%M%S")
    norm_flag = "norm" if normalize else "raw"
    log_dir = f"logs/PPO_Lunar_{norm_flag}_lr_{lr}_seed_{seed}_{timestamp}"
    os.makedirs(log_dir, exist_ok=True)
    print("\n Training PPO on LunarLanderContinuous-v3")
    print(f"→ Learning Rate: {lr}")
    print(f"→ Seed: {seed}")
    print(f"→ Normalize: {normalize}")
    print(f"→ Logging to: {log_dir}\n")
    model = create_model(env, lr, log_dir)
    model.learn(total_timesteps=timesteps, progress_bar=True)
    model_filename = f"PPO_LunarLander_{norm_flag}_lr-{lr}_seed-{seed}_{timestamp}_{timesteps}steps.zip"
    model_path = os.path.join(log_dir, model_filename)
    model.save(model_path)
    print(f"\n Model saved to: {model_path}\n")
    env.close()
# ---------------------------------------------------------
# Demo PPO Model
# ---------------------------------------------------------
def run_demo(model_path, episodes=3, normalize=False):
    if not os.path.exists(model_path):
        print(f"\n Model not found: {model_path}\n")
        return
    print(f"\n Running Demo: {model_path}\n")
    # The demo must use the same observation pipeline as training:
    # pass --normalize when the model was trained with --normalize.
    env = make_env(normalize=normalize)
    model = PPO.load(model_path)
    for ep in range(episodes):
        obs, _ = env.reset()
        done = False
        total_reward = 0
        while not done:
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, terminated, truncated, _ = env.step(action)
            total_reward += reward
            done = terminated or truncated
        print(f"Episode {ep+1} Reward = {total_reward}")
    env.close()
# ---------------------------------------------------------
# CLI with automatic multi-seed training
# ---------------------------------------------------------
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--train", action="store_true")
    parser.add_argument("--demo", action="store_true")
    parser.add_argument("--lr", type=float, default=3e-4)
    parser.add_argument("--timesteps", type=int, default=300_000)
    parser.add_argument("--seed", type=int, default=None)
    parser.add_argument("--model", type=str, default=None)
    parser.add_argument("--episodes", type=int, default=3)
    parser.add_argument("--normalize", action="store_true",
                        help="Use Min-Max normalization on observations.")
    args = parser.parse_args()
    if args.train:
        if args.seed is None:
            # No seed given: train once per default seed
            seeds = [1, 2]
            print(f"\n Running PPO with automatic multi-seed training: {seeds}\n")
            for s in seeds:
                print(f"\n========== TRAINING SEED {s} ==========\n")
                train_ppo(args.lr, args.timesteps, s, args.normalize)
        else:
            train_ppo(args.lr, args.timesteps, args.seed, args.normalize)
    elif args.demo:
        if args.model is None:
            print("\n ERROR: missing --model path\n")
        else:
            run_demo(args.model, args.episodes, args.normalize)
    else:
        print("\n Please specify --train or --demo.\n")
How to Run It
1. Without normalization (RAW observations)
This trains PPO on raw observations, without the wrapper.
python ppo_lunarlander.py --train --lr 3e-4 --timesteps 300000
2. With Min-Max Normalization
This applies the MinMaxObservationWrapper to normalize the observations to [0,1]. Everything else is identical to the previous command.
python ppo_lunarlander.py --train --normalize --lr 3e-4 --timesteps 300000
3. DEMO
python ppo_lunarlander.py --demo --model <path>
COMMON QUESTIONS DEVELOPERS ASK ABOUT MIN-MAX SCALING
1. Min-max or Z-score in RL?
Imagine you have two types of friends:
- some very short,
- some very tall.
If you want everyone to fit on a tiny bus, you “squeeze” them to the same size. You can do this in two ways:
1.1 Min-Max
It’s like taking the smallest and the tallest child in the class and making them all fit between 0 and 1. But if a giant 2-meter-tall child comes along, the whole bus breaks down – everyone else becomes too small.
1.2 Z-Score
This is like saying, “Let’s see how far you are from the class average.” It works better if you have giant or dwarf children, because it doesn’t make the others “invisible.”
The simple conclusion
- Min-Max = good when the values are normal and you don’t have extremes,
- Z-Score = good when you have crazy, very high or very low values.
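Important: Stable-Baselines3 offers Z-Score-style normalization (running mean and standard deviation) through its VecNormalize wrapper, which you have to add to the environment yourself.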
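In RL, z-score normalization is usually computed from running statistics, because the full dataset is never available up front. The class below is a minimal NumPy sketch of that idea. It is not the Stable-Baselines3 implementation, and the name `RunningNormalizer` and the sample batches are made up for illustration.

```python
import numpy as np


class RunningNormalizer:
    """Track a running mean/variance per feature and z-score new observations."""

    def __init__(self, shape, eps=1e-8):
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.count = 0
        self.eps = eps

    def update(self, batch):
        """Merge the statistics of a new batch into the running statistics."""
        batch = np.asarray(batch, dtype=np.float64)
        b_mean, b_var, b_count = batch.mean(axis=0), batch.var(axis=0), batch.shape[0]
        delta = b_mean - self.mean
        total = self.count + b_count
        new_mean = self.mean + delta * b_count / total
        # Combine the two sums of squared deviations (parallel variance formula).
        m2 = (self.var * self.count + b_var * b_count
              + delta ** 2 * self.count * b_count / total)
        self.mean, self.var, self.count = new_mean, m2 / total, total

    def normalize(self, obs):
        """Z-score an observation with the current running statistics."""
        return (obs - self.mean) / np.sqrt(self.var + self.eps)


rng = np.random.default_rng(0)
first = rng.normal(5.0, 2.0, size=(100, 3))    # hypothetical observation batches
second = rng.normal(-1.0, 0.5, size=(50, 3))

normalizer = RunningNormalizer(shape=(3,))
normalizer.update(first)
normalizer.update(second)
print(normalizer.normalize(first[0]))
```

Unlike fixed min-max bounds, these statistics keep changing during training, so they have to be saved with the model and frozen at evaluation time.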
2. Do I need to normalize the actions as well?
Imagine your robot has two motors:
- one can push hard from −100 to +100,
- the other only from −2 to +2.
If you give both raw commands, the robot will think: “The big motor is IMPORTANT, the small motor doesn’t matter.”
What do you do?
You put both motors on the same scale: normalized actions in [−1, +1].
This means that the robot learns correctly: “Both motors are important, they just have different strengths.”
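For the two motors above, the mapping between the agent's [−1, +1] actions and the physical commands could be sketched as follows. The limits come from the example, while the function names `to_physical` and `to_normalized` are hypothetical.

```python
import numpy as np

# Physical limits of the two motors from the example above.
ACTION_LOW = np.array([-100.0, -2.0])
ACTION_HIGH = np.array([100.0, 2.0])


def to_physical(action):
    """Map a normalized action in [-1, 1] to the motors' physical ranges."""
    action = np.clip(action, -1.0, 1.0)
    return ACTION_LOW + (action + 1.0) * 0.5 * (ACTION_HIGH - ACTION_LOW)


def to_normalized(command):
    """Inverse mapping: physical command back to [-1, 1]."""
    return 2.0 * (command - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW) - 1.0


command = to_physical(np.array([0.5, -0.5]))
print(command)                  # half power forward on motor 1, half power back on motor 2
print(to_normalized(command))   # recovers the normalized action
```

The policy only ever works with values in [−1, +1], and the conversion to real units happens at the boundary between the agent and the robot.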





