AI Robotics: Tutorials, Practical Reinforcement Learning, and Real-World Control

Hands-On: Min-Max Normalization In Action

by Dragos Calin
in RL Fundamentals

If the observations are not normalized, your agent will learn with the brakes on – whether you use PPO, SAC or DQN. In this tutorial, you will see exactly why, how to normalize correctly and how to stabilize your training.

Who is this tutorial for?

This tutorial is for:

  • Those who are new to RL/ML and are encountering feature scaling for the first time,
  • Practitioners who want a stable pipeline for PPO/DQN/SAC,
  • Those who have models that do not converge or converge very slowly,
  • Those who work in robotics and want consistent observations for LiDAR/IMU/encoders,
  • Those who suspect that the instability comes from faulty scaling.
Min-Max Normalization

TABLE OF CONTENTS

In this tutorial, I’ll cover the following subjects:

  • INTRODUCTION
  • REAL-WORLD FAILURE SYMPTOMS IN ML & RL
  • THE MIN-MAX NORMALIZATION FORMULA EXPLAINED
  • SCALING TO CUSTOM RANGES: THE GENERAL [a, b] MAPPING
  • HOW TO CHOOSE xmin AND xmax (The Only Correct Way in RL)
  • WHEN MIN-MAX HELPS AND WHEN IT BREAKS YOUR MODEL
  • PRACTICAL CHECKLIST FOR CHOOSING THE RIGHT NORMALIZATION
  • PRACTICAL EXPERIMENTS: TRAINING WITH vs. WITHOUT NORMALIZATION
  • COMMON QUESTIONS DEVELOPERS ASK ABOUT MIN-MAX SCALING

INTRODUCTION

Scaling is a hidden “hyperparameter”. The choice between min-max, z-score, robust scaling, etc. can influence the model more than a small tuning of learning rate or number of neurons.

Why is it important in RL?

The agent sees the world in numbers. If one number is much larger than the others, the agent thinks it is the most important – even if it is not.

What min-max does is take the smallest number and the largest number in a list and make all the numbers fit between 0 and 1.

You can break the whole pipeline with a small error. Fit scaler on the entire dataset → data leakage → “magical” results in validation, catastrophic in production.
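A minimal sketch of the leakage-free order of operations, in plain Python. The helper names (`fit_min_max`, `apply_min_max`) and the sample numbers are hypothetical and chosen only for illustration: the scaler is fitted on the training split alone and then reused unchanged on the validation split.

```python
def fit_min_max(values):
    """Return (x_min, x_max) computed from the given values only."""
    return min(values), max(values)


def apply_min_max(values, x_min, x_max):
    """Scale values with previously fitted bounds (no refitting)."""
    span = x_max - x_min
    return [(v - x_min) / span for v in values]


# Hypothetical data: the validation split contains a larger value than training.
train = [2.0, 4.0, 6.0, 10.0]
valid = [3.0, 12.0]

# Correct: fit on the training split only, then reuse the same bounds.
x_min, x_max = fit_min_max(train)
valid_scaled = apply_min_max(valid, x_min, x_max)

# Leaky: fitting on train + valid lets validation data shape the scaler.
leaky_min, leaky_max = fit_min_max(train + valid)
valid_leaky = apply_min_max(valid, leaky_min, leaky_max)

print(valid_scaled)  # [0.125, 1.25]: the value above 1 honestly flags unseen data
print(valid_leaky)   # [0.1, 1.0]: fits in [0, 1] only because the scaler saw it
```

In the leaky variant the validation numbers look tidy, but only because the scaler was allowed to see them in advance, which is not possible in production.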

Magical Validation Results OR Catastrophic Failure in Production

Simple analogy

Think of your agent as a student who gets grades from 1 to 10. If he suddenly gets a ‘score’ of 5000, he will think that 5000 is the most important, even if it is just a different type of measurement. Normalization transforms all the grades into the same system.


REAL-WORLD FAILURE SYMPTOMS IN ML & RL

Without normalization, Reinforcement Learning (RL) models behave badly and become unstable. Here’s what it looks like in the real world: jumping rewards, conflicting policies, exploding gradients, and zero convergence.

In robotics, the effect is even clearer: a sensor with a large range completely dominates the rest of the observations – the agent learns unbalanced behavior.

1. Unstable rewards

What happens when the reward becomes unstable?

  • the reward increases, decreases, explodes for no reason,
  • two consecutive episodes are not similar.

These two symptoms occur when the scaling is wrong: some observations have much larger ranges than others, so the agent ends up with “hot spots” that dominate the gradient.

2. Oscillatory policies

An oscillatory policy is another symptom of incorrect normalization.

This leads the agent to:

  • learn a behavior → lose it → learn another → lose it,
  • seem “schizophrenic” in actions.

The above behaviors are closely related to normalization. If the distribution of observations changes suddenly (due to the range), the policy is completely repositioned at each batch, and that is what produces the oscillation.

3. Gradient explosion

What does gradient explosion mean?

  • loss becomes NaN,
  • updates become huge,
  • neural network collapses.

The reason is also related to normalization. Neural networks are extremely sensitive to scales. Large values in observations produce enormous gradients, and PPO/DQN/SAC training breaks down.

4. Lack of convergence

Lack of convergence happens when:

  • reward stagnates,
  • policy loss/critic loss does not decrease,
  • the agent does not learn even though it has a correct model.

The reason is also related to normalization. Gradient descent cannot take “balanced” steps when the data has different scales. For example, a feature with range 0–500 completely dominates a feature with range 0–1.
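The imbalance can be seen with a tiny linear model. This is only an illustrative sketch: the `gradient` helper, the squared-error loss and the sample observation are assumptions made for the example, not part of any RL library.

```python
def gradient(weights, features, target):
    """Gradient of 0.5 * (w . x - target)^2 with respect to each weight."""
    prediction = sum(w * x for w, x in zip(weights, features))
    error = prediction - target
    return [error * x for x in features]


weights = [0.0, 0.0]
target = 1.0

# Raw observation: feature 0 lives in 0-500, feature 1 lives in 0-1.
raw = [400.0, 0.8]
grad_raw = gradient(weights, raw, target)

# Same observation after min-max scaling each feature by its own range.
scaled = [400.0 / 500.0, 0.8 / 1.0]
grad_scaled = gradient(weights, scaled, target)

print(grad_raw)     # [-400.0, -0.8]: the first weight gets a 500x larger update
print(grad_scaled)  # [-0.8, -0.8]: both weights are updated at the same rate
```

With a single learning rate, the raw case forces a choice between a step that is too large for the first weight or too small for the second.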


Real case: mobile robot

Distance sensor vs speed sensor in RL

In a mobile robot, the distance sensor (0–5m) completely dominates the speed sensor (0–1 m/s). The agent ignores speed.

Observations:

  • obstacle distance: 0–5 m,
  • speed: 0–1 m/s.

If you DO NOT normalize:

  • 5 → appears as a huge number,
  • 1 → appears as a small, “unimportant” number.

What does the agent think?

  • “Distance is EVERYTHING.”
  • “Speed? A small, insignificant noise.”

What happens?

  • the agent only learns reactive behaviors over distance,
  • controls speed poorly,
  • reward fluctuates,
  • trajectory becomes unstable,
  • training takes unnecessarily long.

This is exactly the perfect example of why min-max normalization matters enormously in RL.


THE MIN-MAX NORMALIZATION FORMULA EXPLAINED

The formula is:

\[ x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \]

Where:

  • x – is the number we want to convert,
  • xmin – the smallest value. It helps us know “where the scale starts”,
  • xmax – the highest value. It shows us how far the scale goes,
  • x’ – is the normalized value, between 0 and 1. It’s like placing every height on the same ruler that runs from 0 to 1.

This formula helps neural networks, RL agents, and any algorithm treat all values proportionally, without one dominating.
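As a quick illustration, the formula translates almost directly into Python. The function name `min_max` and the 0–5 m distance reading are assumptions made for this sketch.

```python
def min_max(x, x_min, x_max):
    """Map x from [x_min, x_max] to [0, 1]."""
    if x_max <= x_min:
        raise ValueError("x_max must be greater than x_min")
    return (x - x_min) / (x_max - x_min)


# Hypothetical distance reading of 1.5 m on a sensor that measures 0-5 m.
print(min_max(1.5, 0.0, 5.0))  # 0.3
print(min_max(0.0, 0.0, 5.0))  # 0.0 (the minimum maps to 0)
print(min_max(5.0, 0.0, 5.0))  # 1.0 (the maximum maps to 1)
```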


SCALING TO CUSTOM RANGES: THE GENERAL [a, b] MAPPING

The formula is:

\[ x' = a + \frac{(x - x_{\min})(b - a)}{x_{\max} - x_{\min}} \]

Where:

  • x – is the number we want to convert,
  • xmin – the smallest value. It helps us know “where the scale starts”,
  • xmax – the highest value. It shows us how far the scale goes,
  • a – the beginning of the new scale. This is the number at which you want the interval to start,
  • b – the end of the new scale. This is the number at which you want the interval to end,
  • (x – xmin) – how far x is above the minimum,
  • (b – a) – the length of the new scale,
  • (x – xmin)(b – a) – the distance above the minimum, stretched by the length of the new scale, before dividing by the original range,
  • xmax – xmin – the length of the original scale: how large the original range is, from minimum to maximum.

The formula takes a number and moves it to a new ruler, which can start anywhere and end anywhere.
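A small sketch of the general mapping, assuming a hypothetical helper named `min_max_to_range` with a default target range of [−1, 1] (a common choice for neural-network inputs, picked here only as an example).

```python
def min_max_to_range(x, x_min, x_max, a=-1.0, b=1.0):
    """Map x from [x_min, x_max] to the custom range [a, b]."""
    if x_max <= x_min:
        raise ValueError("x_max must be greater than x_min")
    return a + (x - x_min) * (b - a) / (x_max - x_min)


# Hypothetical 0-5 m distance sensor mapped to [-1, 1].
print(min_max_to_range(0.0, 0.0, 5.0))  # -1.0
print(min_max_to_range(2.5, 0.0, 5.0))  # 0.0
print(min_max_to_range(5.0, 0.0, 5.0))  # 1.0

# With a=0 and b=1 the general formula reduces to plain min-max.
print(min_max_to_range(2.5, 0.0, 5.0, a=0.0, b=1.0))  # 0.5
```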


HOW TO CHOOSE xmin AND xmax (The Only Correct Way in RL)

In Reinforcement Learning, the choice of xmin and xmax is one of the most important decisions when applying min-max normalization.

The two values (xmin and xmax) are NOT taken “on the fly”. In RL, they are chosen in a controlled manner, before training, from the sensor spec or from a stable collection of values, not from a random episode.

If you want to know who is the smallest and the tallest in the class, you don’t look at a different class every day. You look at your class once, see the shortest and tallest child, and from then on all the children are compared to them. That’s how you do it in RL. You pick a single ‘smallest’ and a single ‘tallest’ and keep those values throughout the learning process.

The smallest and the tallest in the class

SOURCE #1 – Sensor / Robot Specification (the most accurate in robotics)

When do you use it?

Real robot or physical simulation where you know the maximum/minimum possible values of each observation.

Examples:

  • LiDAR 0 –> 12 m,
  • Speed encoder 0 –> 1.2 m/s,
  • Heading error −π –> +π,
  • Position on axis (known range).

Advantages:

  • Maximum stability: all episodes use the same scale –> more stable learning,
  • There is no “range shift” between episodes,
  • We avoid catastrophe: if x_min and x_max change during training, the agent sees a different world in each episode.

Disadvantages:

  • Requires knowledge of the sensor specifications (in real robotics this is not a problem).
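A minimal sketch of spec-based normalization, using the example ranges listed above as if they were the sensor specifications. The dictionary keys and the `normalize_observation` helper are hypothetical names for this illustration.

```python
import math

# Fixed bounds taken from the examples above (assumed sensor specifications).
SPEC_BOUNDS = {
    "lidar_m": (0.0, 12.0),
    "speed_mps": (0.0, 1.2),
    "heading_error_rad": (-math.pi, math.pi),
}


def normalize_observation(raw):
    """Scale each named reading to [0, 1] using its fixed spec bounds."""
    scaled = {}
    for name, value in raw.items():
        low, high = SPEC_BOUNDS[name]
        scaled[name] = (value - low) / (high - low)
    return scaled


obs = normalize_observation(
    {"lidar_m": 6.0, "speed_mps": 0.6, "heading_error_rad": 0.0}
)
print(obs)  # every reading sits at mid-range, so each value is 0.5
```

Because the bounds are constants, every episode is scaled the same way, which is the stability property described above.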

SOURCE #2 – Min/max collected from warmup episodes before training

How is it practiced in RL?

  • STEP 1: Run 10–20 episodes with a random policy,
  • STEP 2: Collect all raw observations,
  • STEP 3: Calculate x_min and x_max only once, before the agent starts learning.

Advantages:

  • You get a realistic scale for environments where you don’t have technical specs (e.g. Gymnasium),
  • The values do not change during training –> stability.

Disadvantage:

  • If during training the agent reaches states that the warmup did not see, values outside the range may appear –> need to clip.
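The warmup procedure, including the clipping step, can be sketched in plain Python. This is an assumption-laden illustration: the warmup data is generated with `random` as a stand-in for real random-policy rollouts, and `collect_bounds` and `normalize_clipped` are hypothetical helper names.

```python
import random


def collect_bounds(observations):
    """Per-dimension min and max over a list of observation vectors."""
    columns = list(zip(*observations))
    return [min(c) for c in columns], [max(c) for c in columns]


def normalize_clipped(obs, lows, highs):
    """Min-max scale with the frozen bounds, then clip to [0, 1]."""
    scaled = []
    for x, low, high in zip(obs, lows, highs):
        value = (x - low) / (high - low)
        scaled.append(min(1.0, max(0.0, value)))
    return scaled


# Stand-in for warmup data: in a real setup these vectors would come from
# running a random policy in the environment for 10-20 episodes.
rng = random.Random(0)
warmup = [[rng.uniform(-2.0, 2.0), rng.uniform(0.0, 50.0)] for _ in range(1000)]

lows, highs = collect_bounds(warmup)  # computed once, then frozen

# During training, a state outside the warmup range is clipped, not rescaled.
print(normalize_clipped([3.0, 25.0], lows, highs))
```

The first component (3.0) lies above anything the warmup saw, so it is clipped to 1.0 instead of silently stretching the scale.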

SOURCE #3 – Global min/max for the whole environment (ideal for Gymnasium)

Gymnasium environments almost always have an “official range” for the observation space:

Example CartPole:

  • cart position = [−4.8, 4.8],
  • speed = [−∞, ∞] (clip to ±3 in practice),
  • angle = [−0.418, 0.418].

Example MountainCar:

  • position = [−1.2, 0.6],
  • speed = [−0.07, 0.07].

These values are directly exposed by the environment:

  • env.observation_space.low
  • env.observation_space.high

Advantages:

  • Perfectly stable, official range,
  • No drift,
  • Perfect for PPO/DQN/SAC.

Disadvantages:

  • Some envs have range = ±inf –> you have to choose a clip manually.
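One way to handle infinite bounds is to substitute a hand-picked clip value before normalizing. The sketch below uses plain lists with the CartPole-style numbers listed above; in a real script the bounds would be read from `env.observation_space.low` and `env.observation_space.high`. The helper name `finite_bounds` and the clip values are assumptions for this example.

```python
import math


def finite_bounds(lows, highs, manual_clip):
    """Replace infinite bounds with hand-picked clip values."""
    fixed_lows, fixed_highs = [], []
    for low, high, clip in zip(lows, highs, manual_clip):
        fixed_lows.append(-clip if math.isinf(low) else low)
        fixed_highs.append(clip if math.isinf(high) else high)
    return fixed_lows, fixed_highs


# Bounds in the style of CartPole as listed above: position, speed, angle.
lows = [-4.8, -math.inf, -0.418]
highs = [4.8, math.inf, 0.418]

# The speed clip of 3 follows the rule of thumb above; it is a manual choice.
lows, highs = finite_bounds(lows, highs, manual_clip=[4.8, 3.0, 0.418])
print(lows)   # [-4.8, -3.0, -0.418]
print(highs)  # [4.8, 3.0, 0.418]
```

Observations should then be clipped to these bounds before the min-max formula is applied, otherwise values beyond the manual clip fall outside [0, 1].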

WHEN MIN-MAX HELPS AND WHEN IT BREAKS YOUR MODEL

1. When Min–Max Works Perfectly

Min-max is effective when you know exactly how small and large your values can be.

In general RL:

  • Works well when observations have clear boundaries.

Examples: position in CartPole, position/velocity in MountainCar, normalized inputs for PPO/DQN.

In robotics:

  • Sensors with fixed physical boundaries (LiDAR 0–30 m, IMU -1/+1, encoders with known range).

If you know the boundaries, min-max produces a stable and safe scale for the agent.


2. When Min–Max Fails (and Why)

Min-max becomes dangerous when the values have no clear boundaries or change over time.

In general RL:

  • If observations can become much larger or smaller along the way, min–max breaks down,
  • If the agent explores new states, beyond the initial range, values >1 or <0 appear –> destabilization.

In robotics:

  • If the robot gets into unexpected situations (e.g. vibrations, a faulty sensor, extremely bright rooms), the real range changes –> min–max produces incorrect values, resulting in chaotic behavior.

3. Outliers: The Silent Model Killers

A single very abnormal number can destroy normalization.

In general RL:

  • A spike in observations (e.g. sudden noise from the sensor) that ends up in xmin or xmax squeezes all normal values into a tiny band, so the entire vector looks almost identical –> the agent does not learn subtle differences.

In robotics:

  • LiDAR suddenly sees 999 m due to a reflection –> min-max “flattens” all other readings,
  • An IMU value has become corrupted –> all values become almost 0 after normalization.
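The flattening effect is easy to reproduce. In this sketch the readings are made up, and `min_max_fit_transform` is a hypothetical helper that computes the bounds from the data itself, which is exactly the situation where a spike does damage.

```python
def min_max_fit_transform(values):
    """Fit min/max on the data itself and scale it to [0, 1]."""
    x_min, x_max = min(values), max(values)
    return [(v - x_min) / (x_max - x_min) for v in values]


# Hypothetical LiDAR readings in meters, then the same readings plus one spike.
clean = [1.0, 2.0, 3.0, 4.0, 5.0]
with_spike = clean + [999.0]

print(min_max_fit_transform(clean))  # [0.0, 0.25, 0.5, 0.75, 1.0]

flattened = min_max_fit_transform(with_spike)
print([round(v, 4) for v in flattened])  # the five real readings all fall below 0.005
```

Fixed bounds from a specification, combined with clipping, avoid this because a single bad reading can no longer redefine the scale.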

4. Static Normalization vs Dynamic RL Environments

Min–max assumes that the world does NOT change. RL assumes that the world IS always changing.

In general RL:

  • If the range of observations changes over time, min-max starts to give values that do not match what the agent has seen before,
  • The agent “learns and then forgets”.

In robotics:

  • In the real world, the range of sensors seems constant, but in practice it changes (light, surface, vibrations, unexpected obstacles),
  • Static min-max can become “wrong min–max”.

5. Min–Max for Actions vs Observations

Observations and actions have very different roles.

States: Min–max is good if the range is stable.

Actions: It’s like having a car where the steering wheel can only turn between -1 and 1. If you normalize poorly, the car steers wrong.

In general RL:

  • Most continuous-control algorithms (SAC, DDPG, TD3) expect actions normalized to [−1, 1],
  • Min–max is appropriate here.

In robotics:

  • Motors, servos, torques have real physical limits,
  • Min–max can elegantly bring them into a uniform range for the agent.

PRACTICAL CHECKLIST FOR CHOOSING THE RIGHT NORMALIZATION

This checklist is short and useful:

  1. Are my values fixed in real life?
    If yes –> min–max is good.
  2. Are there outliers or weird spikes?
    If yes –> min–max is not safe.
  3. Will the agent visit new states later?
    If yes –> min–max can give values >1 or <0.
  4. Do I need stable training?
    RL needs –> consistent normalization.
  5. Do actions have physical limits?
    If yes –> normalize them correctly with min–max.
  6. Does the environment drift?
    If yes –> min–max becomes fragile.
  7. Do I use PPO/SAC?
    Then normalization is very important for gradient stability.

PRACTICAL EXPERIMENTS: TRAINING WITH vs. WITHOUT NORMALIZATION

LunarLanderContinuous is an excellent environment for demonstrating how normalization can help – but also how it can break a reinforcement learning agent when used incorrectly. It is part of Gymnasium (the maintained successor to OpenAI Gym) and is very easy to run.

Note: If you don’t have Gymnasium, SB3 and PyTorch installed, follow this tutorial: Tutorial: How to Install Stable-Baselines3 the Right Way (Windows & Linux): PyTorch + Gymnasium

The LunarLanderContinuous-v3 environment

LunarLanderContinuous has observations with extremely different scales:

  • coordinates (−1…+1)
  • velocities (−∞…+∞)
  • angles (−π…+π)
  • large angular velocities
  • binary contact signals

Because these signals live on different scales, you might think that Min-Max normalization is always helpful. Yes, it is helpful, but NOT in all cases.


Why Min-Max Normalization Can BREAK PPO on LunarLanderContinuous

In RL, exploration depends heavily on the natural differences between states.

When you apply Min-Max:

  • big differences between states get compressed into the range [0,1],
  • small signals become almost invisible,
  • the policy sees “flat”, overly-similar observations,
  • PPO receives a weaker learning signal,
  • exploration collapses,
  • the agent can get stuck in a bad policy early in training.

As a result:

  • Without normalization –> the agent learns very well,
  • With Min-Max normalization –> the agent can fail completely.

Exactly as shown in the training curves below:

  • reward increases smoothly without normalization,
  • while Min-Max normalization keeps the agent stuck at low reward.
The LunarLanderContinuous-v3 training WITHOUT and WITH normalization
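The compression effect described above can be put into numbers. This is a sketch of one plausible mechanism, not a verified diagnosis of the curves: the ±10 bound for a velocity component and the two sample velocities are assumptions, so check `env.observation_space.low` and `env.observation_space.high` in your Gymnasium version.

```python
def min_max(x, low, high):
    """Map x from [low, high] to [0, 1]."""
    return (x - low) / (high - low)


# Assumed declared bounds for one velocity component of the observation.
LOW, HIGH = -10.0, 10.0

# Two hypothetical vertical velocities the lander has to tell apart.
slow_descent = -0.2
fast_descent = -0.6

raw_gap = abs(fast_descent - slow_descent)
scaled_gap = abs(min_max(fast_descent, LOW, HIGH) - min_max(slow_descent, LOW, HIGH))

print(raw_gap)     # about 0.4 in raw units
print(scaled_gap)  # about 0.02 after scaling: 20x smaller
print(min_max(0.0, LOW, HIGH))  # 0.5: a zero-centered signal is shifted to 0.5
```

When the declared bounds are much wider than the values the agent actually visits, the differences that matter for landing shrink by the same factor, and inputs that used to be centered on 0 are now centered on 0.5.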

PPO + LunarLanderContinuous-v3 — Complete script with/without MinMax Normalization

This Python script (ppo_lunarlander.py) uses Stable Baselines3 for PPO and includes the optional Min-Max normalization wrapper.

"""
PPO Training and Demo Script for LunarLanderContinuous-v3
---------------------------------------------------------
Train or test a PPO agent using Stable Baselines3.
Supports training WITH or WITHOUT Min-Max Normalization.

RUN EXAMPLES:

# Without normalization (raw observations)
python ppo_lunarlander.py --train --lr 3e-4 --timesteps 300000

# With Min-Max normalization
python ppo_lunarlander.py --train --normalize --lr 3e-4 --timesteps 300000

# Demo
python ppo_lunarlander.py --demo --model <path>

Author: Calin Dragos George
Updated: 22 November 2025
"""

import argparse
import os
import time
import gymnasium as gym
import numpy as np
import torch

from stable_baselines3 import PPO
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.logger import configure


# ---------------------------------------------------------
# Utility: Set seeds for reproducibility
# ---------------------------------------------------------
def set_seed(seed):
    """Set random seeds for reproducibility."""
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)

    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


# ---------------------------------------------------------
# Min-Max Observation Normalization Wrapper
# ---------------------------------------------------------
class MinMaxObservationWrapper(gym.ObservationWrapper):
    """
    Normalize observations to the [0,1] range using:
    (x - min) / (max - min)
    based on the environment's observation_space bounds.
    """

    def __init__(self, env):
        super().__init__(env)
        self.low = env.observation_space.low
        self.high = env.observation_space.high

        # New normalized observation space is [0,1]
        self.observation_space = gym.spaces.Box(
            low=np.zeros_like(self.low),
            high=np.ones_like(self.high),
            dtype=np.float32,
        )

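    def observation(self, obs):
        # Min-Max normalization; the small epsilon avoids division by zero
        # when a dimension has identical low and high bounds.
        scaled = (obs - self.low) / (self.high - self.low + 1e-8)
        # Cast so the result matches the float32 dtype of the observation space.
        return scaled.astype(np.float32)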


# ---------------------------------------------------------
# Create LunarLanderContinuous Environment (with or without normalization)
# ---------------------------------------------------------
def make_env(normalize=False):
    """
    Create the LunarLanderContinuous-v3 environment.
    If normalize=True, apply Min-Max Observation Normalization.
    """
    env = gym.make("LunarLanderContinuous-v3")

    if normalize:
        env = MinMaxObservationWrapper(env)

    env = Monitor(env)
    return env


# ---------------------------------------------------------
# Create PPO Model
# ---------------------------------------------------------
def create_model(env, lr, log_dir):

    model = PPO(
        "MlpPolicy",
        env,
        learning_rate=lr,
        n_steps=2048,    
        batch_size=64,
        gamma=0.99,
        gae_lambda=0.95,
        ent_coef=0.01,
        clip_range=0.2,
        vf_coef=0.5,
        max_grad_norm=0.5,
        verbose=1,
        tensorboard_log=log_dir,
    )

    logger = configure(log_dir, ["tensorboard"])
    model.set_logger(logger)

    model.logger.record("hyperparams/learning_rate", lr)
    model.logger.record("hyperparams/n_steps", 2048)
    model.logger.record("hyperparams/ent_coef", 0.01)

    return model


# ---------------------------------------------------------
# Train PPO
# ---------------------------------------------------------
def train_ppo(lr, timesteps, seed, normalize):

    set_seed(seed)
    env = make_env(normalize=normalize)

    timestamp = time.strftime("%Y%m%d-%H%M%S")
    norm_flag = "norm" if normalize else "raw"

    log_dir = f"logs/PPO_Lunar_{norm_flag}_lr_{lr}_seed_{seed}_{timestamp}"
    os.makedirs(log_dir, exist_ok=True)

    print("\n Training PPO on LunarLanderContinuous-v3")
    print(f"→ Learning Rate: {lr}")
    print(f"→ Seed: {seed}")
    print(f"→ Normalize: {normalize}")
    print(f"→ Logging to: {log_dir}\n")

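    model = create_model(env, lr, log_dir)
    # Seed the environment and action space through SB3 as well,
    # so the seed value controls the whole run.
    model.set_random_seed(seed)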
    model.learn(total_timesteps=timesteps, progress_bar=True)

    model_filename = f"PPO_LunarLander_{norm_flag}_lr-{lr}_seed-{seed}_{timestamp}_{timesteps}steps.zip"
    model_path = os.path.join(log_dir, model_filename)
    model.save(model_path)

    print(f"\n Model saved to: {model_path}\n")

    env.close()


# ---------------------------------------------------------
# Demo PPO Model
# ---------------------------------------------------------
def run_demo(model_path, episodes=3, normalize=False):

    if not os.path.exists(model_path):
        print(f"\n Model not found: {model_path}\n")
        return

    print(f"\n Running Demo: {model_path}\n")

    # The demo must see observations on the same scale as training:
    # pass --normalize when the model was trained with --normalize.
    env = make_env(normalize=normalize)
    model = PPO.load(model_path)

    for ep in range(episodes):
        obs, _ = env.reset()
        done = False
        total_reward = 0

        while not done:
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, terminated, truncated, _ = env.step(action)
            total_reward += reward
            done = terminated or truncated

        print(f"Episode {ep+1} Reward = {total_reward}")

    env.close()


# ---------------------------------------------------------
# CLI with automatic multi-seed training
# ---------------------------------------------------------
if __name__ == "__main__":

    parser = argparse.ArgumentParser()
    parser.add_argument("--train", action="store_true")
    parser.add_argument("--demo", action="store_true")

    parser.add_argument("--lr", type=float, default=3e-4)
    parser.add_argument("--timesteps", type=int, default=300_000)

    parser.add_argument("--seed", type=int, default=None)
    parser.add_argument("--model", type=str, default=None)
    parser.add_argument("--episodes", type=int, default=3)

    parser.add_argument("--normalize", action="store_true",
                        help="Use Min-Max normalization on observations.")

    args = parser.parse_args()

    if args.train:

        if args.seed is None:
            seeds = [1, 2]
            print(f"\n Running PPO with automatic multi-seed training: {seeds}\n")

            for s in seeds:
                print(f"\n========== TRAINING SEED {s} ==========\n")
                train_ppo(args.lr, args.timesteps, s, args.normalize)

        else:
            train_ppo(args.lr, args.timesteps, args.seed, args.normalize)

    elif args.demo:
        if args.model is None:
            print("\n ERROR: missing --model path\n")
        else:
            run_demo(args.model, args.episodes, args.normalize)

    else:
        print("\n Please specify --train or --demo.\n")

How to Run It

1. Without normalization (RAW observations)

It will train PPO on raw observations, without the wrapper.

python ppo_lunarlander.py --train --lr 3e-4 --timesteps 300000

2. With Min-Max Normalization

It will apply the MinMaxObservationWrapper to normalize the observations to [0,1]. Everything else is identical to the previous run.

python ppo_lunarlander.py --train --normalize --lr 3e-4 --timesteps 300000

3. DEMO

python ppo_lunarlander.py --demo --model <path>

COMMON QUESTIONS DEVELOPERS ASK ABOUT MIN-MAX SCALING

1. Min-max or Z-score in RL?

Imagine you have two types of friends:

  • some very short,
  • some very tall.

If you want everyone to fit on a tiny bus, you “squeeze” them to the same size. You can do this in two ways:

1.1 Min-Max

It’s like taking the smallest and the tallest child in the class and placing everyone between 0 and 1. But if a giant 2-meter-tall child comes along, the whole bus breaks down – everyone else becomes too small.

1.2 Z-Score

This is like saying, “Let’s see how far you are from the class average.” It works better if you have giant or dwarf children, because it doesn’t make the others “invisible.”

The simple conclusion

  • Min-Max = good when the values are bounded and you don’t have extremes,
  • Z-Score = good when you have crazy, very high or very low values.

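Important: Stable-Baselines3 offers z-score style normalization (running mean and standard deviation) through its VecNormalize wrapper. It is not applied automatically; you have to wrap the environment with it yourself.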
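To make the z-score idea concrete, here is a simplified running normalizer in plain Python. It only illustrates the concept of running mean/std normalization with clipping; it is not the Stable-Baselines3 implementation, and the class name `RunningZScore` and the clip value of 10 are assumptions for this sketch.

```python
import math


class RunningZScore:
    """Track a running mean and variance, then standardize new values."""

    def __init__(self, clip=10.0, eps=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the mean
        self.clip = clip
        self.eps = eps

    def update(self, x):
        # Welford's online algorithm for mean and variance.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def normalize(self, x):
        variance = self.m2 / self.count if self.count > 0 else 1.0
        z = (x - self.mean) / math.sqrt(variance + self.eps)
        return max(-self.clip, min(self.clip, z))


scaler = RunningZScore()
for reading in [1.0, 2.0, 3.0, 4.0, 5.0]:
    scaler.update(reading)

print(scaler.mean)              # 3.0
print(scaler.normalize(3.0))    # 0.0 (a reading equal to the mean)
print(scaler.normalize(999.0))  # 10.0 (the spike is clipped to the limit)
```

Unlike min-max, the scale here comes from the typical spread of the data rather than from its two most extreme values.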


2. Do I need to normalize the actions as well?

Imagine your robot has two motors:

  • one can push hard from −100 to +100,
  • the other only from −2 to +2.

If you give both raw commands, the robot will think: “The big motor is IMPORTANT, the small motor doesn’t matter.”

What do you do?

You put all the motors on the same scale: normalized actions in [−1, +1].

This means that the robot learns correctly: “Both motors are important, they just have different strengths.”
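A minimal sketch of that idea for the two motors above: the policy outputs actions in [−1, +1], and a small helper maps them back to each motor's physical range before they are sent to the hardware. The names `MOTOR_LIMITS` and `to_physical` are hypothetical.

```python
# Physical command limits for the two motors in the example above.
MOTOR_LIMITS = [(-100.0, 100.0), (-2.0, 2.0)]


def to_physical(normalized_action):
    """Map each action from [-1, 1] to its motor's physical range."""
    commands = []
    for a, (low, high) in zip(normalized_action, MOTOR_LIMITS):
        a = max(-1.0, min(1.0, a))  # keep the policy output inside [-1, 1]
        commands.append(low + (a + 1.0) * (high - low) / 2.0)
    return commands


print(to_physical([0.5, 0.5]))   # [50.0, 1.0]: same relative effort on both motors
print(to_physical([-1.0, 1.0]))  # [-100.0, 2.0]
```

The agent only ever reasons in the [−1, +1] scale, so neither motor looks more important just because its raw numbers are bigger.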


About the author

About Dragos Calin

Dragos Calin is a robotics engineer and reinforcement learning practitioner focused on building real-world autonomous and remote-controlled robotics for agriculture, edge-AI robotics, and embedded platforms. His work joins simulation, machine learning, and hardware deployment, with a strong emphasis on practical, testable solutions that function outside the lab.

Areas of Expertise:

  • # Reinforcement Learning for Robotics
  • # Autonomous Agricultural Robots
  • # Embedded Systems & Edge AI (Jetson, Raspberry Pi, Arduino)
  • # Robotic Simulation & Sim2Real Workflow
  • # Sensor Fusion & Control Systems
  • # ROS-Based Robotics Development

© 2026 Reinforcement Learning Path