AI Robotics: Tutorials, Practical Reinforcement Learning, and Real-World Control

The Complete Guide to Learning Rate in RL

by Dragos Calin
in RL Fundamentals

When I tried to train my first Reinforcement Learning agent, the reward curve seemed to be going up. Then it started to oscillate. Then it crashed. I changed the algorithm. I changed the environment. I changed the number of seeds. I changed almost everything… except for one thing: the learning rate (LR).

The same story is repeated for thousands of reinforcement learning (RL) practitioners who are looking for answers to questions like:

  • Why is the reward crashing?
  • Why is DQN exploding?
  • Why is PPO not learning anything today, even though it was working yesterday?
  • Why is training taking forever?

The questions are different, but the answer is almost always the same: because the chosen Learning Rate does not match the environment.

The illustration highlights the three possible situations: LR too high, too low, or balanced.

In RL, a learning rate that is too high amplifies both the gradient noise and the unpredictability of the environment. Too small an LR turns learning into a slow and frustrating stagnation. And unlike supervised learning, where a bad LR only slows down progress, in RL a bad LR can completely destroy the policy.

That’s why I wrote this tutorial. Not to repeat the theory you find in books. But to show you for real, with graphs, with comparisons, with complete code, what it looks like:

  • a too small LR,
  • a too large LR,
  • and a balanced LR.

In a few minutes you will read about:

  • how the reward curves for Q-Learning, DQN and PPO change,
  • why PPO is much more sensitive to LR than you think,
  • which values are safe and which values are dangerous,
  • what divergence looks like in TensorBoard,
  • how to test the optimal LR quickly, without guesswork.

Everything is tested. Everything is visual. Everything is explained simply.

By the end of this tutorial, choosing your learning rate will no longer be a lottery. It will be a practical, predictable skill that you will rely on in every RL experiment you do.

TABLE OF CONTENTS

  • PREREQUISITE
  • INTRODUCTION
  • THE ROLE OF LEARNING RATE IN RL ALGORITHMS
  • WHAT IS LEARNING RATE
  • HOW LEARNING RATE WORKS IN GRADIENT-BASED RL
  • WHY DOES LEARNING RATE AFFECT THE FINAL REWARD (NOT JUST THE LOSS)?
  • Q-LEARNING PRACTICAL EXPERIMENT 1: The Effect of Learning Rate on Convergence
  • DQN PRACTICAL EXPERIMENT 2: How LR Affects Neural Network Stability
  • PPO PRACTICAL EXPERIMENT 3: Why PPO Is More Sensitive to LR Than DQN
  • WHAT IS COMMON TO THE LEARNING RATE IN ML, DL AND RL
  • MAJOR MISTAKES BEGINNERS MAKE
  • HOW TO QUICKLY TEST LR OPTIM (STEP BY STEP PROCEDURE)
  • CONCLUSION

PREREQUISITE


Before starting the experiments, it’s important to make sure you have the basic RL toolkit installed. You don’t need to be a deep learning expert or have years of experience. This tutorial is designed to be useful whether you’re just starting out or already have experience with RL workflows.

All you need is a working Python environment with PyTorch, Gymnasium, TensorBoard and Stable-Baselines3. These are the exact tools I used for every experiment and graph in this guide.

If you’re missing any of them, I’ve already created a complete installation tutorial that sets everything up the right way on both Windows and Linux: Tutorial: How to Install Stable-Baselines3 the Right Way (Windows & Linux): PyTorch + Gymnasium


INTRODUCTION

Learning Rate (LR) is a critical parameter in controlling the stability of all RL algorithms. And when it comes to learning, the stability of learning is everything.

Learning Rate (LR) is a critical parameter for stability.

If the LR value is poorly chosen, the policy becomes unstable. Whether we talk about DQN, PPO, Actor-Critic, or SAC, all these algorithms depend on the LR.

Also, when we check whether our agent is learning, one of the key indicators is the reward curve. This curve does not depend only on the chosen algorithm. It also depends very strongly on the LR, much more strongly than you might think.

Since I mentioned reward above, there is a strong connection between LR and reward noise.

  • Reward in RL is noisy, random, inconsistent,
  • The gradient is also noisy,
  • LR affects how you “amplify” or “filter” this noise,
  • LR too high –> you amplify chaos,
  • LR too low –> you learn almost nothing.

What is noise in Reinforcement Learning?

In RL there are two sources of noise:

1. Noise from the ENVIRONMENT

  • reward that varies randomly (stochastic rewards),
  • different transitions for the same action (stochastic transitions),
  • variations from exploration (ε-greedy, entropy, etc.),
  • different sequences of states –> irregular experience.

EXAMPLE: In CartPole, sometimes the agent receives a few extra “lucky” steps, other times it falls immediately.

2. Noise from the GRADIENT

This applies only to methods with neural networks.

  • gradients estimated on different minibatches,
  • high variance in policy gradient (PPO, A2C),
  • target networks that update periodically (DQN),
  • approximate Q-values –> error propagation.

RL is much more sensitive to LR than supervised learning. There is an essential difference between RL and ML:

  • In supervised learning, you have stable and “consistent” data,
  • In RL, you have unpredictable rewards and sequential dependence.
  • In supervised learning, a bad LR = slower training,
  • In RL, a bad LR = the policy is completely destroyed.

Plus, in RL there is also the phenomenon of delayed rewards which amplifies the inherent instability.

LR is one of the common elements between Q-Learning, DQN, PPO, Actor-Critic.

  • Q-Learning has α (tabular learning rate),
  • DQN has LR for the optimizer,
  • PPO has an LR for the policy and sometimes for the value function,
  • Actor-Critic as well.

Although these algorithms are completely different in mechanism, they all have an LR. It is one of the few parameters shared across very different AI algorithms.

If you understand the Learning Rate, you already have an advantage in training RL agents, no matter how complex they are.


THE ROLE OF LEARNING RATE IN RL ALGORITHMS

In this section we will see where exactly the LR appears in each of the algorithms chosen as examples for this tutorial.

Where does LR appear in Q-Learning (α)

In tabular Q-Learning, the Learning Rate is called α (alpha) and it appears directly in the update formula:

    \[ \hspace{5mm} \fbox{     \begin{array}{c}         \vspace{5mm} \\         \displaystyle          Q(s, a) \leftarrow Q(s, a) + \alpha \cdot \left[ r + \gamma \cdot \max_{a'} Q(s', a') - Q(s, a) \right] \\         \vspace{5mm}     \end{array} } \hspace{5mm} \]

Where:

  • α controls how much we change the current Q-value,
  • it is the factor that decides “how much the agent learns” from the current observation,
  • it has no “gradients” because tabular Q-Learning does NOT use neural networks.

In other words:

  • α is the tabular form of the “learning rate”,
  • it appears directly in the Q-Learning update rule, which is derived from the Bellman equation,
  • if α is large –> Q-values jump chaotically,
  • if α is small –> the agent learns slowly.
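
To make the role of α concrete, here is a minimal sketch of the update formula above in plain Python. The function name `q_update`, the dictionary-based Q-table, and the example numbers are assumptions chosen for illustration; this is not the code used for the experiments in this tutorial.

```python
from collections import defaultdict


def q_update(Q, state, action, reward, next_state, n_actions, alpha, gamma=0.99):
    """Apply one tabular Q-Learning update in place and return the TD error."""
    best_next = max(Q[(next_state, a)] for a in range(n_actions))
    td_error = reward + gamma * best_next - Q[(state, action)]
    Q[(state, action)] += alpha * td_error  # alpha scales the correction
    return td_error


# The same transition, applied with a small and a large alpha
for alpha in (0.1, 0.9):
    Q = defaultdict(float)   # every Q-value starts at 0.0
    Q[("s1", 0)] = 10.0      # hypothetical: the next state already looks valuable
    q_update(Q, "s0", 1, reward=-1.0, next_state="s1", n_actions=2, alpha=alpha)
    print(f"alpha={alpha}: Q(s0, a1) = {Q[('s0', 1)]:.2f}")
```

The TD error is identical in both cases. Only the fraction of it that gets written into the table changes, which is exactly what α controls.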

Where does LR appear in DQN

In DQN, we no longer have a Q-table. We have a neural network that approximates Q-values.

The update is no longer done with the Bellman formula directly, but with gradient descent on a loss:

     \[ \hspace{5mm} \fbox{     \begin{array}{c}         \vspace{5mm} \\         \displaystyle          \text{loss} = \left( Q_{\theta}(s, a) - \text{target} \right)^2 \\         \vspace{5mm}     \end{array} } \hspace{5mm} \]

Here, the Learning Rate is:

  • lr from the optimizer (Adam, RMSProp, etc.),
  • it decides how much you update the network weights at each batch,
  • it is exactly the same concept as in normal deep learning.

To highlight the role of LR in DQN, here is a short summary:

  • in DQN the lr is a parameter of the optimizer,
  • it controls the steps of gradient descent,
  • gradient noise + replay buffer –> DQN is very sensitive to LR,
  • LR too large = divergence,
  • LR too small = very slow training.
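
The sketch below shows where the learning rate enters a gradient step on the squared loss above. It is a deliberately simplified illustration: a linear Q-function and plain gradient descent stand in for the neural network and the Adam optimizer of a real DQN, and the features, the target value and the function names are all hypothetical.

```python
def q_value(weights, features):
    """Linear Q-value: dot product of weights and state-action features."""
    return sum(w * x for w, x in zip(weights, features))


def sgd_step(weights, features, target, lr):
    """One gradient-descent step on loss = (Q_theta(s, a) - target)^2."""
    error = q_value(weights, features) - target
    # d(loss)/d(w_i) = 2 * error * x_i; the target is treated as a constant
    return [w - lr * 2.0 * error * x for w, x in zip(weights, features)]


def run_steps(lr, features, target, steps=20):
    """Start from zero weights, apply `steps` updates, return the final Q-value."""
    weights = [0.0] * len(features)
    for _ in range(steps):
        weights = sgd_step(weights, features, target, lr)
    return q_value(weights, features)


features = [1.0, 2.0]  # hypothetical features of one (s, a) pair
target = 5.0           # stands in for r + gamma * max_a' Q_target(s', a')

for lr in (0.01, 0.1, 0.3):
    print(f"lr={lr}: Q(s, a) after 20 steps = {run_steps(lr, features, target):.3f}")
```

With these numbers, the smallest value is still approaching the target after 20 steps, the middle one reaches it, and the largest one overshoots further on every step. Adaptive optimizers such as Adam change the exact thresholds, but the three regimes remain.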

Where does LR appear in PPO

PPO is a policy gradient algorithm. In its usual actor-critic form, that means we have two networks (or two heads on a shared network):

  • policy network,
  • value network.

Both are trained with gradient descent, so the optimizer’s learning rate applies to both. In PPO, LR affects:

  • how fast the policy changes,
  • how fast the value function adjusts.

More importantly, PPO is much more sensitive to LR than DQN because of the clipping objective:

     \[ \hspace{5mm} \fbox{     \begin{array}{c}         \vspace{5mm} \\         \displaystyle          L^{\text{CLIP}}(\theta) \\         \vspace{5mm}     \end{array} } \hspace{5mm} \]

Info: I will cover this clipping concept in PPO in another tutorial.

If LR makes the update too aggressive, the ratio between policies jumps over the allowed limit and results in policy collapse.

In conclusion:

  • PPO uses lr for the policy and sometimes a separate LR for the value net,
  • LR controls “how much” the agent changes the policy after each batch,
  • LR too large = exceeds the clip range and violates the surrogate objective,
  • PPO is one of the most sensitive algorithms to LR.
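
As a rough illustration of the ratio and the clip range mentioned above, here is a per-sample sketch of the standard clipped surrogate, min(r·A, clip(r, 1−ε, 1+ε)·A), where r is the probability ratio between the new and the old policy and A is the advantage. The function name and the probabilities are made up for the example; a real PPO implementation averages this over a batch and maximizes it by gradient ascent.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantage, clip_range=0.2):
    """Return (ratio, objective) for one sample of the PPO clipped objective."""
    ratio = math.exp(logp_new - logp_old)  # r = pi_new(a|s) / pi_old(a|s)
    clipped = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    return ratio, min(ratio * advantage, clipped * advantage)


logp_old = math.log(0.25)   # hypothetical old probability of the chosen action
for p_new in (0.27, 0.40):  # a small and a large policy change after an update
    ratio, objective = clipped_surrogate(math.log(p_new), logp_old, advantage=1.0)
    print(f"p_new={p_new}: ratio={ratio:.2f}, objective={objective:.2f}")
```

Note that clipping only removes the incentive to move the ratio further once it is outside the range. It does not physically stop a large gradient step from pushing the ratio there, which is why an aggressive LR can still break the policy.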

Why does the name differ (α vs η)?

In RL, the Learning Rate appears under different names only for historical and conceptual reasons, NOT because they are totally different things.

In tabular Q-Learning –> α (alpha):

  • algebraic concept, directly in the Bellman formula,
  • it is a learning constant per update.

In gradient descent / neural networks –> η (eta) or lr:

  • it is the step size of the gradient update,
  • it comes from numerical optimization (calculus).

The name difference reflects:

  • the different way learning happens: directly on the Q-table vs on weights,
  • the domain from which the formulas come: tabular RL vs deep learning.

But the concept is the same: how much you change what you currently know based on the new information.

In other words:

  • α = tabular LR (used in the Bellman equation),
  • η = LR in gradient descent,
  • different implementations, identical concept,
  • that is why we can talk about “LR” as being a common element across all RL algorithms.

WHAT IS LEARNING RATE

In this section, I’ll explain the learning rate and build an intuition for how it controls the speed and stability of learning.

Learning Rate is a single number that controls how big the learning steps are during training.

Every update – whether it comes from a Q-value difference, a gradient descent step, or a policy improvement – is multiplied by this value.

At its core, the learning rate answers a simple question:

“How much should the agent change its knowledge after seeing new information?”

A large learning rate means big steps.

A small learning rate means small steps.

And the right learning rate means stable and efficient learning.

Mathematical intuition (general, not RL-specific)

The general form of a learning update looks like this:

     \[ \hspace{5mm} \fbox{     \begin{array}{c}         \vspace{5mm} \\         \displaystyle          \theta \leftarrow \theta - \eta \nabla L(\theta) \\         \vspace{5mm}     \end{array} } \hspace{5mm} \]

Where:

  • θ are the parameters (weights, values, or policies),
  • η (eta) is the learning rate,
  • ∇L(θ) is the gradient or signal for how to change θ.

The learning rate simply scales the update. It does not change the direction of learning, only the amount.

Visual intuition

The curves are valid for ALL algorithms based on GRADIENT / CONTINUOUS OPTIMIZATION: DQN, PPO, SAC, TD3, Actor–Critic / A2C

Even without specific algorithms, every learning system tends to produce one of three curves:

1. LR too small –> slow, smooth curve

  • stable,
  • safe,
  • but progresses painfully slow.

2. LR too large –> chaotic, unstable curve

  • oscillations,
  • divergence,
  • unpredictable updates.

3. LR optimal –> fast and stable curve

  • converges quickly,
  • consistent,
  • predictable.

These three shapes are universal: they appear in deep learning, Q-Learning, DQN, PPO, and every other RL algorithm.
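
The three shapes can be reproduced on a toy problem with a few lines of Python. The sketch below minimizes L(θ) = θ² with the general update θ ← θ − η∇L(θ); the loss function, the starting point and the three η values are assumptions picked to make the regimes visible, not values recommended for RL.

```python
def gradient_descent(lr, theta=1.0, steps=25):
    """Minimize L(theta) = theta**2 and return the list of visited theta values."""
    history = [theta]
    for _ in range(steps):
        grad = 2.0 * theta         # dL/dtheta for L = theta^2
        theta = theta - lr * grad  # the learning rate scales the step
        history.append(theta)
    return history


for label, lr in (("too small", 0.01), ("too large", 1.1), ("balanced", 0.4)):
    final = gradient_descent(lr)[-1]
    print(f"{label:>9} (lr={lr}): |theta| after 25 steps = {abs(final):.4g}")
```

On this problem each step multiplies θ by (1 − 2η), so a small η shrinks θ very slowly, a balanced η shrinks it quickly, and an η above 1 makes every step overshoot more than the previous one.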


HOW LEARNING RATE WORKS IN GRADIENT-BASED RL

RL ≠ ML

In RL the gradient comes from the reward, which is noisy:

  • random,
  • inconsistent,
  • trajectory-dependent,
  • dependent on previous policies.

In supervised machine learning (ML):

  • the gradient comes from the loss between prediction and ground truth,
  • the data is fixed, clean, consistent,
  • the gradient has low variance.

This explains why the RL gradient has much higher variance than the ML gradient. Because of this, LR has a much bigger effect on stability.

RL amplifies noise. Wrong LR = divergence

Here I want to explain an essential concept, namely that RL amplifies noise because:

  • reward noise –> affects gradient updates,
  • policy noise (exploration) –> produces different experiences,
  • environment noise –> produces unpredictable transitions,
  • bootstrap errors –> propagate over time (Q-learning, DQN).

In supervised learning, noise affects only a single batch. In RL, noise affects the entire policy for tens or hundreds of steps. Thus, if LR is too large, the multiplied noise leads to total divergence. If LR is too small, the noise is dampened, but the agent learns extremely slowly.
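
A small simulation makes the “amplify or filter” effect visible. The sketch below tracks a noisy signal with the update estimate ← estimate + α·(sample − estimate), which has the same structure as a Q-value update. The Gaussian noise model, the sample count and the function name are assumptions for illustration only.

```python
import random
import statistics


def track_noisy_signal(alpha, n_samples=5000, true_value=1.0, noise_std=1.0, seed=0):
    """Follow a noisy signal with: estimate += alpha * (sample - estimate)."""
    rng = random.Random(seed)
    estimate = 0.0
    trace = []
    for _ in range(n_samples):
        sample = true_value + rng.gauss(0.0, noise_std)  # noisy "reward"
        estimate += alpha * (sample - estimate)
        trace.append(estimate)
    return trace


for alpha in (0.01, 0.1, 0.9):
    tail = track_noisy_signal(alpha)[1000:]  # skip the warm-up phase
    print(f"alpha={alpha}: mean={statistics.mean(tail):.2f}, "
          f"std={statistics.pstdev(tail):.2f}")
```

All three settings hover around the true value, but the spread of the estimate grows with α: a large step size passes most of the noise straight through, while a small one averages it out at the cost of a longer warm-up.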


WHY DOES LEARNING RATE AFFECT THE FINAL REWARD (NOT JUST THE LOSS)?

In reinforcement learning, the learning rate does not only influence the loss function. It directly controls the reward stability. A wrong LR value will destabilize the reward curve long before it affects the loss.

Unlike supervised learning, where updates are independent, RL learns from sequences of states, actions, and rewards. This means that any instability introduced by an aggressive learning rate propagates from one step to the next.

A single unstable update changes the policy, which changes the trajectory, which changes the rewards, which changes the next update. It creates a chain reaction. This explains why many beginners ask the same question: “why is my reward unstable?”

Learning rate acts as the main stability control knob in RL.

  • If LR is too high, it amplifies noise from the environment and noise from the gradient, causing the policy to jump unpredictably,
  • If LR is too low, the agent filters too much of the signal and learns extremely slowly.

RL contains two independent sources of noise, which makes LR even more critical:

  • Environment noise (stochastic rewards, random transitions, exploration randomness),
  • Gradient noise (policy gradient variance, replay buffer sampling noise, bootstrap errors).

This is why the learning rate affects not only the loss, but also the final reward, reward stability, and ultimately the success of the entire RL training process.


Q-LEARNING PRACTICAL EXPERIMENT 1: The Effect of Learning Rate on Convergence

I ran the same experiment three times, changing only α: 0.1, 0.5, 0.9.

All other parameters were identical:

  • the same epsilon decay schedule,
  • discount factor γ 0.99,
  • fixed discretization,
  • 15,000 episodes.
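
Tabular Q-Learning needs discrete states, so the continuous MountainCar observation (position, velocity) has to be mapped to grid cells first. Here is one possible way to do a fixed discretization. The observation bounds are those of MountainCar-v0, but the 20 x 20 grid and the helper name `discretize` are assumptions; the resolution used in the experiment is not stated here.

```python
# Observation bounds of MountainCar-v0: (position, velocity)
LOW = (-1.2, -0.07)
HIGH = (0.6, 0.07)
N_BINS = (20, 20)  # assumed grid resolution, not taken from the experiment


def discretize(observation, low=LOW, high=HIGH, n_bins=N_BINS):
    """Map a continuous (position, velocity) pair to a tuple of bin indices."""
    indices = []
    for value, lo, hi, bins in zip(observation, low, high, n_bins):
        fraction = (value - lo) / (hi - lo)  # 0.0 at the lower bound, 1.0 at the upper
        index = int(fraction * bins)
        indices.append(min(max(index, 0), bins - 1))  # clamp so the edges stay in range
    return tuple(indices)


print(discretize((-0.5, 0.0)))  # a typical starting state
print(discretize((0.6, 0.07)))  # upper corner of the observation space
```

The returned tuple can be used directly as a key (together with the action) in a Q-table. A finer grid gives more precise values but more cells to fill, which makes the slow propagation of Q-values even more noticeable.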

There are a few conclusions we can draw from these runs. MountainCar is a learning rate sensitive environment when using tabular Q-Learning. This is due to the fact that:

  • the reward is uninformative (−1 at every step until the goal is reached),
  • you have to accumulate Q-values over long paths,
  • there are many “useless” states,
  • initial trajectories are chaotic,
  • Q-values propagate slowly in the discretized space.

Think of each Q-value as a grade. The learning rate (α) tells you how much today’s result changes the old grade.

if α = 0.1 (small)

  • you change the grades little by little,
  • it’s like writing with a pencil very lightly, so you don’t make mistakes,
  • you learn slowly, but you learn steadily.

if α = 0.5 (high)

  • you change your grades a lot every episode,
  • you learn quickly… but you can make mistakes quickly,
  • if the rewards are different from one episode to another, the values jump chaotically.

if α = 0.9 (very high)

  • you erase everything you knew and write almost only what happened “now”,
  • you learn super fast, but you fool yourself very quickly,
  • result: the graph remains stuck, because the agent fails to retain what it learns.

The learning curves for the three runs look like the image below:

Learning rate for Q-Learning with three different values: 0.1, 0.5, 0.9

In the graph above:

  • the orange curve (α = 0.1) grows nicely, slowly and steadily,
  • the blue and gray curves (α = 0.5 and 0.9) almost do NOT grow at all,
  • the reward remains at very low values (−200), meaning the agent has not learned anything.

Why does the agent learn almost nothing at α = 0.5 and 0.9?

Because when you try to learn something, you don’t change the rules every minute. If you change your mind too quickly, your brain gets confused. That’s exactly what the RL agent does with α = 0.5 and α = 0.9.

CONCLUSION

  • Q-Learning directly updates Q-values, without neural networks,
  • When α is too large, the updates are crude and destroy learning,
  • When α is small, Q-values grow slowly but remain stable,
  • Therefore α controls convergence much more directly than the learning rate in DQN or PPO.

DQN PRACTICAL EXPERIMENT 2: How LR Affects Neural Network Stability

As in the Q-Learning experiment, I ran the MountainCar training three times, changing only the learning rate.

  • Low learning rate: 1e-5
  • Medium LR: 5e-4
  • High LR: 5e-3

All other parameters were identical:

  • buffer_size 50000
  • learning_starts 1000
  • batch_size 64
  • gamma 0.99
  • timesteps 250000

Imagine that DQN is a child learning to push a toy car up a hill. The learning rate tells us how much we change the child’s brain after each attempt.

if LR is small – 0.00001 (1e-5)

This is the only setting that learns.

  • Every time the child makes a mistake or succeeds, the algorithm adjusts the brain in very small steps, like a screw being slowly tightened,
  • At first it seems like nothing is happening, but after many small attempts, the child begins to understand: “This is how I need to accelerate to go up the hill,”
  • The result: the average reward begins to increase and the car goes higher and higher.

Here, the small steps are stable enough not to destroy everything the agent has learned so far.

if LR is medium – 0.0005 (5e-4)

Sounds reasonable, but in our experiment it doesn’t help at all.

  • We try to change the child’s brain in larger steps,
  • Because of the noise in RL (variable reward, different transitions, different experiences), these changes become too harsh,
  • What happens? Every time it learns something small, the next update breaks what it just learned.

The result: the reward stays low, almost straight-line. The agent fails to get out of the “constant failure” state.

if LR is large – 0.005 (5e-3)

Here the steps are even larger.

  • The agent tries to learn “by force”,
  • The gradient is very noisy, and large updates make the neural network unstable,
  • Instead of getting closer to a solution, the agent keeps breaking and rewriting what it knows,
  • Result: the average reward is still a flat line almost identical to LR = 0.0005. The agent does not progress.

In the graph below we can see how the average reward per episode changes during training for the three LR values.

Learning rate for DQN with three different values: 0.00001, 0.005, 0.0005

Red line (LR = 0.00001 / 1e-5)

  • It is low at the beginning (around −200), like the others,
  • After ~100k timesteps a slow climb begins,
  • As time passes, the reward increases more and more –> the agent actually learns to play MountainCar,
  • The curve has a clear upward shape towards the end.

Line for LR = 0.0005 (5e-4)

  • It remains almost perfectly flat, around −200,
  • We see no clear upward trend,
  • This means that the agent continues to fail. It fails to improve the strategy.

Line for LR = 0.005 (5e-3)

  • It is almost overlapping with the line for 0.0005,
  • Visually, the two appear to be a single line,
  • The message is clear: with these two LRs, the DQN agent does not learn at all, regardless of whether the LR is “medium” or “high”.

Small steps, repeated many times, manage to filter out the noise and extract a useful signal.

DQN is very sensitive to the learning rate. Unlike tabular Q-Learning, where a moderate LR can be OK, in DQN the LR values must be chosen much more conservatively, because:

  • the gradient is estimated from noisy data,
  • deep neural networks are used,
  • the bootstrap (target Q) also introduces errors.

I’ve included the full Python code for DQN below, so you can run all the experiments yourself and see how the learning rate changes the results in practice.

"""
DQN Training and Demo Script for MountainCar-v0
------------------------------------------------
Train or test a Deep Q-Learning agent using Stable Baselines3 (SB3).
TRAINING:
    python dqn_mountaincar.py --train --lr 1e-5 --timesteps 250000    -> low
    python dqn_mountaincar.py --train --lr 5e-4 --timesteps 250000    -> medium
    python dqn_mountaincar.py --train --lr 5e-3 --timesteps 250000    -> high

DEMO:
    python dqn_mountaincar.py --demo --model DQN_MountainCar_lr-0.0001_seed-3_20250114-154530_250000steps.zip

Author: Calin Dragos George
Created: 15 November 2025
"""
import argparse
import os
import time
import gymnasium as gym
import numpy as np
import torch

from stable_baselines3 import DQN
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.logger import configure


# ---------------------------------------------------------
# Utility: Set seeds for full reproducibility
# ---------------------------------------------------------
def set_seed(seed):
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)

    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


# ---------------------------------------------------------
# Environment Creator
# ---------------------------------------------------------
def make_env():
    env = gym.make("MountainCar-v0")
    env = Monitor(env)
    return env


# ---------------------------------------------------------
# Create DQN Model with dynamic learning rate
# ---------------------------------------------------------
def create_model(env, lr, log_dir):

    model = DQN(
        "MlpPolicy",
        env,
        learning_rate=lr,
        buffer_size=50000,
        learning_starts=1000,
        batch_size=64,
        tau=1.0,
        gamma=0.99,
        train_freq=1,
        gradient_steps=1,
        target_update_interval=1000,
        exploration_fraction=0.3,
        exploration_final_eps=0.05,
        verbose=1,
        tensorboard_log=log_dir,
    )

    # Custom logger (TensorBoard)
    logger = configure(log_dir, ["tensorboard"])
    model.set_logger(logger)

    # Log hyperparameters
    model.logger.record("hyperparams/learning_rate", lr)
    model.logger.record("hyperparams/gamma", 0.99)
    model.logger.record("hyperparams/exploration_fraction", 0.3)

    return model


# ---------------------------------------------------------
# Train DQN
# ---------------------------------------------------------
def train_dqn(lr, timesteps, seed):

    set_seed(seed)

    timestamp = time.strftime("%Y%m%d-%H%M%S")

    # log directory unique for each run
    log_dir = f"DQN/logs/DQN_lr_{lr}_seed_{seed}_{timestamp}"
    os.makedirs(log_dir, exist_ok=True)

    env = make_env()

    print("\n Training DQN on MountainCar-v0")
    print(f" Learning Rate: {lr}")
    print(f" Seed: {seed}")
    print(f" Logging to: {log_dir}\n")

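    model = create_model(env, lr, log_dir)
    # set_seed() above covers NumPy and PyTorch; this also seeds the action space and the env
    model.set_random_seed(seed)
    model.learn(total_timesteps=timesteps, progress_bar=True)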

    # build model filename
    model_filename = f"DQN_MountainCar_lr-{lr}_seed-{seed}_{timestamp}_{timesteps}steps.zip"
    model_path = os.path.join(log_dir, model_filename)

    # save model
    model.save(model_path)

    print(f"\n Model saved to: {model_path}\n")

    env.close()


# ---------------------------------------------------------
# Demo using a chosen model file
# ---------------------------------------------------------
def run_demo(model_path, episodes=3):

    if not os.path.exists(model_path):
        print(f"\n ERROR: Model file not found: {model_path}\n")
        return

    print(f"\n Running Demo with model: {model_path}\n")

    env = make_env()
    model = DQN.load(model_path)

    for ep in range(episodes):
        obs, _ = env.reset()
        done = False
        total_reward = 0

        while not done:
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, terminated, truncated, _ = env.step(action)
            total_reward += reward
            done = terminated or truncated

        print(f"Episode {ep+1} reward: {total_reward}")

    env.close()


# ---------------------------------------------------------
# CLI
# ---------------------------------------------------------
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--train", action="store_true")
    parser.add_argument("--demo", action="store_true")

    parser.add_argument("--lr", type=float, default=1e-4)
    parser.add_argument("--timesteps", type=int, default=200_000)
    parser.add_argument("--seed", type=int, default=None)   

    parser.add_argument("--model", type=str, default=None)
    parser.add_argument("--episodes", type=int, default=3)

    args = parser.parse_args()

    # ---------------------------------------------------------
    # MULTI-SEED TRAINING LOGIC
    # ---------------------------------------------------------
    if args.train:

        if args.seed is None:
            # Run 2 seeds automatically (extend the list to run more)
            seeds = [1, 2]
            print(f"\nRunning automatic multi-seed training: {seeds}\n")

            for s in seeds:
                print(f"\n========== RUNNING SEED {s} ==========\n")
                train_dqn(args.lr, args.timesteps, s)

        else:
            # Run just one seed
            train_dqn(args.lr, args.timesteps, args.seed)

    elif args.demo:
        if args.model is None:
            print("\n ERROR: You must specify a model with --model path\n")
        else:
            run_demo(args.model, args.episodes)

    else:
        print("\n Please specify either --train or --demo.\n")


PPO PRACTICAL EXPERIMENT 3: Why PPO Is More Sensitive to LR Than DQN

As in the other two experiments, I ran the MountainCar training three times, changing only the learning rate.

  • Low learning rate: 1e-5
  • Medium LR: 3e-4
  • High LR: 1e-3

All other parameters were identical:

  • n_steps 512         
  • batch_size 64
  • gamma 0.99
  • gae_lambda 0.95
  • ent_coef 0.01         
  • clip_range 0.2

In PPO, the agent learns through trial and error, but it has a special mechanism called “clipping” that protects it from taking too big steps. However, if the learning rate is too high or too low, clipping cannot save it.

Imagine that PPO is a child trying to climb a hill on an electric bike, and the learning rate controls how much “current” the bike sends to the wheel:

if LR is large – 0.001 (1e-3)

Fastest learning, but with slight instability.

  • The agent receives strong impulses and quickly learns how to accelerate and reach the top,
  • The reward increases very early (approx. after 70k timesteps),
  • It stabilizes in a very good area: ~–130 to –140,
  • It has small oscillations, because large steps can destabilize the policy a little, but PPO clipping keeps it within limits.

Think of a child who presses the accelerator quite hard, goes fast, but sometimes staggers.

If LR is medium – 0.0003 (3e-4)

Learns well, but slower and more stable than high LR.

  • The agent starts slowly, but after ~150k timesteps the reward starts to grow nicely,
  • Later it climbs continuously and reaches a level similar to high LR, but stronger and more stable,
  • The curve is smooth, without big oscillations.

It is as if the child starts slowly, but without risks. In the end it climbs the hill almost as well, just later.

If LR is small – 0.00001 (1e-5)

It learns almost nothing.

  • The reward stays around –200 all the time,
  • The agent takes such small steps in updating the policy that, in practice, it changes almost nothing,
  • PPO clipping has nothing to protect against, because the steps are too small to matter.

It’s like a child riding a bike without power to the wheel. He can’t go up at all.

The graph below shows us how well the PPO agent is doing based on learning rate:

Learning rate for PPO with three different values: 0.001, 0.0003, 0.00001

Green line – LR = 0.001 (1e-3)

  • It climbs quickly after 60–80k timesteps,
  • It stabilizes around –130 at the end,
  • It is the fastest curve –> the agent learns quickly,
  • It has small oscillations –> a sign that large steps make the PPO a little unstable.

Red line – LR = 0.0003 (3e-4)

  • It climbs slower, but is stable and constant,
  • It starts to grow seriously after ~150k timesteps,
  • Towards the end it reaches almost the same place as the green line,
  • It is “cleaner” and more stable.

Blue line – LR = 0.00001 (1e-5)

  • It is perfectly flat,
  • The reward does not change at all,
  • The agent gets stuck at –200 (fail),
  • PPO cannot learn with such a small LR.

PPO is more tolerant of high LR values than DQN, because:

  • it uses policy gradient,
  • it has clipping,
  • advantages are normalized,
  • updates are made in regular mini-batches.

But even so, PPO cannot learn with a learning rate that is too low. The reward remains completely flat.
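The advantage normalization mentioned in the list above can be sketched in a few lines of NumPy. This is an illustration under assumptions, not SB3's internal code: the function name `normalize_advantages`, the `eps` value, and the sample numbers are all made up for the example.

```python
import numpy as np

def normalize_advantages(advantages, eps=1e-8):
    """Rescale advantages to zero mean and unit standard deviation."""
    advantages = np.asarray(advantages, dtype=float)
    # eps avoids division by zero when all advantages are identical
    return (advantages - advantages.mean()) / (advantages.std() + eps)

# Hypothetical raw advantages with very different magnitudes
raw = np.array([-50.0, -10.0, 0.0, 20.0, 90.0])
normalized = normalize_advantages(raw)
print(normalized)
```

Because the normalized advantages have a similar scale from one update to the next, the same learning rate produces updates of comparable size, which helps explain PPO's tolerance of higher LR values.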

I’ve included the full Python code for PPO below, so you can run all the experiments yourself and see how the learning rate changes in practice.

"""
PPO Training and Demo Script for MountainCar-v0
-----------------------------------------------
Train or test a PPO agent using Stable Baselines3.

TRAINING EXAMPLES:
    python ppo_mountaincar.py --train --lr 1e-5 --timesteps 250000    -> low
    python ppo_mountaincar.py --train --lr 3e-4 --timesteps 250000    -> medium
    python ppo_mountaincar.py --train --lr 1e-3 --timesteps 250000    -> high

DEMO:
    python ppo_mountaincar.py --demo --model PPO_MountainCar_lr-0.0001_seed-3_xxx.zip

Author: Calin Dragos George
Created: 15 November 2025
"""

import argparse
import os
import time
import gymnasium as gym
import numpy as np
import torch

from stable_baselines3 import PPO
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.logger import configure


# ---------------------------------------------------------
# Utility: Set seeds for reproducibility
# ---------------------------------------------------------
def set_seed(seed):
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)

    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


# ---------------------------------------------------------
# Create MountainCar Environment
# ---------------------------------------------------------
def make_env():
    env = gym.make("MountainCar-v0")
    env = Monitor(env)
    return env


# ---------------------------------------------------------
# Create PPO Model (customized for MountainCar)
# ---------------------------------------------------------
def create_model(env, lr, log_dir):

    model = PPO(
        "MlpPolicy",
        env,
        learning_rate=lr,
        n_steps=512,            
        batch_size=64,
        gamma=0.99,
        gae_lambda=0.95,
        ent_coef=0.01,          
        clip_range=0.2,
        vf_coef=0.5,
        max_grad_norm=0.5,
        verbose=1,
        tensorboard_log=log_dir,
    )

    logger = configure(log_dir, ["tensorboard"])
    model.set_logger(logger)

    model.logger.record("hyperparams/learning_rate", lr)
    model.logger.record("hyperparams/n_steps", 512)
    model.logger.record("hyperparams/ent_coef", 0.01)

    return model


# ---------------------------------------------------------
# Train PPO
# ---------------------------------------------------------
def train_ppo(lr, timesteps, seed):

    set_seed(seed)
    env = make_env()

    timestamp = time.strftime("%Y%m%d-%H%M%S")
    log_dir = f"PPO/logs/PPO_lr_{lr}_seed_{seed}_{timestamp}"
    os.makedirs(log_dir, exist_ok=True)

    print("\n Training PPO on MountainCar-v0")
    print(f"→ Learning Rate: {lr}")
    print(f"→ Seed: {seed}")
    print(f"→ Logging to: {log_dir}\n")

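    model = create_model(env, lr, log_dir)
    # set_seed() only covers NumPy and PyTorch; this also seeds the environment and action space
    model.set_random_seed(seed)
    model.learn(total_timesteps=timesteps, progress_bar=True)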

    model_filename = f"PPO_MountainCar_lr-{lr}_seed-{seed}_{timestamp}_{timesteps}steps.zip"
    model_path = os.path.join(log_dir, model_filename)
    model.save(model_path)

    print(f"\n Model saved to: {model_path}\n")

    env.close()


# ---------------------------------------------------------
# Demo PPO Model
# ---------------------------------------------------------
def run_demo(model_path, episodes=3):

    if not os.path.exists(model_path):
        print(f"\n Model not found: {model_path}\n")
        return

    print(f"\n Running Demo: {model_path}\n")

    env = make_env()
    model = PPO.load(model_path)

    for ep in range(episodes):
        obs, _ = env.reset()
        done = False
        total_reward = 0

        while not done:
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, terminated, truncated, _ = env.step(action)
            total_reward += reward
            done = terminated or truncated

        print(f"Episode {ep+1} Reward = {total_reward}")

    env.close()

# ---------------------------------------------------------
# CLI with automatic multi-seed training
# ---------------------------------------------------------
if __name__ == "__main__":

    parser = argparse.ArgumentParser()
    parser.add_argument("--train", action="store_true")
    parser.add_argument("--demo", action="store_true")

    parser.add_argument("--lr", type=float, default=3e-4)
    parser.add_argument("--timesteps", type=int, default=300_000)

    # IMPORTANT: None => run the default seed list automatically
    parser.add_argument("--seed", type=int, default=None)

    parser.add_argument("--model", type=str, default=None)
    parser.add_argument("--episodes", type=int, default=3)

    args = parser.parse_args()

    # ---------------------------------------------------------
    # MULTI-SEED LOGIC
    # ---------------------------------------------------------
    if args.train:

        if args.seed is None:
            # No --seed given: run the default seed list below
            seeds = [1, 2]
            print(f"\n Running PPO with automatic multi-seed training: {seeds}\n")

            for s in seeds:
                print(f"\n========== TRAINING SEED {s} ==========\n")
                train_ppo(args.lr, args.timesteps, s)

        else:
            # Run a single seed
            train_ppo(args.lr, args.timesteps, args.seed)

    elif args.demo:
        if args.model is None:
            print("\n ERROR: missing --model path\n")
        else:
            run_demo(args.model, args.episodes)

    else:
        print("\n Please specify --train or --demo.\n")

WHAT IS COMMON TO THE LEARNING RATE IN ML, DL AND RL

This section answers the question:

What is common between the learning rate in machine learning, deep learning, and reinforcement learning?

The purpose of the section is to explain that although the algorithms are different (ML –> DL –> RL), the concept of the learning rate is identical in principle.

1. Gradient descent is the same concept

Whether we talk about ML, DL, or RL (that is, methods with neural networks), parameter updates are performed through gradient descent or its variants (SGD, Adam, RMSProp, etc.).

I wrote about gradient descent in another tutorial. For a better understanding of this concept, you can read the tutorial here: Gradient Descent.

The difference lies in where the gradient is applied:

  • in ML: on a simple model (linear regression, logistic regression),
  • in DL: on a neural network,
  • in RL: on a Q-network (DQN), on the policy (PPO), on the value network, etc.
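Written as a formula, the shared update looks like this. The symbols are assumptions introduced here for illustration: θ are the model parameters, α is the learning rate, and L is whatever loss the method defines.

```latex
\theta_{t+1} = \theta_t - \alpha \, \nabla_\theta L(\theta_t)
```

Only L changes between the three cases: a regression or classification loss in ML and DL, a temporal-difference loss for the Q-network in DQN, and the (negated) clipped surrogate objective for the policy in PPO. The role of α stays the same.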

2. LR controls stability, speed, and noise

Regardless of the domain:

  • low LR = slow learning
  • high LR = instability
  • optimal LR = speed + stability

These characteristics are universal for the learning rate in:

  • machine learning,
  • deep learning: CNN, RNN, Transformers,
  • neural Q-learning, DQN, PPO.
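A toy example makes these three regimes visible without any RL machinery. The sketch below runs plain gradient descent on f(x) = x², a function chosen here purely for illustration; the name `minimize_quadratic` and the three LR values are assumptions, not settings from the experiments above.

```python
def minimize_quadratic(lr, steps=20, x0=1.0):
    """Run plain gradient descent on f(x) = x**2 and return the final x."""
    x = x0
    for _ in range(steps):
        grad = 2.0 * x        # derivative of x**2
        x = x - lr * grad     # parameter update scaled by the learning rate
    return x

# The minimum is at x = 0; compare how close each LR gets in 20 steps
for lr in (0.01, 0.4, 1.1):
    print(f"lr={lr}: final x = {minimize_quadratic(lr):.4g}")
```

With the low LR, x is still far from 0 after 20 steps. The middle LR reaches the minimum almost exactly. With the high LR every step overshoots and |x| grows instead of shrinking.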

3. Learning rate schedulers are similar

Schedulers that modify LR over time include:

  • Step decay,
  • Exponential decay,
  • Cosine annealing,
  • Cyclical LR,
  • One-cycle policy,
  • ReduceLROnPlateau.

They appear in:

  • deep learning (Keras, PyTorch, TensorFlow),
  • advanced machine learning,
  • reinforcement learning based on neural networks (DQN, PPO, A2C).

In conclusion, the schedulers are the same, only applied in different contexts.
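As a rough sketch, three of these schedulers can be written as plain Python functions. The function names and the `progress_remaining` convention (1.0 at the start of training, 0.0 at the end) are choices made for this example; the convention mirrors the one Stable-Baselines3 uses when `learning_rate` is given as a callable.

```python
import math

def linear_schedule(initial_lr):
    """LR decays linearly from initial_lr to 0."""
    def schedule(progress_remaining):
        return initial_lr * progress_remaining
    return schedule

def exponential_schedule(initial_lr, final_lr):
    """LR decays exponentially from initial_lr to final_lr."""
    def schedule(progress_remaining):
        progress = 1.0 - progress_remaining
        return initial_lr * (final_lr / initial_lr) ** progress
    return schedule

def cosine_schedule(initial_lr, min_lr=0.0):
    """LR follows half a cosine wave from initial_lr down to min_lr."""
    def schedule(progress_remaining):
        progress = 1.0 - progress_remaining
        return min_lr + 0.5 * (initial_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
    return schedule

# LR at the start, middle, and end of training for each schedule
for name, sched in [("linear", linear_schedule(3e-4)),
                    ("exponential", exponential_schedule(1e-3, 1e-5)),
                    ("cosine", cosine_schedule(1e-3))]:
    print(name, [sched(p) for p in (1.0, 0.5, 0.0)])
```

In the PPO script above, one of these could be tried by passing it instead of a float, for example `learning_rate=linear_schedule(3e-4)`.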


MAJOR MISTAKES BEGINNERS MAKE

1. Choosing the LR “by intuition”

Beginners choose LR:

  • because “it worked in a tutorial”,
  • because “1e-3 seems like a standard value”,
  • because “someone on Reddit used 0.1”,
  • because “it’s okay in DL”.

But in RL, the same value can produce:

  • reward instability,
  • policy collapse,
  • complete divergence.

The biggest learning rate mistake is choosing the LR without running experiments and without looking at graphs.

2. Reducing the LR too late

In RL, if the beginner notices instability, they usually:

  • let the training continue hoping “it will stabilize”,
  • or reduce the LR after the model has already destroyed the policy.

This is a frequently encountered error. Once the policy deteriorates because the LR is too large, many episodes are wasted, and sometimes the agent never recovers.

3. Large LR in RL

In DL, a large LR can sometimes be tolerated for a few epochs. But in RL, a large LR:

  • amplifies reward noise,
  • destroys the policy,
  • produces Q-value explosion,
  • produces reward collapse.

It is the main cause of the classic question: “why is my model not learning?”

4. Lack of graph analysis

Beginners do not look at:

  • reward curve,
  • loss curve (DQN, PPO),
  • entropy (PPO),
  • value function predictions,
  • Q-value magnitude.

Without graph analysis, LR seems “magical” and hard to understand.
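Raw episode rewards are noisy, so it helps to smooth the curve before judging an LR. The sketch below is one simple way to do that with a moving average; the function name `moving_average`, the window size, and the synthetic reward series are assumptions for illustration, not data from the experiments above.

```python
import numpy as np

def moving_average(values, window=20):
    """Smooth a 1-D series with a simple moving average."""
    values = np.asarray(values, dtype=float)
    if len(values) < window:
        return values.copy()              # too short to smooth
    kernel = np.ones(window) / window
    # "valid" keeps only positions where the full window fits
    return np.convolve(values, kernel, mode="valid")

# Synthetic, made-up reward curve: slow improvement from -200 plus noise
rng = np.random.default_rng(0)
episode_rewards = -200.0 + np.linspace(0.0, 70.0, 300) + rng.normal(0.0, 15.0, size=300)
smoothed = moving_average(episode_rewards, window=20)
print(f"raw points: {len(episode_rewards)}, smoothed points: {len(smoothed)}")
```

Plotting `smoothed` next to the raw series makes it easier to tell a genuinely unstable policy from ordinary episode-to-episode noise.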

5. Lack of environment resetting

If the environment is not reset properly between episodes:

  • the agent starts episodes in strange states,
  • the reward becomes unpredictable,
  • the gradient becomes useless,
  • LR appears “wrong” even if it is not.

HOW TO QUICKLY FIND THE OPTIMAL LR (STEP-BY-STEP PROCEDURE)

In this section, I will explain a practical, fast, and concrete method to find “the best learning rate” without wasting hours testing random values.

It is a practical, tested procedure inspired by the work of Leslie Smith (creator of cyclical learning rates and the 1cycle policy).

1. The range test method inspired by Leslie Smith

Instead of trying random values like 1e-5, 1e-4, 1e-3, 1e-2, you use a systematic test:

Range Test / LR Finder Method

  1. you start with a very small LR
  2. you increase the LR exponentially at each batch
  3. you measure reward or loss
  4. you observe where:
    • learning starts
    • learning is optimal
    • learning explodes

The same idea is behind the learning rate finders in PyTorch Lightning and fast.ai. It is a systematic way to answer the question: how do you choose the learning rate?
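The four steps above can be sketched on a toy problem. The example below is a minimal NumPy illustration, not the implementation from any of those libraries: it fits a line to synthetic data, and the function name `lr_range_test`, the LR bounds, and the stopping threshold are all assumptions chosen for the demo.

```python
import numpy as np

def lr_range_test(min_lr=1e-5, max_lr=10.0, num_steps=100, batch_size=64, seed=0):
    """Sweep the LR exponentially, one value per mini-batch, and record the loss."""
    rng = np.random.default_rng(seed)

    # Synthetic regression data (illustration only): y = 3x + 1 + noise
    x = rng.normal(size=1024)
    y = 3.0 * x + 1.0 + 0.1 * rng.normal(size=1024)

    w, b = 0.0, 0.0                                  # model parameters
    lrs = np.geomspace(min_lr, max_lr, num_steps)    # exponentially spaced LRs
    losses = []

    for lr in lrs:
        idx = rng.integers(0, len(x), size=batch_size)   # sample one mini-batch
        error = w * x[idx] + b - y[idx]
        loss = float(np.mean(error ** 2))
        losses.append(loss)

        if not np.isfinite(loss) or loss > 1e6:          # stop once the loss has exploded
            break

        # Gradient descent step on the mean squared error with the current LR
        w -= lr * 2.0 * np.mean(error * x[idx])
        b -= lr * 2.0 * np.mean(error)

    return lrs[:len(losses)], np.array(losses)


lrs, losses = lr_range_test()
best = int(np.argmin(losses))
print(f"lowest loss {losses[best]:.4f} at lr={lrs[best]:.2e}, sweep ended at lr={lrs[-1]:.2e}")
```

Plotting `losses` against `lrs` on a logarithmic x-axis gives the curve to read. A common rule of thumb is to pick an LR somewhat below the point where the loss is lowest, since the sweep is already close to exploding there.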

2. Test with DQN or PPO

Here the idea is simpler:

  • Tabular Q-Learning does not use gradient descent, so it is not suitable for LR Finder,
  • DQN and PPO use gradient descent, so the range test can be applied to them.

Thus:

  • for DQN: you track loss + Q-value explosion,
  • for PPO: you track loss + reward stability + entropy.

3. Observing the graph: arrow, plateau, divergence

Here we interpret the result of the LR range test:

3.1 Arrow region: the graph increases slowly –> LR too small –> slow learning.

3.2 Plateau region:

  • reward increases steadily –> optimal LR,
  • loss decreases consistently –> ideal LR.

3.3 Divergence region:

  • reward jumps chaotically,
  • loss explodes,
  • Q-values become unstable.

This is exactly the visual signature of a learning rate that is too large.


CONCLUSION

The learning rate is the size of the steps the agent’s “brain” takes when learning.

If the step is too big, the agent jumps around chaotically and doesn’t learn anything.

If the step is too small, the agent barely moves and still doesn’t learn.

In RL:

  • Q-Learning: very big steps make it skip over solutions, and very small steps make it slow as a turtle,
  • DQN: it has a neural network brain, so it’s super sensitive. Only a small and stable learning rate makes it learn,
  • PPO: it can use slightly bigger steps, but if they’re too small, nothing happens, and if they’re too big, the policy goes crazy.

The right learning rate is like the right speed on a bicycle: if you pedal too hard you fall, and if you pedal too slowly you barely move.

Find the right speed and the agent learns beautifully.

Tags: Evaluation Metrics, Hyperparameter Tuning
About the author

About Dragos Calin

Dragos Calin is a robotics engineer and reinforcement learning practitioner focused on building real-world autonomous and remote-controlled robotics for agriculture, edge-AI robotics, and embedded platforms. His work joins simulation, machine learning, and hardware deployment, with a strong emphasis on practical, testable solutions that function outside the lab.

Areas of Expertise:

  • Reinforcement Learning for Robotics
  • Autonomous Agricultural Robots
  • Embedded Systems & Edge AI (Jetson, Raspberry Pi, Arduino)
  • Robotic Simulation & Sim2Real Workflow
  • Sensor Fusion & Control Systems
  • ROS-Based Robotics Development

© 2026 Reinforcement Learning Path
