AI Robotics: Tutorials, Practical Reinforcement Learning, and Real-World Control
  • RL Fundamentals
    • Learn to train intelligent agents that actually converge
      • RL FOUNDATION
        • Types of Reinforcement Learning
        • 1 Mathematical Foundations
          • 1.1 Vectors
          • 1.2 Derivatives
          • 1.3 Gradients
          • 1.4 Spaces
          • 1.5 Normalization
          • 1.6 Function Approximation
        • 2 Core RL Concepts
          • 2.1 Problem Classification
          • 2.2 Bellman Equation
          • 2.3 Model Free Learning
          • 2.4 Reward Shaping
          • 2.5 On-Policy vs Off-Policy Learning
          • 2.6 Agent
          • 2.7 Markov Decision Process(MDP)
        • 3 Learning Strategies
          • 3.1 Choosing RL Algorithm
          • 3.2 Epsilon-greedy
          • 3.3 SIM2REAL
          • 3.4 Experience Replay
          • 3.5 Curriculum Learning
          • 3.6 Isaac Sim
        • 4 Deep RL Techniques
          • 4.1 Backpropagation
          • 4.2 Weight Initialization
          • 4.3 Gradient Descent
          • 4.4 ReLU Activation Function
          • 4.5 Artificial Neuron
          • 4.6 Adam Optimization
          • 4.7 Convolutional Neural Network
        • 5 RL Algorithms
          • Q-Learning
          • Deep Q Network (DQN) – Formula and Explanation
          • Double DQN
          • Dueling DQN
          • Proximal Policy Optimization (PPO)
          • Soft Actor-Critic (SAC)
      • CLASSIC DEEP RL APPLICATION
        • PART 1: Deep RL with DQN and CNN
        • PART 2: Problem Definition
        • PART 3: Markov Decision Process (MDP)
        • PART 4: Choosing the Algorithm
        • PART 5: Environment + RL Model + Reward Function
        • PART 6: Training + Testing + Google Colab Access
    • Q-Learning
  • Deep RL Algorithms
    • DQN
    • PPO
    • SAC
  • Simulation & Environments
    • OpenAI Gymnasium
  • Tools, Code & Experiment Design
    • PyTorch
    • Stable-Baselines3

Deep Q-Learning – Build, Train, and Visualize with PyTorch, Gymnasium, and SB3

by Dragos Calin
in Deep RL Algorithms, DQN, PyTorch, Q-Learning, RL Fundamentals, Stable-Baselines3, Tools, Code & Experiment Design

In this tutorial, I’ll show you how to build the brain of a DQN agent, train it to master MountainCar, and finally watch it learn. All these steps use PyTorch, Gymnasium, and Stable Baselines3 – forming a complete, reusable DQN pipeline.

There’s a little agent, let’s call it the DQN Agent.
It struggles to climb a hill. It tries once, twice, a thousand times.
But after thousands of failures, something changes.
It learns how to accelerate exactly when it should. It learns to use gravity to its advantage. It learns purely from feedback. No instructions, no teacher.
Just experience.

That’s exactly what Deep Q-Learning does. And in this tutorial, you’ll build your own agent that learns this skill entirely from scratch.

Table of Contents

  • STEP 1: Environment Setup – Installing PyTorch, Gymnasium, and Stable Baselines3
  • STEP 2: Understanding Deep Q-Learning
  • STEP 3: The MountainCar Problem
  • STEP 4: Building the DQN Agent with Stable Baselines3
  • STEP 5: Training the Agent and Tracking Progress
  • STEP 6: Running the Trained Agent (Demo Mode)
  • Generalizing to Other Environments
  • Key Takeaways and Next Steps

STEP 1: Environment Setup – Installing PyTorch, Gymnasium, and Stable Baselines3

Before we start building the first DQN agent, we need to make sure the environment is ready.

Our agent will need three essential tools:

  • PyTorch, for building and running the neural network that learns,
  • Gymnasium, to provide the MountainCar environment,
  • Stable Baselines3, to manage the DQN training process.

If you haven’t installed them yet, I’ve written a detailed guide that covers every step of installation. The tutorial includes the installation steps for both Windows and Linux, and how to verify that everything works correctly.

You can follow it here: Tutorial: How to Install Stable-Baselines3 the Right Way (Windows & Linux): PyTorch + Gymnasium

Once your environment is ready, come back here and let’s build the DQN pipeline together.

If you already have your environment set up, make sure you have the following versions (or newer):

python --version        # >= 3.10  

To check that everything works, try running:

import torch, gymnasium, stable_baselines3
print('Setup complete!')

The result should display something like:

[Figure: Check environment setup for PyTorch, Gymnasium, and Stable Baselines3]
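
If you also want to see which versions are installed, a minimal sketch along these lines may help. It uses only Python's standard importlib.metadata module. The helper name report_versions and the package list are my own choices for illustration; they are not part of the installation guide.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as published on PyPI (assumed to match your install).
PACKAGES = ("torch", "gymnasium", "stable-baselines3")


def report_versions(packages=PACKAGES) -> dict:
    """Return {package: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in report_versions().items():
        print(f"{name:20s} {ver or 'NOT INSTALLED'}")
```

A package reported as NOT INSTALLED points to the wrong Conda environment or a skipped installation step.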

Now that the setup and environment are ready, it’s time to understand how Deep Q-Learning actually works and what makes it different from classical Q-Learning. Let’s see what’s happening inside the agent’s brain.


STEP 2: Understanding Deep Q-Learning

If we’ve reached the second step, it means that the setup with libraries, frameworks, and environment is ready to use.

The goal of step 2 is to understand how Deep Q-Learning works. In most cases, you’ll also find Deep Q-Learning referred to as Deep Q-Network (DQN).

In traditional Q-Learning, we map every state-action pair to a value and store these values in a Q-table.

Deep Q-Learning replaces that table with a neural network that estimates the Q-value of each possible action. This allows the agent to generalize across continuous or high-dimensional state spaces (e.g., positions, velocities, angles, pixel observations), where a table would be impractically large.

To train this network, DQN introduces three key components:

  • Policy Network and Target Network – one for learning, one for stability.
  • Replay Buffer – a memory that stores past experiences and helps the agent learn from them more efficiently.
  • Bellman Equation in Neural Form –  the core update rule that drives learning.
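
To make the third component more concrete, here is a rough sketch of the target value that the Bellman update pushes the policy network toward. It is written in plain Python so it runs without PyTorch. The function name td_target and the example numbers are illustrative assumptions; a real DQN implementation computes the same kind of quantity on batches of tensors.

```python
def td_target(reward: float, next_q_values: list, done: bool, gamma: float = 0.99) -> float:
    """One-step Bellman target: r + gamma * max over a' of Q_target(s', a').

    next_q_values are the target network's Q-value estimates for the next state.
    When the episode has terminated there is no future value to bootstrap from.
    """
    if done:
        return reward
    return reward + gamma * max(next_q_values)


# Hypothetical numbers: one MountainCar step (reward -1) with three action values.
target = td_target(reward=-1.0, next_q_values=[-42.0, -40.0, -41.5], done=False)
prediction = -41.0                  # what the policy network currently outputs for Q(s, a)
td_error = target - prediction      # the training loss is built from this difference
print(f"target={target:.2f}  td_error={td_error:.2f}")
```

The policy network produces the prediction, the target network produces next_q_values, and the replay buffer supplies the (state, action, reward, next state) samples that feed both.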

I’ve explained each of these elements in detail, including the DQN formula and full training loop, in this article: Deep Q Network (DQN) – Formula and Explanation

If you’ve already read that guide, you’re ready to move on.

If not, I strongly recommend checking it. It will make the rest of this tutorial much easier to follow.

Next, let’s apply this knowledge to the MountainCar environment and build the DQN agent in practice.


STEP 3: The MountainCar Problem

The MountainCar environment is a classic Reinforcement Learning problem. The goal is simple: a small car must reach the top of a hill, but its engine is too weak to climb directly.
The only way to succeed is to move back and forth, using gravity as an ally to build enough momentum.

This task demonstrates the concept of delayed rewards and the importance of long-term planning in RL. The agent needs to “see”, to know what it “can do”, and to know how it is “evaluated”. It uses states to see, actions to control, and rewards to know whether what it is doing is good or bad.

State Space (Observation Space)

The observation is a vector with 2 values:

  • car position, in the range [−1.2, 0.6]
  • car velocity, in the range [−0.07, 0.07]

These two values are continuous, meaning the agent must learn over a smooth range of states rather than fixed bins. So:

state=[position,velocity]

In DQN, these values are passed directly as input to the neural network, without discretization.

Action Space

There are 3 discrete actions:

  • 0 → push left
  • 1 → no push (do nothing)
  • 2 → push right

Reward Function

  • The agent receives −1 for each step (lost time),
  • The episode ends when the agent reaches the top of the mountain (position ≥ 0.5),
  • The goal is to minimize the total number of steps.

Note: The environment automatically ends (truncates) the episode after 200 steps if the goal hasn’t been reached.

Criterion of Success

The goal is clear: the episode ends successfully when the car reaches the goal position at the top of the right hill (position ≥ 0.5). The agent should learn to reach this point in the fewest possible steps.
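
To see why the car needs momentum, here is a simplified, dependency-free sketch of the MountainCar dynamics. The constants and the update rule follow my understanding of the Gymnasium implementation, so treat them as assumptions. The two policies (always_right and with_momentum) are hand-written illustrations, not what the DQN agent learns.

```python
import math

# Constants as I understand them from Gymnasium's MountainCar-v0 (assumptions).
FORCE, GRAVITY = 0.001, 0.0025
MIN_POS, MAX_POS, MAX_SPEED, GOAL_POS = -1.2, 0.6, 0.07, 0.5


def step(position: float, velocity: float, action: int):
    """Advance the car by one timestep; returns (position, velocity, reached_goal)."""
    velocity += (action - 1) * FORCE - math.cos(3 * position) * GRAVITY
    velocity = max(-MAX_SPEED, min(MAX_SPEED, velocity))
    position = max(MIN_POS, min(MAX_POS, position + velocity))
    if position == MIN_POS and velocity < 0:
        velocity = 0.0                      # the left wall stops the car
    return position, velocity, position >= GOAL_POS


def run_episode(policy, max_steps: int = 200) -> int:
    """Return the number of steps used; max_steps means the step limit was hit."""
    position, velocity = -0.5, 0.0          # fixed start; the real env randomizes it
    for t in range(1, max_steps + 1):
        position, velocity, reached = step(position, velocity, policy(position, velocity))
        if reached:
            return t
    return max_steps


def always_right(position: float, velocity: float) -> int:
    return 2                                # full throttle toward the goal


def with_momentum(position: float, velocity: float) -> int:
    return 2 if velocity >= 0 else 0        # push in the direction of travel


print("always right :", run_episode(always_right), "steps")
print("with momentum:", run_episode(with_momentum), "steps")
```

Pushing right all the time leaves the car stuck in the valley until the step limit, while pushing in the direction of travel builds up enough energy to reach the goal. The DQN agent has to discover a strategy of this kind from the −1 rewards alone.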

Now that we’ve defined the problem, we can start building the DQN model using Stable Baselines3.


STEP 4: Building the DQN Agent with Stable Baselines3

So far we have understood how MountainCar works, the DQN algorithm, and what the agent needs to do. Now we will build the agent that will learn on its own.

In this step, I will explain how to define the agent architecture and configure the learning parameters. We do not train or visualize anything yet; those come in the next steps.

In the previous step we saw that our agent observes two things (position and velocity) and can take three actions (left, right, nothing). At every step it receives a reward of −1 until it reaches the goal.

In this step, we will build the DQN agent using the Stable Baselines3 library. Basically, we will tell the computer what kind of brain the neural network has and how it will learn.

Everything we discussed in the chapter on Deep Q-Learning (how the network works, how to use gamma, replay buffer, etc.), we will now put into code.

4.1 Imports and code initialization

Before the agent can start learning, we need to give it all the parts and tools it needs. This is what “imports” do: they tell the computer which “toolboxes” to use.

For this application we will use Python. So in the Python script we will import Gymnasium, and from Stable Baselines3 we import the DQN algorithm.

import gymnasium as gym
from stable_baselines3 import DQN

4.2 Creating the environment

It’s time to tell the program where the agent will train. It’s like choosing a place for a robot to do its exercises. In our case, it’s the MountainCar hill.

env = gym.make("MountainCar-v0")

4.3 Policy choice and explanation

Here we tell the agent what kind of brain it will use to learn. In Stable Baselines3, this is called the policy.

policy = "MlpPolicy"

What this means:

“Mlp” stands for Multi-Layer Perceptron, which is a small artificial brain (neural network) that learns from data.

“Policy” is the part that decides the next action. That is, “what does the agent do now?”

4.4 Instantiating the DQN model

“Instantiate” is a technical word, but it means something simple. We’re actually creating the agent. That is, we’re building the artificial brain that will teach itself how to climb the hill.

The main line of code is:

model = DQN(policy, env, learning_rate=..., gamma=..., buffer_size=..., exploration_fraction=...)

This is the line of code that creates the DQN agent. Each part in the parentheses is an instruction for how the agent will learn.

4.4.1 policy

Here we choose what kind of brain we want. 

In most cases, you choose “MlpPolicy”: a brain made of simple, interconnected neurons.

If the agent learns from images (as in video games), we have to use “CnnPolicy”, an artificial brain that “sees” images.

Think of it as choosing whether our agent is “good at thinking” (Mlp) or “good at seeing” (Cnn). For MountainCar, we choose MlpPolicy, because the observation consists of just two numbers (position and velocity).

4.4.2 env

This is where we tell it where to train.

We’re using the MountainCar environment, so it’s like saying: “Hey, you’re going to train on the MountainCar hill.”

This is the exact same env that we created earlier with gym.make("MountainCar-v0").

4.4.3 learning_rate

This controls how fast the agent learns.

  • If it is too high -> it learns too fast and makes big mistakes.
  • If it is too low -> it learns very slowly and takes forever to catch on.

Usually, for DQN, we choose a learning_rate of 0.0001.
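
The effect of the two extremes can be illustrated on a toy problem. The sketch below runs plain gradient descent on f(x) = x², which has nothing to do with DQN itself; the function name gradient_descent and the three learning rates are chosen only for illustration, and the values that work here do not transfer to neural networks.

```python
def gradient_descent(learning_rate: float, steps: int = 50, start: float = 1.0) -> float:
    """Minimize f(x) = x**2 and return the final distance from the minimum at x = 0."""
    x = start
    for _ in range(steps):
        gradient = 2 * x                 # derivative of x**2
        x -= learning_rate * gradient
    return abs(x)


for lr in (1.1, 0.1, 0.0001):
    print(f"learning_rate={lr:<7} distance from optimum after 50 steps: {gradient_descent(lr):.4g}")
```

With the largest value every step overshoots the minimum and the error grows; with the smallest value the error barely moves in 50 steps. The same trade-off applies to the DQN optimizer, only on a much more complicated loss surface.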

4.4.4 gamma (γ – discount factor)

This tells how much the future matters to the agent.

If gamma is high (e.g. 0.99), the agent thinks about the future and favors actions that pay off in the long term. If gamma is low (e.g. 0.5), it focuses on immediate rewards.

I wrote a dedicated article for this “Discount Factor Explained – Why Gamma (γ) Makes or Breaks Learning.” There I explain exactly why this number can completely change the way the agent learns.
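
As a small worked example, the sketch below computes the discounted return of one MountainCar-style episode (−1 per step) for several values of gamma. The helper name discounted_return and the 150-step episode are assumptions made for illustration.

```python
def discounted_return(rewards, gamma: float) -> float:
    """Sum of gamma**t * r_t over one episode."""
    total = 0.0
    for reward in reversed(rewards):     # work backwards: G_t = r_t + gamma * G_{t+1}
        total = reward + gamma * total
    return total


episode = [-1.0] * 150                   # hypothetical episode: goal reached after 150 steps
for gamma in (0.5, 0.9, 0.99):
    value = discounted_return(episode, gamma)
    print(f"gamma={gamma}: return seen from the first step = {value:.2f}")
```

With gamma = 0.5 the return is about −2 no matter how long the episode lasts, so a faster solution looks no better than a slow one. With gamma = 0.99 the length of the episode clearly changes the return, which is what the agent needs in order to prefer shorter routes to the goal.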

4.4.5 buffer_size (Replay Buffer)

It’s like a memory where the agent remembers past experiences (states, actions, rewards).

Instead of learning only from the last attempt, the agent also learns from what it did before.

This helps it learn more stably and correctly.
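
The idea can be sketched in a few lines of standard-library Python. This is a minimal illustration of a replay buffer, not the implementation used inside Stable Baselines3; the class name ReplayBuffer and its methods are my own.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size memory of (state, action, reward, next_state, done) transitions."""

    def __init__(self, buffer_size: int):
        self.memory = deque(maxlen=buffer_size)   # oldest transitions drop out automatically

    def add(self, state, action, reward, next_state, done) -> None:
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int) -> list:
        """Draw a random mini-batch, which breaks the correlation between consecutive steps."""
        return random.sample(list(self.memory), batch_size)

    def __len__(self) -> int:
        return len(self.memory)


buffer = ReplayBuffer(buffer_size=3)
for t in range(5):                                # store 5 dummy transitions in a buffer of 3
    buffer.add(state=(t, 0.0), action=1, reward=-1.0, next_state=(t + 1, 0.0), done=False)

print(len(buffer))                                # 3: the two oldest transitions were discarded
print(buffer.sample(batch_size=2))
```

buffer_size plays the role of maxlen here: a larger buffer keeps older experience around for longer, at the cost of more memory.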

4.4.6 exploration_fraction

This controls how much the agent experiments.

At first, the agent knows nothing, so it has to try many different actions (exploration).

As it learns, it starts doing the things that brought it good rewards more and more often (exploitation).

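exploration_fraction sets the fraction of the total training steps over which the exploration rate (epsilon) decays from its initial value to its final value. After that, epsilon stays at its final value.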
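
A linear decay of this kind can be sketched as follows. The function epsilon_at is a hypothetical helper that mirrors how I understand the schedule to behave; the default values are the ones used later in this tutorial's script.

```python
def epsilon_at(step: int, total_timesteps: int, exploration_fraction: float = 0.3,
               initial_eps: float = 1.0, final_eps: float = 0.05) -> float:
    """Linearly decay epsilon over the first exploration_fraction of training, then hold it."""
    decay_steps = exploration_fraction * total_timesteps
    progress = min(step / decay_steps, 1.0)
    return initial_eps + progress * (final_eps - initial_eps)


total = 800_000
for step in (0, 120_000, 240_000, 800_000):
    print(f"step {step:>7}: epsilon = {epsilon_at(step, total):.3f}")
```

With 800,000 training steps and exploration_fraction=0.3, epsilon reaches its minimum after 240,000 steps, and the agent spends the remaining 70% of training mostly exploiting what it has learned.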


STEP 5: Training the Agent and Tracking Progress

So far we have understood the application, the algorithm, and how to create a DQN agent. In this step, I will explain how to run the training and how to read the console logs and TensorBoard curves to understand what is happening.

5.1 What happens when training starts

What happens internally when model.learn() starts?

  • the agent observes the state (position, velocity),
  • chooses an action (left/right/nothing),
  • receives reward,
  • saves the transition in the replay buffer,
  • uses batches from the buffer to update the network.

All of these steps are done as a result of running the following function from our Python file. When you run this function, imagine the car trying over and over again to climb the hill. Each episode teaches it a little more about how to use gravity, momentum, and timing.

def train_agent(model: DQN, total_timesteps: int = 200_000) -> None:
    """Train the DQN agent."""
    print("\n   Starting training...\n")
    model.learn(total_timesteps=total_timesteps, log_interval=10)
    model.save("dqn_mountaincar_model")
    print("\nTraining complete. Model saved as 'dqn_mountaincar_model.zip'")

Line by line function explanation

def train_agent(model: DQN, total_timesteps: int = 200_000) -> None:

This is a function that starts the training of the agent.

The model parameter is the DQN agent built in the previous step.

total_timesteps represents how long the agent will train (the total number of experience steps in the environment). Here 200_000 is the default value. When we run the Python file, we can pass a different value for total_timesteps from the command line. Then the function will use that new value instead of the default one.

    model.learn(total_timesteps=total_timesteps, log_interval=10)

This is the most important line in the function. This is where the internal learning loop in Stable Baselines3 starts.

model.learn() does everything automatically:

  • the agent interacts with the environment,
  • collects experiences (state, action, reward, next_state),
  • saves transitions in the Replay Buffer,
  • extracts mini-batches and updates the neural network,
  • periodically synchronizes the target network,
  • and logs the metrics (ep_rew_mean, loss, exploration_rate) at every interval.

The parameter log_interval=10 means that every 10 episodes, SB3 will display the progress in the console (average reward, episode length, etc.).

model.save("dqn_mountaincar_model")

After the training is complete, this line saves the trained model to a ZIP file.

The file contains:

  • the weights of the neural network,
  • the optimizer state,
  • information about the policy and environment.

We can later load this file for demo or fine-tuning.

5.2 Project Structure

Save the file below as dqn_mountaincar.py inside a new folder (for example: C:\Users\<your_name>\mountainCar). This keeps all training logs and models neatly organized.

"""
DQN Training and Demo Script for MountainCar-v0
------------------------------------------------
Train or test a Deep Q-Learning agent using Stable Baselines3 (SB3).
Usage:
  python dqn_mountaincar.py --train --timesteps 800_000   # Train a new agent
  python dqn_mountaincar.py --demo                        # Run the trained agent demo

Author: Calin Dragos George
Created: 2025-11-10
"""

import argparse
import gymnasium as gym
from stable_baselines3 import DQN
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.logger import configure


def create_env(env_name: str = "MountainCar-v0", render: bool = False) -> gym.Env:
    """Create and wrap the Gymnasium environment."""
    if render:
        env = gym.make(env_name, render_mode="human")
    else:
        env = gym.make(env_name)
        env = Monitor(env)
    return env


def create_model(env: gym.Env) -> DQN:
    """Initialize the DQN model with core hyperparameters."""
    model = DQN(
        policy="MlpPolicy",
        env=env,
        learning_rate=1e-4,              # optimizer step size
        gamma=0.99,                      # discount factor (see article on gamma)
        buffer_size=50_000,              # replay buffer size
        learning_starts=1_000,           # delay before updates start
        batch_size=128,                  # batch size for gradient step
        train_freq=4,                    # train every n steps
        target_update_interval=2_000,    # how often to sync target network
        exploration_fraction=0.3,        # fraction of total steps for epsilon decay
        exploration_initial_eps=1.0,     # start with full exploration
        exploration_final_eps=0.05,      # minimum exploration
        verbose=1,
        tensorboard_log="./logs/"
    )
    return model


def train_agent(model: DQN, total_timesteps: int = 200_000) -> None:
    """Train the DQN agent."""
    print("\n   Starting training...\n")
    model.learn(total_timesteps=total_timesteps, log_interval=10)
    model.save("dqn_mountaincar_model")
    print("\nTraining complete. Model saved as 'dqn_mountaincar_model.zip'")


def demo_agent(env_name: str = "MountainCar-v0", n_episodes: int = 10) -> None:
    """Run the trained agent for multiple demo episodes."""
    print(f"\n Starting demo mode for {n_episodes} episodes...\n")
    env = create_env(env_name, render=True)
    model = DQN.load("dqn_mountaincar_model", env=env)

    total_rewards = []

    for episode in range(n_episodes):
        obs, _ = env.reset()
        done = False
        episode_reward = 0

        while not done:
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            episode_reward += reward

        total_rewards.append(episode_reward)
        print(f"Episode {episode + 1} | Reward: {episode_reward:.2f}")

    mean_reward = sum(total_rewards) / len(total_rewards)
    print(f"\n Demo finished | Mean reward over {n_episodes} episodes: {mean_reward:.2f}")
    env.close()


def parse_args() -> argparse.Namespace:
    """Parse command line arguments."""
    parser = argparse.ArgumentParser(description="Train or run DQN agent for MountainCar.")
    parser.add_argument("--train", action="store_true", help="Train the DQN agent")
    parser.add_argument("--demo", action="store_true", help="Run demo with trained agent")
    parser.add_argument("--timesteps", type=int, default=200_000, help="Total timesteps for training")
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    # No manual logger setup is needed: SB3 configures logging from verbose=1
    # and tensorboard_log="./logs/" (see create_model) when learn() is called.

    if args.train:
        env = create_env()
        model = create_model(env)
        train_agent(model, total_timesteps=args.timesteps)

    elif args.demo:
        demo_agent()

    else:
        print("\n Please specify a mode:")
        print("   python dqn_mountaincar.py --train   (to train a new agent)")
        print("   python dqn_mountaincar.py --demo    (to run a demo)\n")

Before training the agent, make sure your Conda environment is activated and your files are organized like this:

mountainCar/
│
├── dqn_mountaincar.py        ← main script (training + demo)
├── logs/                     ← TensorBoard logs (auto-generated)
└── dqn_mountaincar_model.zip ← trained model (created after training)

5.3 Running the training

This is how to run the training:

python dqn_mountaincar.py --train --timesteps 800000

Explanation:

  • --train → starts training mode,
  • --timesteps → total number of steps (e.g. 800K for good results).

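You can stop training and resume it later by loading the saved model with DQN.load() and calling learn() again. Note that the script above always creates a new model in --train mode, so resuming requires a small addition.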
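
One possible way to resume is sketched below. The helper resume_training is hypothetical and not part of the script above; it assumes the model was saved under the default name and that the DQN.load, learn, and save calls behave as documented by Stable Baselines3.

```python
def resume_training(model_path: str = "dqn_mountaincar_model",
                    env_name: str = "MountainCar-v0",
                    extra_timesteps: int = 200_000) -> None:
    """Load a saved DQN model and continue training it (hypothetical helper)."""
    # Imported inside the function so this file can be imported without SB3 installed.
    import gymnasium as gym
    from stable_baselines3 import DQN
    from stable_baselines3.common.monitor import Monitor

    env = Monitor(gym.make(env_name))
    model = DQN.load(model_path, env=env)
    # reset_num_timesteps=False keeps the step counter (and the log curves) continuous.
    model.learn(total_timesteps=extra_timesteps, reset_num_timesteps=False, log_interval=10)
    model.save(model_path)
    env.close()
```

Keep in mind that, as far as I know, the saved ZIP file does not contain the replay buffer, so a resumed run starts with an empty memory unless you store it separately with model.save_replay_buffer() and restore it with model.load_replay_buffer().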

5.4 What the console output means during training

When you run the script in training mode, the console displays logs like this:

[Figure: Console log in training mode]

rollout/

PARAMETER | MEANING
ep_len_mean | The average episode length (how many steps the agent takes before the episode ends). Example: 149 means that, on average, each episode lasts 149 steps. Shorter episodes usually mean better performance.
ep_rew_mean | The average reward per episode. This shows if the agent is improving. Here it’s -149, which is good for MountainCar (close to solving the task).
exploration_rate | The current epsilon (ε) used for exploration. 0.05 means the agent is still taking random actions 5% of the time to keep exploring.

time/

PARAMETER | MEANING
episodes | Total number of episodes completed so far (4330).
fps | Frames per second. How fast the training loop is running. Higher = faster training (depends on CPU/GPU speed).
time_elapsed | Total training time in seconds (642 s ≈ 10 min).
total_timesteps | The total number of interactions between the agent and the environment. Each “timestep” = one action taken and processed.

train/

PARAMETER | MEANING
learning_rate | The current step size for updating the neural network (0.0001). Smaller values = slower but more stable learning.
loss | The training loss of the neural network. How far predictions are from target Q-values. Smaller and stable values mean the model is converging.
n_updates | How many times the neural network has been updated so far (≈ 195 806).
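
The n_updates value can be related to the hyperparameters in the script (learning_starts=1_000, train_freq=4, target_update_interval=2_000). The sketch below is a back-of-the-envelope estimate that assumes one gradient step per training call; both function names are hypothetical.

```python
def expected_updates(total_timesteps: int, learning_starts: int = 1_000,
                     train_freq: int = 4, gradient_steps: int = 1) -> int:
    """Rough count of gradient updates DQN performs for a given number of env steps."""
    training_steps = max(0, total_timesteps - learning_starts)
    return (training_steps // train_freq) * gradient_steps


def expected_target_syncs(total_timesteps: int, target_update_interval: int = 2_000) -> int:
    """Rough count of times the target network is overwritten with the policy network."""
    return total_timesteps // target_update_interval


print(expected_updates(800_000))         # gradient updates over a full 800K-step run
print(expected_target_syncs(800_000))    # target network synchronizations
```

Under the same assumption, the roughly 195,806 updates in the log correspond to about 784,000 timesteps, which fits a log line printed near the end of an 800K-step run.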

5.5 Saving the trained model

model.save("dqn_mountaincar_model")

The saved file contains the network weights, the agent’s hyperparameters, and the observation and action space definitions. The replay buffer is not included.
The dqn_mountaincar_model.zip file can be loaded later for demo or fine-tuning.

5.6 Tracking progress with TensorBoard

[Figure: TensorBoard learning curve for MountainCar with DQN]

Note: If you don’t have TensorBoard installed, follow the steps in this tutorial: How to Install OpenAI Gymnasium in Windows and Launch Your First Python RL Environment

This curve in TensorBoard shows exactly the kind of realistic learning I want to show in this tutorial about DQN. It clearly illustrates how the agent goes through successive phases of exploration, progress, and temporary instability before stabilizing.

The curve represents ep_rew_mean (average reward per episode) as a function of the number of timesteps.
The values are between −200 (total failure) and −140 (good performance). The goal is to see if the agent learns an increasingly efficient policy (the curve increases in the long run).
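
It helps to remember that ep_rew_mean is a moving average, not the reward of a single episode. The sketch below assumes a window of the last 100 episodes, which is my understanding of the SB3 default; the function name and the episode rewards are made up for illustration.

```python
from collections import deque


def rolling_mean_rewards(episode_rewards, window: int = 100) -> list:
    """Mean of the most recent `window` episode rewards, recorded after every episode."""
    recent = deque(maxlen=window)
    curve = []
    for reward in episode_rewards:
        recent.append(reward)
        curve.append(sum(recent) / len(recent))
    return curve


# Hypothetical run: 100 failed episodes followed by 100 episodes solved in 150 steps.
rewards = [-200.0] * 100 + [-150.0] * 100
curve = rolling_mean_rewards(rewards)
print(curve[99], curve[149], curve[199])
```

Because of the averaging window, the curve lags behind the agent's real performance: even a sudden improvement shows up as a gradual slope.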

What the shape of the curve tells us

It shows:

  • a long exploration phase,
  • a clear discovery of strategy,
  • a few stabilizing oscillations,
  • and a partial convergence towards a viable policy.

This is a sign of authentic deep reinforcement learning. DQN is known for such “learning waves”, because each update depends on which transitions are sampled from the replay buffer and on the periodic synchronization of the target network.


STEP 6: Running the Trained Agent (Demo Mode)

[Figure: Running the Trained Agent (Demo Mode) for MountainCar]

After training, it’s time to see what the agent has actually learned.

In this step, I’ll show you how to load the trained model, run it in demo mode, and visualize how it moves inside the MountainCar environment.

6.1 Loading the trained model

model = DQN.load("dqn_mountaincar_model", env=env)

What this line does:

  • Loads the file saved at the end of the training,
  • The file contains the network weights and agent configuration.

6.2 Setting up the environment for rendering

env = gym.make("MountainCar-v0", render_mode="human")

This Python line:

  • creates the environment with a visible window,
  • without render_mode="human", you will not see the graphics window,
  • Gymnasium renders the car’s movement in real time, so you can watch the effect of the agent’s choices.

6.3 Predicting actions and running episodes

The main logic of the demo loop:

obs, _ = env.reset()
done = False
while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated

What needs to be emphasized here is that:

  • model.predict() -> the agent chooses an action based on the trained neural network,
  • deterministic=True -> the agent stops exploring and always picks the action with the highest predicted Q-value,
  • env.step(action) -> the environment executes the action and returns the next state + reward.

During training, the agent takes a random action with probability ε (the exploration rate).

During the demo, it acts deterministically, based only on what it has learned.

6.4 Multiple demo episodes

We can run multiple consecutive episodes:

for episode in range(10):
    ...

The goal is to see the consistency of behavior.

  • A good agent will succeed in 7 – 9 out of 10 episodes,
  • An unstable agent will succeed in only a few.

In my tests, the trained agent succeeded in most runs, with rewards around −143 to −150, proving that it learned to use gravity to reach the goal.
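
To turn a list of demo rewards into a success rate, a small helper like the one below can be used. It treats any episode with a total reward above −200 as a success, which assumes the 200-step limit and the −1 per step reward described earlier. The function name and the sample rewards are illustrative; they are not results from my runs.

```python
def success_rate(episode_rewards, step_limit: int = 200) -> float:
    """Fraction of episodes that ended before the step limit (reward above -step_limit)."""
    if not episode_rewards:
        return 0.0
    successes = sum(1 for reward in episode_rewards if reward > -step_limit)
    return successes / len(episode_rewards)


# Hypothetical demo results, not taken from the runs in this article.
demo_rewards = [-143.0, -150.0, -200.0, -147.0, -145.0, -200.0, -149.0, -144.0, -146.0, -148.0]
print(f"Success rate: {success_rate(demo_rewards):.0%}")
```

In demo_agent(), the total_rewards list already holds the values this helper needs.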

6.5 Running the demo

The Python file above contains both the training and demo modes. Here’s how to run the demo:

python dqn_mountaincar.py --demo

6.6 Example output

[Figure: Demo 1 MountainCar and DQN]
[Figure: Demo 2 MountainCar and DQN]

These scores correspond directly to the reward curve you saw in the TensorBoard graphic. Some episodes succeed, others fail. But overall, the trend shows consistent learning.


Generalizing to Other Environments

In this part of the tutorial, I’ll show you how to reuse the exact same pipeline for CartPole, Acrobot, LunarLander, etc.

The architecture and code from this tutorial can be reused with minimal modification. The Python file is not tied to MountainCar.

DQN is a model-free algorithm, so it does not depend on the physics of the environment, only on the interface (state, action, reward). The one requirement is a discrete action space, which CartPole, Acrobot, and LunarLander all have.

How to switch environments

To change the environment, we need to change a single line.

For MountainCar we have the line with this environment:

env = gym.make("MountainCar-v0")

To change the environment, we just need to modify this line. For example, to switch from MountainCar to LunarLander, the line becomes:

env = gym.make("LunarLander-v2")  # needs the Box2D extra; named "LunarLander-v3" in Gymnasium 1.0 and later

Everything else (the DQN structure, training, and tracking) remains exactly the same.

Typical parameter changes

Depending on the training environment, we need to make small adjustments:

PARAMETER | WHAT WE MODIFY | WHY WE MODIFY
learning_rate | 3e-4 for CartPole | small networks learn quickly
gamma | 0.95 for CartPole, 0.99 for Lander | difference between short/long tasks
total_timesteps | 200K – 1M | depends on the complexity
buffer_size | 50K – 200K | for more complex tasks

The more complex the environment, the longer the training and the larger the replay buffer should be.
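
One way to keep these adjustments organized is a small lookup of per-environment overrides. This is only a sketch: the names DEFAULTS, OVERRIDES, and hyperparameters_for are my own, and the LunarLander buffer size is an assumed pick from the 50K – 200K range in the table.

```python
DEFAULTS = {"learning_rate": 1e-4, "gamma": 0.99, "buffer_size": 50_000}

# Per-environment overrides based on the table above; anything not listed keeps its default.
OVERRIDES = {
    "MountainCar-v0": {},
    "CartPole-v1": {"learning_rate": 3e-4, "gamma": 0.95},
    "LunarLander-v2": {"gamma": 0.99, "buffer_size": 200_000},   # buffer size is an assumption
}


def hyperparameters_for(env_name: str) -> dict:
    """Merge the defaults with the overrides for one environment."""
    return {**DEFAULTS, **OVERRIDES.get(env_name, {})}


for name in OVERRIDES:
    print(name, hyperparameters_for(name))
```

The resulting dictionary could then be unpacked into the constructor, for example DQN("MlpPolicy", env, **hyperparameters_for(env_name)), so that switching environments stays a one-argument change.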


Key Takeaways and Next Steps

In this tutorial, we built, trained, and visualized a Deep Q-Learning agent from scratch using PyTorch, Gymnasium, and Stable Baselines3.

We started from the MountainCar problem, created the environment, and built the DQN agent. Then we trained the model and finally watched it succeed through trial and error.

In this tutorial:

  • You learned how DQN replaces Q-tables with a neural network,
  • You saw how experience replay and target networks stabilize learning,
  • You tracked progress with TensorBoard and interpreted the learning curve,
  • You ran and visualized your own trained agent,
  • You now have a complete DQN pipeline reusable for any Gymnasium environment with a discrete action space.

Once you have the complete code, you can experiment further. You can upgrade the algorithm, for example with Double DQN, which aims to reduce the overestimation of Q-values, or Dueling DQN.

Or you can upgrade the environment and train on LunarLander-v2 or CartPole-v1.

Reinforcement learning is not just about algorithms. It’s about persistence.
Like the MountainCar, every failed episode teaches you something that brings you closer to the goal.

Keep experimenting, keep training.

Tags: Bellman Equation, Hyperparameter Tuning, Q-Function, Replay Buffer
About the author

About Dragos Calin

Dragos Calin is a robotics engineer and reinforcement learning practitioner focused on building real-world autonomous and remote-controlled robotics for agriculture, edge-AI robotics, and embedded platforms. His work joins simulation, machine learning, and hardware deployment, with a strong emphasis on practical, testable solutions that function outside the lab.

Areas of Expertise:

  • # Reinforcement Learning for Robotics
  • # Autonomous Agricultural Robots
  • # Embedded Systems & Edge AI (Jetson, Raspberry Pi, Arduino)
  • # Robotic Simulation & Sim2Real Workflow
  • # Sensor Fusion & Control Systems
  • # ROS-Based Robotics Development


© 2026 Reinforcement Learning Path
