In this tutorial, I’ll show you how to build the brain of a DQN agent, train it to master MountainCar, and finally watch it learn. All these steps use PyTorch, Gymnasium, and Stable Baselines3 – forming a complete, reusable DQN pipeline.
There’s a little agent, let’s call it the DQN Agent.
It struggles to climb a hill. It tries once, twice, a thousand times.
But after thousands of failures, something changes.
It learns how to accelerate exactly when it should. It learns to use gravity to its advantage. It learns purely from feedback. No instructions, no teacher.
Just experience.
That’s exactly what Deep Q-Learning does. And in this tutorial, you’ll build your own agent that learns this skill entirely from scratch.
Table of Contents
- STEP 1: Environment Setup – Installing PyTorch, Gymnasium, and Stable Baselines3
- STEP 2: Understanding Deep Q-Learning
- STEP 3: The MountainCar Problem
- STEP 4: Building the DQN Agent with Stable Baselines3
- STEP 5: Training the Agent and Tracking Progress
- STEP 6: Running the Trained Agent (Demo Mode)
- Generalizing to Other Environments
- Key Takeaways and Next Steps
STEP 1: Environment Setup – Installing PyTorch, Gymnasium, and Stable Baselines3
Before we start building the first DQN agent, we need to make sure the environment is ready.
Our agent will need three essential tools:
- PyTorch, for building and running the neural network that learns,
- Gymnasium, to provide the MountainCar environment,
- Stable Baselines3, to manage the DQN training process.
If you haven’t installed them yet, I’ve written a detailed guide that covers every step of installation. The tutorial includes the installation steps for both Windows and Linux, and how to verify that everything works correctly.
You can follow it here: Tutorial: How to Install Stable-Baselines3 the Right Way (Windows & Linux): PyTorch + Gymnasium
Once your environment is ready, come back here and let’s build the DQN pipeline together.
If you already have your environment set up, make sure you have the following versions (or newer):
python --version # >= 3.10
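The library versions can be checked from Python as well. The following is a minimal sketch that only reads package metadata, so it works even if one of the libraries is missing; the distribution names in PACKAGES are assumed to be the usual PyPI names, and installed_versions is a helper name made up for this example.

```python
from importlib import metadata

# Distribution names assumed to match PyPI; adjust if you installed from another source.
PACKAGES = ("torch", "gymnasium", "stable-baselines3")


def installed_versions(packages=PACKAGES):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```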
To check that everything works, try running:
import torch, gymnasium, stable_baselines3
print('Setup complete!')
If all three imports succeed, the script prints Setup complete! and exits without errors.

Now that the setup and environment are ready, it’s time to understand how Deep Q-Learning actually works and what makes it different from classical Q-Learning. Let’s see what’s happening inside the agent’s brain.
STEP 2: Understanding Deep Q-Learning
If we’ve reached the second step, it means that the setup with libraries, frameworks, and environment is ready to use.
The goal of step 2 is to understand how Deep Q-Learning works. In most cases, you’ll also find Deep Q-Learning referred to as Deep Q-Network (DQN).
In traditional Q-Learning, we map every state-action pair to a value and store these values in a Q-table.
Deep Q-Learning replaces that table with a neural network that estimates the Q-value of each possible action for a given state. This allows the agent to generalize across continuous or high-dimensional state spaces (e.g., positions, velocities, angles, pixel observations), where a table would need one entry per state-action pair.
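To get a feel for why a table stops scaling, here is a small arithmetic sketch. It assumes we cut every state dimension into the same number of bins; the bin count of 20 and the helper name q_table_entries are arbitrary choices for illustration.

```python
def q_table_entries(bins_per_dimension: int, n_dimensions: int, n_actions: int) -> int:
    """Number of Q-values a table needs when each state dimension is cut into bins."""
    n_states = bins_per_dimension ** n_dimensions
    return n_states * n_actions


# A MountainCar-like case: 2 state dimensions, 3 actions.
print(q_table_entries(20, 2, 3))   # 1200
# Same resolution with 8 state dimensions and 4 actions.
print(q_table_entries(20, 8, 4))   # 102400000000
```

A neural network avoids this blow-up because it takes the raw state values as input instead of looking up a bin.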
To train this network, DQN introduces three key components:
- Policy Network and Target Network – one for learning, one for stability.
- Replay Buffer – a memory that stores past experiences and helps the agent learn from them more efficiently.
- Bellman Equation in Neural Form – the core update rule that drives learning.
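The Bellman update in the last bullet can be sketched in a few lines of plain Python. This is a simplified illustration, not the Stable Baselines3 implementation: next_q_values stands for the target network's outputs for the next state, the numbers are hypothetical, and a squared error is used here for simplicity (real implementations often use a Huber-style loss instead).

```python
def td_target(reward, next_q_values, terminated, gamma=0.99):
    """Bellman target: r + gamma * max_a' Q_target(s', a'), or just r at a terminal state."""
    if terminated:
        return reward
    return reward + gamma * max(next_q_values)


def squared_td_error(predicted_q, target):
    """Loss for one transition: how far the policy network's prediction is from the target."""
    return (predicted_q - target) ** 2


# Hypothetical values: the target network rates the three actions in the next state.
target = td_target(reward=-1.0, next_q_values=[-50.0, -48.0, -49.0], terminated=False)
print(target)                            # about -48.52
print(squared_td_error(-47.0, target))   # the policy network predicted -47.0
```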
I’ve explained each of these elements in detail, including the DQN formula and full training loop, in this article: Deep Q Network (DQN) – Formula and Explanation
If you’ve already read that guide, you’re ready to move on.
If not, I strongly recommend checking it. It will make the rest of this tutorial much easier to follow.
Next, let’s apply this knowledge to the MountainCar environment and build the DQN agent in practice.
STEP 3: The MountainCar Problem
The MountainCar environment is a classical application in Reinforcement Learning. The goal of this environment is simple. A small car must reach the top of a hill, but the engine is too weak to climb directly.
There is one way to do it. The only way to succeed is to move back and forth, using gravity as an ally to gain enough momentum.
This task is a good demonstration of delayed rewards and the importance of long-term planning in RL. The agent needs to “see”, to know what it “can do”, and to know how it is “evaluated”. It uses states to see, actions to control, and rewards to know whether what it is doing is good or bad.
State Space (Observation Space)
The observation is a vector with 2 values:
- car position with values between [−1.2, 0.6]
- car velocity with values between [−0.07, 0.07]
These two values are continuous, meaning the agent must learn over a smooth range of states rather than fixed bins. So:
state=[position,velocity]
In DQN, these values are passed directly as input to the neural network, without discretization.
Action Space
There are 3 discrete actions:
- 0 → push left
- 1 → no push (do nothing)
- 2 → push right
Reward Function
- The agent receives −1 for each step (lost time),
- The episode ends when the agent reaches the top of the mountain (position ≥ 0.5),
- The goal is to minimize the total number of steps.
Note: Gymnasium truncates the episode after 200 steps if the goal hasn’t been reached, so the lowest possible episode reward is −200.
Criterion of Success
The goal is clear: the episode ends successfully when the car reaches the goal position at the top of the right hill (position ≥ 0.5). The agent should learn to reach this point in the fewest possible steps.
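To see why the car has to swing back and forth, here is a small stand-alone simulation. It is a sketch that does not use Gymnasium: the physics constants (FORCE, GRAVITY) and the update rule are my reading of the MountainCar-v0 dynamics and should be treated as assumptions, the start position is fixed at −0.5 instead of being random, and both policies are hand-written rules, not learned ones.

```python
import math

MIN_POSITION, MAX_POSITION = -1.2, 0.6
MAX_SPEED = 0.07
GOAL_POSITION = 0.5
FORCE, GRAVITY = 0.001, 0.0025   # assumed constants, see the note above
MAX_STEPS = 200


def step(position, velocity, action):
    """Advance the car by one timestep; action is 0 (left), 1 (none) or 2 (right)."""
    velocity += (action - 1) * FORCE - math.cos(3 * position) * GRAVITY
    velocity = max(-MAX_SPEED, min(MAX_SPEED, velocity))
    position += velocity
    position = max(MIN_POSITION, min(MAX_POSITION, position))
    if position == MIN_POSITION and velocity < 0:
        velocity = 0.0   # the car stops at the left wall
    return position, velocity


def run_episode(policy, start_position=-0.5):
    """Return (total_reward, reached_goal) for one episode under the given policy."""
    position, velocity = start_position, 0.0
    total_reward = 0.0
    for _ in range(MAX_STEPS):
        position, velocity = step(position, velocity, policy(position, velocity))
        total_reward -= 1.0   # -1 for every step taken
        if position >= GOAL_POSITION:
            return total_reward, True
    return total_reward, False


def always_right(position, velocity):
    """Naive policy: push right at every step."""
    return 2


def follow_momentum(position, velocity):
    """Push in the direction the car is already moving."""
    return 2 if velocity >= 0 else 0


print("always right:   ", run_episode(always_right))
print("follow momentum:", run_episode(follow_momentum))
```

Pushing right all the time never reaches the flag within 200 steps, while pushing in the direction of the current velocity builds up enough momentum. The DQN agent has to discover a strategy like the second one from the −1 rewards alone.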
Now that we’ve defined the problem, we can start building the DQN model using Stable Baselines3.
STEP 4: Building the DQN Agent with Stable Baselines3
So far we have understood how MountainCar works, the DQN algorithm, and what the agent needs to do. Now we will build the agent that will learn on its own.
In this step, I will explain how to define the agent architecture and configure the learning parameters. We do not need to “train” or “visualize” yet; I will cover those in the next steps.
In the previous step we saw that our agent sees two things (position and speed). It can do three actions (left, right, nothing). And depending on what action it does, it receives a reward at each step.
In this step, we will build the DQN agent using the Stable Baselines3 library. Basically, we will tell the computer what kind of brain the neural network has and how it will learn.
Everything we discussed in the chapter on Deep Q-Learning (how the network works, how to use gamma, replay buffer, etc.), we will now put into code.
4.1 Imports and code initialization
Before the agent can start learning, we need to give it all the parts and tools it needs. This is what “imports” are for: you tell the computer which “toolboxes” to use.
For this application we will use Python. So in the Python script we will import Gymnasium, and from Stable Baselines3 we import the DQN algorithm.
import gymnasium as gym
from stable_baselines3 import DQN
4.2 Creating the environment
It’s time to tell the program where the agent will train. It’s like choosing a place for a robot to do its exercises. In our case, it’s the MountainCar hill.
env = gym.make("MountainCar-v0")
4.3 Policy choice and explanation
Here we tell the agent what kind of brain it will use to learn. In Stable Baselines3, this is called the policy.
policy = "MlpPolicy"
What does this mean:
“Mlp” stands for Multi-Layer Perceptron, which is a small artificial brain (neural network) that learns from data.
“Policy” is the part that decides the next action. That is, “what does the agent do now?”
4.4 Instantiating the DQN model
“Instantiate” is a technical word, but it means something simple. We’re actually creating the agent. That is, we’re building the artificial brain that will teach itself how to climb the hill.
The main line of code is:
model = DQN(policy, env, learning_rate=..., gamma=..., buffer_size=..., exploration_fraction=...)
This is the line of code that creates the DQN agent. Each part in the parentheses is an instruction for how the agent will learn.
4.4.1 policy
Here we choose what kind of brain we want.
In most cases, you choose “MlpPolicy”: a brain made of simple, interconnected neurons.
If the agent learns from images (as in video games), you use “CnnPolicy” instead. It is an artificial brain that “sees” images.
Think of it as choosing whether our agent is “good at thinking” (Mlp) or “good at seeing” (Cnn). For MountainCar, we choose MlpPolicy, because the observation consists only of numbers (position and velocity).
4.4.2 env
This is where we tell it where to train.
We’re using the MountainCar environment, so it’s like saying: “Hey, you’re going to train on the MountainCar hill.”
This is the exact same env that we created earlier with gym.make("MountainCar-v0").
4.4.3 learning_rate
This controls how fast the agent learns.
- If it is too high -> it learns too fast and makes big mistakes.
- If it is too low -> it learns very slowly and takes forever to catch on.
Usually, for DQN, we choose a learning_rate of 0.0001.
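The effect of the step size is easiest to see on a toy problem. The sketch below runs plain gradient descent on f(w) = w², which has nothing to do with DQN itself; the three learning rates are chosen only to show the "too high", "reasonable" and "too low" cases on this toy function and say nothing about which value suits a neural network.

```python
def gradient_descent(learning_rate, steps=50, start=1.0):
    """Minimize f(w) = w**2 (gradient 2*w) and return the final w."""
    w = start
    for _ in range(steps):
        w -= learning_rate * 2 * w
    return w


for lr in (1.1, 0.1, 0.0001):
    print(f"learning_rate={lr}: final w = {gradient_descent(lr):.6g}")
```

With 1.1 the value overshoots the minimum on every step and grows without bound, with 0.1 it converges close to 0, and with 0.0001 it has barely moved after 50 steps.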
4.4.4 gamma (γ – discount factor)
This tells how much the future matters to the agent.
If gamma is high (e.g. 0.99), the agent thinks about the future and prefers actions that pay off in the long term. If gamma is low (e.g. 0.5), it focuses on immediate rewards.
I wrote a dedicated article for this “Discount Factor Explained – Why Gamma (γ) Makes or Breaks Learning.” There I explain exactly why this number can completely change the way the agent learns.
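A quick calculation shows what this means for MountainCar. The sketch below compares a hypothetical episode that reaches the goal in 120 steps with one that runs the full 200 steps, using the −1 per step reward from STEP 3; the episode lengths are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one episode."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


slow = [-1.0] * 200   # never reaches the goal
fast = [-1.0] * 120   # hypothetical episode that reaches the goal in 120 steps

for gamma in (0.99, 0.5):
    gap = discounted_return(fast, gamma) - discounted_return(slow, gamma)
    print(f"gamma={gamma}: advantage of the faster episode = {gap:.4f}")
```

With gamma = 0.99 the faster episode is clearly better from the agent's point of view. With gamma = 0.5 the two returns are practically identical, so the agent has almost no signal telling it that finishing earlier is worth anything.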
4.4.5 buffer_size (Replay Buffer)
It’s like a memory where the agent remembers past experiences (states, actions, rewards).
Instead of learning only from the last attempt, the agent also learns from what it did before.
This helps it learn more stably and correctly.
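Conceptually, the replay buffer is a fixed-size queue with random sampling. The class below is a minimal sketch of that idea in plain Python; it is not the buffer used by Stable Baselines3, and the tiny buffer_size of 3 is only there to make the "oldest entries are dropped" behavior visible.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size memory of transitions; the oldest entries are dropped first."""

    def __init__(self, buffer_size):
        self.storage = deque(maxlen=buffer_size)

    def add(self, state, action, reward, next_state, terminated):
        self.storage.append((state, action, reward, next_state, terminated))

    def sample(self, batch_size):
        """Draw a random mini-batch, mixing old and recent experiences."""
        return random.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


buffer = ReplayBuffer(buffer_size=3)
for t in range(5):
    buffer.add(state=t, action=0, reward=-1.0, next_state=t + 1, terminated=False)

print(len(buffer))                                        # 3
print([transition[0] for transition in buffer.storage])   # [2, 3, 4]
print(buffer.sample(2))                                   # two random transitions
```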
4.4.6 exploration_fraction
This controls how much the agent experiments.
At first, the agent knows nothing, so it has to try many different actions (exploration).
As it learns, it starts doing the things that brought it good rewards more and more often (exploitation).
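exploration_fraction sets the fraction of the total training steps over which the exploration rate (epsilon) decays from its initial value to its final value. After that, epsilon stays at the final value.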
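The schedule can be written down as a short function. This is a sketch of a linear decay, which is how I understand the Stable Baselines3 schedule to behave; the default values mirror the ones used in the full script in this tutorial (exploration_fraction=0.3, initial epsilon 1.0, final epsilon 0.05), and epsilon_at is a helper name invented for this example.

```python
def epsilon_at(step, total_timesteps, exploration_fraction=0.3,
               initial_eps=1.0, final_eps=0.05):
    """Linearly decay epsilon during the first exploration_fraction of training."""
    decay_steps = exploration_fraction * total_timesteps
    progress = min(step / decay_steps, 1.0)
    return initial_eps + progress * (final_eps - initial_eps)


for step in (0, 120_000, 240_000, 800_000):
    print(step, round(epsilon_at(step, total_timesteps=800_000), 3))
```

With 800,000 total timesteps, epsilon falls from 1.0 to 0.05 during the first 240,000 steps and then stays at 0.05 for the rest of training.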
STEP 5: Training the Agent and Tracking Progress
So far we have covered the application, the algorithm, and how to create a DQN agent. In this step, I will explain how to run the training and how to interpret what is happening from the console logs and TensorBoard.
5.1 What happens when training starts
What happens internally when model.learn() starts?
- the agent observes the state (position, velocity),
- chooses an action (left/right/nothing),
- receives reward,
- saves the transition in the replay buffer,
- uses batches from the buffer to update the network.
All of these steps are done as a result of running the following function from our Python file. When you run this function, imagine the car trying over and over again to climb the hill. Each episode teaches it a little more about how to use gravity, momentum, and timing.
def train_agent(model: DQN, total_timesteps: int = 200_000) -> None:
    """Train the DQN agent."""
    print("\n Starting training...\n")
    model.learn(total_timesteps=total_timesteps, log_interval=10)
    model.save("dqn_mountaincar_model")
    print("\nTraining complete. Model saved as 'dqn_mountaincar_model.zip'")
Line by line function explanation
def train_agent(model: DQN, total_timesteps: int = 200_000) -> None:
This is a function that starts the training of the agent.
The model parameter is the DQN agent built in the previous step.
total_timesteps represents how long the agent will train (the total number of experience steps in the environment). Here 200_000 is the default value. When we run the Python file, we can pass a different value for total_timesteps from the command line. Then the function will use that new value instead of the default one.
model.learn(total_timesteps=total_timesteps, log_interval=10)
This is the most important line in the function. This is where the internal learning loop in Stable Baselines3 starts.
model.learn() does everything automatically:
- the agent interacts with the environment,
- collects experiences (state, action, reward, next_state),
- saves transitions in the Replay Buffer,
- extracts mini-batches and updates the neural network,
- periodically synchronizes the target network,
- and logs the metrics (ep_rew_mean, loss, exploration_rate) at every interval.
The parameter log_interval=10 means that every 10 episodes, SB3 will display the progress in the console (average reward, episode length, etc.).
model.save("dqn_mountaincar_model")
After the training is complete, this line saves the trained model to a ZIP file.
The file contains:
- the weights of the neural network,
- the optimizer state,
- information about the policy and environment.
We can later load this file for demo or fine-tuning.
5.2 Project Structure
Save the file below as dqn_mountaincar.py inside a new folder (for example: C:\Users\<your_name>\mountainCar). This will keep all training logs and models neatly organized.
"""
DQN Training and Demo Script for MountainCar-v0
------------------------------------------------
Train or test a Deep Q-Learning agent using Stable Baselines3 (SB3).

Usage:
    python dqn_mountaincar.py --train --timesteps 800_000   # Train a new agent
    python dqn_mountaincar.py --demo                        # Run the trained agent demo

Author: Calin Dragos George
Created: 2025-11-10
"""
import argparse

import gymnasium as gym
from stable_baselines3 import DQN
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.logger import configure


def create_env(env_name: str = "MountainCar-v0", render: bool = False) -> gym.Env:
    """Create and wrap the Gymnasium environment."""
    if render:
        env = gym.make(env_name, render_mode="human")
    else:
        env = gym.make(env_name)
    env = Monitor(env)
    return env


def create_model(env: gym.Env) -> DQN:
    """Initialize the DQN model with core hyperparameters."""
    model = DQN(
        policy="MlpPolicy",
        env=env,
        learning_rate=1e-4,             # optimizer step size
        gamma=0.99,                     # discount factor (see article on gamma)
        buffer_size=50_000,             # replay buffer size
        learning_starts=1_000,          # delay before updates start
        batch_size=128,                 # batch size for gradient step
        train_freq=4,                   # train every n steps
        target_update_interval=2_000,   # how often to sync target network
        exploration_fraction=0.3,       # fraction of total steps for epsilon decay
        exploration_initial_eps=1.0,    # start with full exploration
        exploration_final_eps=0.05,     # minimum exploration
        verbose=1,
        tensorboard_log="./logs/"
    )
    return model


def train_agent(model: DQN, total_timesteps: int = 200_000) -> None:
    """Train the DQN agent."""
    print("\n Starting training...\n")
    model.learn(total_timesteps=total_timesteps, log_interval=10)
    model.save("dqn_mountaincar_model")
    print("\nTraining complete. Model saved as 'dqn_mountaincar_model.zip'")


def demo_agent(env_name: str = "MountainCar-v0", n_episodes: int = 10) -> None:
    """Run the trained agent for multiple demo episodes."""
    print(f"\n Starting demo mode for {n_episodes} episodes...\n")
    env = create_env(env_name, render=True)
    model = DQN.load("dqn_mountaincar_model", env=env)
    total_rewards = []
    for episode in range(n_episodes):
        obs, _ = env.reset()
        done = False
        episode_reward = 0
        while not done:
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            episode_reward += reward
        total_rewards.append(episode_reward)
        print(f"Episode {episode + 1} | Reward: {episode_reward:.2f}")
    mean_reward = sum(total_rewards) / len(total_rewards)
    print(f"\n Demo finished | Mean reward over {n_episodes} episodes: {mean_reward:.2f}")
    env.close()


def parse_args() -> argparse.Namespace:
    """Parse command line arguments."""
    parser = argparse.ArgumentParser(description="Train or run DQN agent for MountainCar.")
    parser.add_argument("--train", action="store_true", help="Train the DQN agent")
    parser.add_argument("--demo", action="store_true", help="Run demo with trained agent")
    parser.add_argument("--timesteps", type=int, default=200_000, help="Total timesteps for training")
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    logger = configure("./logs/", ["stdout", "tensorboard"])
    if args.train:
        env = create_env()
        model = create_model(env)
        train_agent(model, total_timesteps=args.timesteps)
    elif args.demo:
        demo_agent()
    else:
        print("\n Please specify a mode:")
        print(" python dqn_mountaincar.py --train (to train a new agent)")
        print(" python dqn_mountaincar.py --demo (to run a demo)\n")
Before training the agent, make sure you start a Conda environment, and your files are organized like this:
mountainCar/
│
├── dqn_mountaincar.py          ← main script (training + demo)
├── logs/                       ← TensorBoard logs (auto-generated)
└── dqn_mountaincar_model.zip   ← trained model (created after training)
5.3 Running the training
This is how to run the training:
python dqn_mountaincar.py --train --timesteps 800000
Explanation:
- --train → starts training mode,
- --timesteps → total number of steps (e.g. 800K for good results).
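Note that the script saves the model only after learn() finishes, and --train always starts from a new model. To continue training a saved agent, load it with DQN.load() and call learn() again.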
5.4 What the console output means during training
When you run the script in training mode, the console displays logs like this:

rollout/
| PARAMETER | MEANING |
|---|---|
| ep_len_mean | The average episode length (how many steps the agent takes before the episode ends). Example: 149 means that, on average, each episode lasts 149 steps. Shorter episodes usually mean better performance. |
| ep_rew_mean | The average reward per episode. This shows if the agent is improving. Here it’s -149, which is good for MountainCar (close to solving the task). |
| exploration_rate | The current epsilon (ε) used for exploration. 0.05 means the agent is still taking random actions 5 % of the time to keep exploring. |
time/
| PARAMETER | MEANING |
|---|---|
| episodes | Total number of episodes completed so far (4330). |
| fps | Frames per second. How fast the training loop is running. Higher = faster training (depends on CPU/GPU speed). |
| time_elapsed | Total training time in seconds (642 s ≈ 10 min). |
| total_timesteps | The total number of interactions between the agent and the environment. Each “timestep” = one action taken and processed. |
train/
| PARAMETER | MEANING |
|---|---|
| learning_rate | The current step size for updating the neural network (0.0001). Smaller values = slower but more stable learning. |
| loss | The training loss of the neural network. How far predictions are from target Q-values. Smaller and stable values mean the model is converging. |
| n_updates | How many times the neural network has been updated so far (≈ 195 806). |
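The n_updates value can be related to the number of timesteps with a rough estimate. The sketch below uses learning_starts=1_000 and train_freq=4 from the script, and assumes one gradient step per training call (which I believe is the SB3 default for gradient_steps); treat the result as an approximation, not the exact counter.

```python
def expected_updates(total_timesteps, learning_starts=1_000, train_freq=4, gradient_steps=1):
    """Approximate number of gradient updates after a given number of environment steps."""
    training_steps = max(total_timesteps - learning_starts, 0)
    return (training_steps // train_freq) * gradient_steps


print(expected_updates(800_000))   # 199750
print(expected_updates(784_224))   # 195806
```

Under this estimate, the roughly 195,806 updates in the table correspond to about 784,000 timesteps, that is, a run that is close to the end of an 800K training.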
5.5 Saving the trained model
model.save("dqn_mountaincar_model")
The saved file stores the network weights, the agent's hyperparameters, and the definitions of the observation and action spaces.
The dqn_mountaincar_model.zip file can be loaded later for demo or fine-tuning.
5.6 Tracking progress with TensorBoard

Note: If you don’t have TensorBoard installed, follow the steps in this tutorial: How to Install OpenAI Gymnasium in Windows and Launch Your First Python RL Environment
This curve in TensorBoard shows exactly the kind of realistic learning I want to show in this tutorial about DQN. It clearly illustrates how the agent goes through successive phases of exploration, progress, and temporary instability before stabilizing.
The curve represents ep_rew_mean (average reward per episode) as a function of the number of timesteps.
The values are between −200 (total failure) and −140 (good performance). The goal is to see if the agent learns an increasingly efficient policy (the curve increases in the long run).
What the shape of the curve tells us
It shows:
- a long exploration phase,
- a clear discovery of strategy,
- a few stabilizing oscillations,
- and a partial convergence towards a viable policy.
This pattern is typical of deep reinforcement learning. DQN is known for such “learning waves”, because its updates depend on the contents of the replay buffer and on the periodic synchronization between the policy and target networks.
STEP 6: Running the Trained Agent (Demo Mode)

After training, it’s time to see what the agent has actually learned.
In this step, I’ll show you how to load the trained model, run it in demo mode, and visualize how it moves inside the MountainCar environment.
6.1 Loading the trained model
model = DQN.load("dqn_mountaincar_model", env=env)
What this line does:
- Loads the file saved at the end of the training,
- The file contains the network weights and agent configuration.
6.2 Setting up the environment for rendering
env = gym.make("MountainCar-v0", render_mode="human")
This Python line:
- creates the environment with an active view,
- without render_mode="human", you will not see the graphics window,
- Gymnasium renders the car’s movement and the agent’s choices in real time.
6.3 Predicting actions and running episodes
The main logic of the demo loop:
obs, _ = env.reset()
done = False
while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
What needs to be emphasized here is that:
- model.predict() -> the agent chooses an action based on the trained neural network,
- deterministic=True -> the agent stops exploring; it always picks the action with the highest predicted Q-value,
- env.step(action) -> the environment executes the action and returns the next state + reward.
During training, the agent sometimes takes random actions to explore.
During the demo, it acts deterministically, relying only on what it has learned.
6.4 Multiple demo episodes
We can run multiple consecutive episodes:
for episode in range(10):
    ...
The goal is to see the consistency of behavior.
- A good agent will succeed in 7 – 9 out of 10 episodes,
- An unstable agent will succeed in only a few.
In my tests, the trained agent succeeded in most runs, with rewards around −143 to −150, proving that it learned to use gravity to reach the goal.
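To turn a list of demo rewards into a success count, you can use the fact that an episode costs −1 per step and is cut off at 200 steps. The helper below is a small sketch; the rewards in the example are hypothetical, and an episode that reaches the goal exactly on step 200 would be counted as a failure by this simple rule.

```python
def summarize_demo(episode_rewards, max_steps=200):
    """Return (success_count, mean_reward) for a list of MountainCar episode rewards."""
    # A reward above -max_steps means the episode ended before the step limit.
    successes = sum(1 for r in episode_rewards if r > -max_steps)
    mean_reward = sum(episode_rewards) / len(episode_rewards)
    return successes, mean_reward


# Hypothetical rewards from four demo episodes.
rewards = [-143.0, -150.0, -200.0, -147.0]
print(summarize_demo(rewards))   # (3, -160.0)
```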
6.5 Running the demo
The Python file above contains both the training and demo modes. Here’s how to run the demo:
python dqn_mountaincar.py --demo
6.6 Example output


These scores correspond directly to the reward curve you saw in the TensorBoard graphic. Some episodes succeed, others fail. But overall, the trend shows consistent learning.
Generalizing to Other Environments
In this part of the tutorial, I’ll show you how to reuse the exact same pipeline for CartPole, Acrobot, LunarLander, etc.
The architecture used in this tutorial and the code can be reused with minimal modification. So the Python file for DQN is universal, not built just for MountainCar.
DQN is a model-free algorithm, so it does not depend on the physics of the environment, but only on the interface (state, action, reward).
How to switch environments
To change the environment, we need to change a single line.
For MountainCar we have the line with this environment:
env = gym.make("MountainCar-v0")
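To change the environment, we just need to modify this line (in the full script, this is the default env_name in create_env and demo_agent). For example, if we want to change from MountainCar to LunarLander, which also needs Gymnasium's Box2D extra and is registered as LunarLander-v3 in Gymnasium 1.0 and later, the line becomes: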
env = gym.make("LunarLander-v2")
Everything else (the DQN structure, training, and tracking) remains exactly the same.
Typical parameter changes
Depending on the training environment, we need to make small adjustments:
| PARAMETER | WHAT WE MODIFY | WHY WE MODIFY |
|---|---|---|
| learning_rate | 3e-4 for CartPole | small networks learn quickly |
| gamma | 0.95 for CartPole, 0.99 for Lander | difference between short/long tasks |
| total_timesteps | 200K – 1M | depends on the complexity |
| buffer_size | 50K – 200K | for more complex tasks |
The more complex the environment, the longer the training and the larger the replay buffer should be.
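One way to keep these adjustments organized is a small table of per-environment overrides on top of a shared base configuration. The sketch below is only a suggestion: BASE_PARAMS repeats three values from the MountainCar script, ENV_OVERRIDES contains only the values listed in the table above, and params_for is a hypothetical helper. The environment ids are the ones used in this tutorial and may differ in your Gymnasium version.

```python
# Shared defaults, taken from the MountainCar script.
BASE_PARAMS = {"learning_rate": 1e-4, "gamma": 0.99, "buffer_size": 50_000}

# Per-environment changes, limited to the values from the table.
ENV_OVERRIDES = {
    "CartPole-v1": {"learning_rate": 3e-4, "gamma": 0.95},
    "LunarLander-v2": {"gamma": 0.99},
}


def params_for(env_name):
    """Merge the base hyperparameters with the overrides for one environment."""
    return {**BASE_PARAMS, **ENV_OVERRIDES.get(env_name, {})}


print(params_for("CartPole-v1"))
print(params_for("MountainCar-v0"))
```

The resulting dictionary could then be passed to the DQN constructor as keyword arguments, next to the policy and env arguments.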
Key Takeaways and Next Steps
In this tutorial, we built, trained, and visualized a Deep Q-Learning agent from scratch using PyTorch, Gymnasium, and Stable Baselines3.
We started from the MountainCar problem, created the environment, and configured the DQN agent. We then trained the model and finally watched it succeed through trial and error.
In this tutorial:
- You learned how DQN replaces Q-tables with a neural network,
- You saw how experience replay and target networks stabilize learning,
- You tracked progress with TensorBoard and interpreted the learning curve,
- You ran and visualized your own trained agent,
- You now have a complete DQN pipeline reusable for any Gymnasium environment.
After you download the complete code, you can experiment too. You can upgrade the algorithm, for example with Double DQN, which reduces the overestimation of Q-values, or with Dueling DQN.
Or you can upgrade the environment: train on LunarLander-v2 (LunarLander-v3 in Gymnasium 1.0 and later) or CartPole-v1.
Reinforcement learning is not just about algorithms. It’s about persistence.
Like the MountainCar, every failed episode teaches you something that brings you closer to the goal.
Keep experimenting, keep training.





