This page was last edited on 13 May 2025
Structure of the tutorial
This tutorial has six parts:
→ PART 1: Overview of the tutorial
→ PART 3: Markov Decision Process (MDP)
→ PART 4: Choosing the Algorithm (DQN) ← you are here
→ PART 5: Environment + RL Model + Reward Function
→ PART 6: Training + Testing + Google Colab Access
What you will learn in part 4
In this part, we choose the deep reinforcement learning (deep RL) algorithm for the task. I explain why I picked it based on the Markov Decision Process (MDP) from Part 3. Then I break it down step by step, from prediction to learning.
Everything will follow the logic of the application:
Agent sees an image → decides YES or NO → gets feedback → learns from experience.
Choosing the algorithm
In this application we will use the Deep Q-Network (DQN) algorithm, often called deep Q-learning.
DQN extends Q-learning with a neural network.
We use it to approximate the Q-function:
Q(s, a) = expected return when taking action a in state s
It’s one of the most common Deep RL algorithms. And it fits perfectly here — the action space is discrete and small. The state (image) is fixed-size and raw, so we can pass it through a CNN. Reward is dense, feedback is immediate. We don’t need a complex actor-critic setup.
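To make "expected return" concrete, here is a minimal sketch that assumes the return is the discounted sum of rewards, using the same discount factor γ that appears in the Bellman target later in this part. The reward sequence and the function name `discounted_return` are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, each discounted by gamma per step into the future."""
    total = 0.0
    # Walk backwards so each step applies G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical reward sequence: correct, wrong, correct
example_return = discounted_return([1.0, -1.0, 1.0], gamma=0.9)
print(example_return)
```

Q(s, a) is the expectation of this quantity over everything that can happen after taking action a in state s.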
DQN combines ideas from:
- Q-learning – classic reinforcement learning algorithm;
- Function approximation – uses a CNN to estimate Q-values;
- Experience replay – stores past transitions and samples batches randomly;
- Target network – stabilizes learning by delaying updates to the target Q-function;
- Bellman equation – computes Q-targets for learning;
- Temporal difference (TD) error minimization – updates the Q-network using gradient descent;
- ε-greedy policy – balances random exploration and greedy exploitation.
Why DQN fits this MDP:
- Action space (A): discrete → YES or NO → perfect for Q-values
- State space (S): raw image → use CNN to extract features
- Reward (R): dense, binary → ideal for Q-value updates
- Environment: transition function unknown → DQN is model-free, so it doesn’t need to know it
- Goal: maximize cumulative reward → classic DQN setup
The components listed earlier (experience replay, the target network, TD updates) make DQN efficient, stable, and scalable, even when the input is raw visual data. It’s not the best choice for every problem, but it suits this one well.
DQN Structure – Step by Step
Here’s how DQN works in this task. The algorithm implementation is split into 8 parts.
Each one maps to what the agent does during learning:
1. Predict
Estimate Q-values for each action, based on the current state (image).
We pass the image through a convolutional neural network (CNN).
Output: two values → Q(s, YES), Q(s, NO)
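A minimal sketch of the predict step in plain Python, assuming a tiny network: one convolution layer, ReLU, global average pooling, and a linear head with one output per action. The weights are random and untrained. The 8x8 image, the four 3x3 filters, and the names `conv2d_valid` and `predict_q` are illustrative assumptions, not the network built in Part 5.

```python
import random

ACTIONS = ("YES", "NO")


def conv2d_valid(image, kernel):
    """Slide a square kernel over a 2D image (no padding, stride 1)."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out


def predict_q(image, kernels, head_weights, head_bias):
    """Conv -> ReLU -> global average pool -> linear head (one output per action)."""
    features = []
    for kernel in kernels:
        fmap = conv2d_valid(image, kernel)
        activated = [max(0.0, v) for row in fmap for v in row]
        features.append(sum(activated) / len(activated))
    return [
        sum(w * f for w, f in zip(weights, features)) + b
        for weights, b in zip(head_weights, head_bias)
    ]


rng = random.Random(0)
# Hypothetical 8x8 grayscale image and randomly initialised parameters
image = [[rng.random() for _ in range(8)] for _ in range(8)]
kernels = [[[rng.uniform(-1, 1) for _ in range(3)] for _ in range(3)] for _ in range(4)]
head_weights = [[rng.uniform(-1, 1) for _ in range(4)] for _ in ACTIONS]
head_bias = [0.0 for _ in ACTIONS]

q_values = predict_q(image, kernels, head_weights, head_bias)
print(dict(zip(ACTIONS, q_values)))
```

In practice a deep learning framework would replace these loops. The point is only the shape of the mapping: one image in, one Q-value per action out.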
2. Action
Choose the action using ε-greedy strategy:
- With probability ε → random action (exploration)
- With probability 1−ε → choose the action with max Q-value (exploitation)
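The rule above can be sketched in a few lines. The function name `epsilon_greedy`, the index convention (0 = YES, 1 = NO), and the example Q-values are assumptions for illustration.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index: random with probability epsilon, else the argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit


q_values = [0.2, 0.7]  # hypothetical Q(s, YES), Q(s, NO)
greedy_action = epsilon_greedy(q_values, epsilon=0.0)   # never explores
explored = [epsilon_greedy(q_values, epsilon=1.0, rng=random.Random(0)) for _ in range(20)]
print(greedy_action, explored)
```

A common choice is to start with a high ε and decay it during training, so the agent explores early and exploits later.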
3. Feedback
Apply the chosen action. Environment returns:
- Reward (R)
- Next state (next image)
This is the full experience: (s, a, r, s’)
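As a sketch of this step, here is a toy environment that returns a reward and the next image. Everything in it is hypothetical: the class `ToyImageEnv`, the string placeholders standing in for images, and the +1 / -1 reward. The real environment and reward function are defined in Part 5.

```python
import random
from collections import namedtuple

Transition = namedtuple("Transition", ["state", "action", "reward", "next_state"])


class ToyImageEnv:
    """Hypothetical environment: each sample is an (image, correct_action) pair."""

    def __init__(self, samples, seed=0):
        self.samples = samples
        self.rng = random.Random(seed)
        self.current = self.rng.choice(self.samples)

    def state(self):
        return self.current[0]

    def step(self, action):
        # Assumed reward: +1 if the answer matches the label, -1 otherwise
        reward = 1.0 if action == self.current[1] else -1.0
        self.current = self.rng.choice(self.samples)  # move on to the next image
        return reward, self.current[0]


env = ToyImageEnv([("img_a", 0), ("img_b", 1)])
s = env.state()
a = 0  # the agent answers YES (index 0 by assumption)
r, s_next = env.step(a)
experience = Transition(s, a, r, s_next)
print(experience)
```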
4. Store
Save the experience into a replay buffer. This allows learning from past transitions, not just recent ones. We store many transitions for stable training.
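A replay buffer can be as simple as a fixed-size queue. This is a minimal sketch, assuming the oldest transitions are dropped once the buffer is full. The class name `ReplayBuffer` and the tiny capacity are for illustration only.

```python
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of (s, a, r, s') tuples; oldest entries are evicted first."""

    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.memory.append((state, action, reward, next_state))

    def __len__(self):
        return len(self.memory)


buffer = ReplayBuffer(capacity=3)
for step in range(5):
    # Placeholder strings stand in for images
    buffer.push(f"s{step}", 0, 1.0, f"s{step + 1}")

print(len(buffer), list(buffer.memory))
```

With capacity 3 and 5 pushes, only the three most recent transitions remain. Real buffers are far larger so that samples cover a wide range of past experience.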
5. Sample
At each learning step, sample a random batch of past transitions from the buffer. A typical batch size is 32 or 64. Random sampling breaks the correlation between consecutive transitions.
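A sketch of uniform sampling, assuming transitions are stored as (s, a, r, s') tuples. The function name `sample_batch` and the placeholder data are assumptions.

```python
import random


def sample_batch(transitions, batch_size, rng=random):
    """Uniformly sample distinct transitions and regroup them field by field."""
    batch = rng.sample(list(transitions), batch_size)
    states, actions, rewards, next_states = zip(*batch)
    return states, actions, rewards, next_states


# 100 hypothetical transitions collected in order
transitions = [(f"s{i}", i % 2, 1.0, f"s{i + 1}") for i in range(100)]
states, actions, rewards, next_states = sample_batch(
    transitions, batch_size=32, rng=random.Random(0)
)
print(states[:5])
```

The sampled states come from all over the buffer rather than from one consecutive run, which is the decorrelation effect described above.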
6. Compute
Use the Bellman equation to compute target Q-values:
y = r + γ * max_a’ Q_target(s’, a’)
Here y is the target value for Q(s, a), and Q_target is a separate target network. We use it to compute the second term.
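The target computation for a batch can be sketched as follows. The `dones` flag is my assumption: it marks transitions that end an episode, where the future term is dropped and the target is just r. The numbers stand in for target-network outputs.

```python
def td_targets(rewards, next_q_values, dones, gamma):
    """Bellman target per transition: r + gamma * max over a' of Q_target(s', a')."""
    targets = []
    for r, q_next, done in zip(rewards, next_q_values, dones):
        # No bootstrapping from s' when the episode has ended (assumed convention)
        bootstrap = 0.0 if done else gamma * max(q_next)
        targets.append(r + bootstrap)
    return targets


rewards = [1.0, -1.0]
next_q_values = [[0.5, 2.0], [3.0, 1.0]]  # hypothetical Q_target(s', YES), Q_target(s', NO)
dones = [False, True]
targets = td_targets(rewards, next_q_values, dones, gamma=0.9)
print(targets)
```

The first target is 1 + 0.9 * 2.0, and the second is just the reward because that transition is marked terminal.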
7. Update
Minimize the temporal difference (TD) error between:
- Current Q-values from main network
- Target Q-values from Bellman update
We use gradient descent (usually Adam) to update network weights.
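To show what one update does, here is a sketch with two simplifications that I want to flag: the Q-function is linear (one weight vector per action) instead of a CNN, and the optimizer is plain gradient descent instead of Adam. The loss is half the squared TD error, and the target is treated as a constant.

```python
def q_value(weights, state, action):
    """Linear stand-in for the Q-network: dot product of features and weights."""
    return sum(w * x for w, x in zip(weights[action], state))


def sgd_update(weights, state, action, target, lr):
    """One gradient step on 0.5 * (Q(s, a) - target)^2; returns the TD error."""
    td_error = q_value(weights, state, action) - target
    # Gradient w.r.t. the chosen action's weights is td_error * state
    weights[action] = [w - lr * td_error * x for w, x in zip(weights[action], state)]
    return td_error


weights = [[0.0, 0.0], [0.0, 0.0]]  # one weight vector per action
state, action, target = [1.0, 0.5], 0, 2.0  # hypothetical features and target value
errors = [abs(sgd_update(weights, state, action, target, lr=0.1)) for _ in range(50)]
print(errors[0], errors[-1])
```

The absolute TD error shrinks with each step, and only the weights of the action that was actually taken are changed.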
8. Sync
Every N steps (N is a hyperparameter), copy weights from the main network → target network.
This stabilizes training. Between syncs the targets stay fixed instead of shifting after every gradient update.
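A sketch of the sync schedule, assuming the weights are plain nested lists and a hard copy is made at a fixed interval. The interval `SYNC_EVERY = 100` and the fake gradient update are placeholders, not recommended values.

```python
import copy


def sync_target(main_weights):
    """Return an independent copy so later main-network updates don't leak into it."""
    return copy.deepcopy(main_weights)


SYNC_EVERY = 100  # hypothetical interval
main_weights = [[0.0, 0.0], [0.0, 0.0]]
target_weights = sync_target(main_weights)
sync_count = 0

for step in range(1, 251):
    main_weights[0][0] += 0.01  # stand-in for a gradient update
    if step % SYNC_EVERY == 0:
        target_weights = sync_target(main_weights)
        sync_count += 1

print(sync_count, main_weights[0][0], target_weights[0][0])
```

After 250 steps the target network has been synced twice and still holds the weights from step 200, while the main network has moved on.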
Next: Environment + RL Model + Reward Function
Now that you know the algorithm, it’s time to build the system around it.
This is where theory becomes code.
In Part 5, you’ll create:
→ the simulation environment,
→ the CNN-based Q-network,
→ and the reward function that drives learning.
You’ll define how the agent sees the world, how it makes decisions, and how it gets rewarded.
No abstractions. Just real components, ready to train.