Choosing the Algorithm – Application 1

This page was last edited on 13 May 2025

This tutorial has six parts:

 PART 1: Overview of the tutorial 

 PART 2: Problem Definition 

 PART 3: Markov Decision Process (MDP)

 PART 4: Choosing the Algorithm (DQN) ← you are here

 PART 5: Environment + RL Model + Reward Function

 PART 6: Training + Testing + Google Colab Access


In this part, we choose the Deep Reinforcement Learning (RL) algorithm for the task. I explain why I picked it based on the Markov Decision Process (MDP) from Part 3, then break it down step by step, from prediction to learning.

Everything will follow the logic of the application:

Agent sees an image → decides YES or NO → gets feedback → learns from experience.


In this application we will use the Deep Q-Network (DQN) algorithm.

DQN stands for Deep Q-Network. It extends Q-learning with a neural network.

We use it to approximate the Q-function:

Q(s, a) = expected return when taking action a in state s
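
Written out, "expected return" means the expected discounted sum of future rewards. The form below uses the usual textbook conventions, which are assumptions here rather than notation defined in this tutorial: π is the policy being followed, t is the time step, and γ is the same discount factor that appears in step 6.

```latex
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} \, r_{t+k} \;\middle|\; s_t = s,\ a_t = a \right]
```

Here r_t is the reward received for taking action a_t in state s_t.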

It’s one of the most common Deep RL algorithms, and it fits this task well: the action space is discrete and small, the state (image) is fixed-size and raw, so we can pass it through a CNN, and the reward is dense with immediate feedback. We don’t need a more complex actor-critic setup.

DQN combines ideas from Q-learning, deep neural networks (here a CNN), experience replay, and a separate target network.

Why DQN fits this MDP:
  • Action space (A): discrete → YES or NO → perfect for Q-values
  • State space (S): raw image → use CNN to extract features
  • Reward (R): dense, binary → ideal for Q-value updates
  • Environment: transition function unknown → DQN is model-free, so it doesn’t need one
  • Goal: maximize cumulative reward → classic DQN setup

Together with the CNN, experience replay and the target network keep DQN training stable and reasonably sample-efficient, even when the input is raw visual data. It’s not the best choice for every problem, but it suits this one well.


Here’s how DQN works in this task. The algorithm implementation is split into 8 parts.

Each one maps to what the agent does during learning:

1. Predict

Estimate Q-values for each action, based on the current state (image).
We pass the image through a convolutional neural network (CNN).

Output: two values → Q(s, YES), Q(s, NO)

2. Action

Choose the action using ε-greedy strategy:

  • With probability ε → random action (exploration)
  • With probability 1−ε → choose the action with max Q-value (exploitation)
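
The ε-greedy rule can be sketched in a few lines of plain Python. This is a minimal illustration, not the tutorial's implementation: the function name `epsilon_greedy`, the `ACTIONS` mapping (index 0 = YES, index 1 = NO), and the Q-values are all made up for the example.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a list of Q-values using epsilon-greedy."""
    if rng.random() < epsilon:
        # Explore: pick any action uniformly at random
        return rng.randrange(len(q_values))
    # Exploit: index of the largest Q-value (ties go to the first one)
    return max(range(len(q_values)), key=lambda a: q_values[a])

ACTIONS = ["YES", "NO"]   # hypothetical index-to-label mapping
q = [0.8, 0.3]            # made-up values for Q(s, YES), Q(s, NO)
action = epsilon_greedy(q, epsilon=0.1)
print(ACTIONS[action])    # usually YES, occasionally a random pick
```

In practice ε is often started high and decreased during training, so the agent explores early and exploits later.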

3. Feedback

Apply the chosen action. Environment returns:

  • Reward (R)
  • Next state (next image)

This is the full experience: (s, a, r, s’)

4. Store

Save the experience into a replay buffer. This allows learning from past transitions, not just recent ones. We store many transitions for stable training.

5. Sample

At each learning step, sample a random batch of past transitions from the buffer. Batch size is usually 32 or 64. Random sampling breaks the correlation between consecutive frames.
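
Steps 4 and 5 can be sketched together as a small replay buffer. This is one possible shape using only the standard library. The names `Transition` and `ReplayBuffer`, the capacity, and the string placeholders standing in for images are assumptions for illustration.

```python
import random
from collections import deque, namedtuple

# One stored experience: (s, a, r, s')
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state"])

class ReplayBuffer:
    def __init__(self, capacity):
        # deque with maxlen drops the oldest transition once full
        self.memory = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.memory.append(Transition(state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling without replacement
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)

buffer = ReplayBuffer(capacity=1000)
for i in range(100):
    # Strings stand in for images; rewards are made up
    buffer.push(f"img_{i}", i % 2, float(i % 2 == 0), f"img_{i + 1}")

batch = buffer.sample(32)
print(len(buffer), len(batch))
```

A common pattern is to skip learning until the buffer holds at least one full batch.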

6. Compute

Use the Bellman equation to compute target Q-values:

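y = r + γ * max_a’ Q_target(s’, a’)

Here y is the target value for the experience (s, a, r, s’), and Q_target is a separate target network that we use to compute the second term. If s’ is terminal, the second term is dropped and y = r.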
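
As a rough sketch, the target for a single transition is one line of arithmetic. The function name `td_target`, the `done` flag for terminal states, and the numbers are assumptions for illustration. `next_q_values` stands for the target network's outputs for the next state.

```python
def td_target(reward, next_q_values, gamma, done=False):
    """Bellman target for one transition.

    next_q_values: target-network Q-values for the next state, one per action.
    """
    if done:
        # No future reward after a terminal state
        return reward
    return reward + gamma * max(next_q_values)

# Made-up numbers: reward 1.0, target network outputs [0.5, 2.0] for s'
y = td_target(reward=1.0, next_q_values=[0.5, 2.0], gamma=0.9)
print(y)  # 1.0 + 0.9 * 2.0, about 2.8
```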

7. Update

Minimize the Temporal Difference (TD) error between:

  • Current Q-values from main network
  • Target Q-values from Bellman update

We use gradient descent (usually Adam) to update network weights.
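
To make the quantity being minimized concrete, here is a sketch of the loss over a small batch. Mean squared error is an assumption here (some DQN implementations use the Huber loss instead), and the function name `td_loss` and the numbers are hypothetical. The gradient step itself is left to the deep learning framework's optimizer.

```python
def td_loss(q_taken, targets):
    """Mean squared TD error over a batch.

    q_taken: main-network Q-values for the actions actually taken.
    targets: Bellman targets for the same transitions.
    """
    errors = [y - q for q, y in zip(q_taken, targets)]
    return sum(e * e for e in errors) / len(errors)

# Made-up batch of two transitions
loss = td_loss(q_taken=[0.5, 1.0], targets=[1.0, 1.0])
print(loss)  # (0.5**2 + 0.0**2) / 2 = 0.125
```

The targets are treated as constants during the update: gradients flow only through the main network's Q-values.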

8. Sync

At a fixed interval (every N steps), copy weights from the main network → target network.
This stabilizes training. The targets stay fixed between syncs, so the main network is not chasing a target that moves after every update.
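
The sync step amounts to a periodic copy. In the sketch below, plain dictionaries stand in for network weights. The helper name `maybe_sync`, the interval of 5 steps, and the fake "gradient update" are all assumptions chosen to keep the example small.

```python
import copy

def maybe_sync(step, main_weights, target_weights, sync_every):
    """Return the weights the target network should use after this step."""
    if step % sync_every == 0:
        # Hard update: the target becomes an independent copy of the main network
        return copy.deepcopy(main_weights)
    return target_weights

main = {"w": [0.0, 0.0]}
target = copy.deepcopy(main)

for step in range(1, 11):
    main["w"][0] += 0.1   # stand-in for a gradient update on the main network
    target = maybe_sync(step, main, target, sync_every=5)

print(target == main)     # True: the last sync happened at step 10
```

The deep copy matters: if both networks shared the same weight objects, every update to the main network would also change the targets.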


Now that you know the algorithm, it’s time to build the system around it.

This is where theory becomes code.

In Part 5, you’ll create:

→ the simulation environment,
→ the CNN-based Q-network,
→ and the reward function that drives learning.

You’ll define how the agent sees the world, how it makes decisions, and how it gets rewarded.

No abstractions. Just real components, ready to train.


Part 5: Environment + RL Model + Reward Function