This page was last edited on 24 April 2025
Structure of the tutorial
This tutorial has six parts:
→ PART 1: Overview of the tutorial
→ PART 3: Markov Decision Process (MDP)
→ PART 4: Choosing the Algorithm (DQN)
→ PART 5: Environment + RL Model + Reward Function
→ PART 6: Training + Testing + Google Colab Access ← you are here
Training the Model
This is where learning actually happens. At this step we let the agent interact with the environment. It tries different actions, observes results, stores experiences, and learns from them.
We are using experience replay, mini-batches, and a stable target network to make the training process robust.
Key Concepts
- Episodes & Steps
  - Training is divided into episodes.
  - Each episode has multiple steps.
  - At each step, the agent observes the state, takes an action, gets a reward, and updates the model.
- Epsilon-Greedy Policy
  - In early episodes, the agent mostly explores (random actions).
  - Over time, it increasingly exploits what it has learned (chooses the action with the highest Q-value).
  - Controlled by epsilon, the exploration rate, which is lowered as training progresses.
- Experience Replay
  - Instead of learning from recent steps only, we store past experiences in a buffer.
  - During training, we randomly sample from this buffer.
  - This breaks correlations between consecutive steps and stabilizes training.
- Mini-Batch Learning
  - We don’t update the model from a single experience. We sample a mini-batch of experiences, which makes each update less noisy and the training more stable.
- Target Network (for DQN)
  - The agent uses a second network (target network) to calculate the target Q-values. Its weights are synced with the main network only periodically, so the targets do not shift at every update.
- Loss Function and Backpropagation
  - The goal is to reduce the difference between predicted Q-values and target Q-values. We use Mean Squared Error (MSE) loss. The model updates weights using backpropagation and an optimizer like Adam.
- Training Loop
  - Run for a number of episodes. Save the best-performing model. Log the total reward per episode to monitor learning progress.
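The concepts above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the code from the Colab project: the names `epsilon_by_episode`, `select_action`, `ReplayBuffer`, `td_target`, and `mse_loss`, as well as the decay values, are hypothetical and chosen only for this example.

```python
import random
from collections import deque


def epsilon_by_episode(episode, eps_start=1.0, eps_end=0.05, decay=0.99):
    """Decay epsilon exponentially per episode, never going below eps_end.

    The start, end, and decay values are illustrative assumptions.
    """
    return max(eps_end, eps_start * decay ** episode)


def select_action(q_values, epsilon, rng=random):
    """Epsilon-greedy: random action with probability epsilon, else the best one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


class ReplayBuffer:
    """Fixed-size buffer; the oldest experiences are dropped when it is full."""

    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        # Uniform random sampling breaks the correlation between consecutive steps.
        return rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


def td_target(reward, next_q_values, done, gamma):
    """Target Q-value: r for terminal steps, else r + gamma * max Q_target(s', a')."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def mse_loss(predicted, targets):
    """Mean squared error between predicted and target Q-values."""
    return sum((p - t) ** 2 for p, t in zip(predicted, targets)) / len(predicted)
```

Here `next_q_values` stands for the output of the target network on the next state. With γ = 0, as chosen in the pseudocode further down, `td_target` reduces to the immediate reward.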
Testing
At this step we stop training and freeze the model weights. We have to evaluate how well the agent has learned to recognize the digit 3. No exploration. Only exploitation. This step validates everything we’ve done so far.
We use the trained policy to make predictions, then compare the predicted actions with ground truth labels.
This gives us the real performance of the model.
Key Concepts
Deterministic Policy
During testing, we use the best action directly from the model. No randomness. No epsilon. The agent picks argmax(Q(state)).
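As a rough sketch of what this looks like in code, the snippet below picks the greedy action for each test state. The `q_function` argument is a hypothetical stand-in for the trained network: any callable that maps a state to a list of Q-values, one per action (0 = NO, 1 = YES).

```python
def greedy_action(q_values):
    """Return the index of the largest Q-value (lowest index wins on ties)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def predict_all(q_function, states):
    """Apply the deterministic policy to every state. No epsilon, no randomness."""
    return [greedy_action(q_function(state)) for state in states]


# Hypothetical stand-in for a trained model, used only to make the example runnable.
def fake_q_function(state):
    return [0.1, 0.8] if state == "looks_like_3" else [0.6, -0.2]


print(predict_all(fake_q_function, ["looks_like_3", "other_digit"]))  # prints [1, 0]
```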
Evaluation Dataset
We test on a separate dataset that the agent never saw during training. This shows whether the model generalizes to new images instead of memorizing the training set. Never test on training data.
Evaluation Metrics
Since this is a classification-style task (digit recognition), we compute:
- Accuracy → % of correct predictions
- Precision → of the images predicted as 3, the fraction that really are 3
- Recall → of the actual 3s, the fraction the agent detects
- F1 Score → harmonic mean of precision and recall
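These four metrics can be computed directly from the predicted actions and the ground truth labels. The function below is a minimal sketch, assuming both are lists of 0/1 values where 1 means "this is a 3"; the name `classification_metrics` is made up for this example, and returning 0.0 when a denominator is zero is one possible convention.

```python
def classification_metrics(predictions, labels, positive=1):
    """Compute accuracy, precision, recall, and F1 for a binary task."""
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p == positive and y == positive)
    fp = sum(1 for p, y in pairs if p == positive and y != positive)
    fn = sum(1 for p, y in pairs if p != positive and y == positive)
    correct = sum(1 for p, y in pairs if p == y)

    accuracy = correct / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up predictions and labels, only to show the call.
metrics = classification_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
print(metrics)
```

Because images of the digit 3 are a minority among all digits, accuracy alone can look good even for an agent that always answers NO, which is why precision, recall, and F1 are reported alongside it.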
Pseudocode
# 1. Problem Definition
DEFINE Agent
DEFINE Environment
DEFINE interaction loop
DEFINE evaluation metrics (Accuracy, Precision, Recall, F1)
# 2. MDP
DEFINE MDP as tuple (S, A, R, P, γ, π)
S = all 28×28 grayscale digit images
A = {0 = NO, 1 = YES}
R = +1 if prediction is correct, -1 if incorrect
P = deterministic transition to next image
γ = 0 # No long-term reward needed
π = learned policy from DQN
# 3. The Algorithm
SELECT Deep Q-Network (DQN)
DQN Steps:
- Predict Q-values
- Select action (epsilon-greedy)
- Observe reward and next_state
- Store experience
- Sample minibatch
- Compute target Q
- Update network weights
- Sync target network periodically
# 4. Simulation Environment
METHOD reset():
    - return random image as initial state
METHOD step(action):
    - compare action to ground truth
    - return next_state, reward, done, info
# 5. Defining the RL Model
DEFINE NeuralNetwork:
- INPUT: 1 x 28 x 28 normalized grayscale image
- LAYERS:
- OUTPUT: Raw Q-values (no activation)
# 6. Implementing the Reward Function
IF action matches true label:
    reward = +1
ELSE:
    reward = -1
# 7. Training the Model
FOR each episode:
    state = env.reset()
    done = False
    WHILE not done:
        action = select from policy(state)   # epsilon-greedy
        next_state, reward, done, info = env.step(action)
        store experience(state, action, reward, next_state, done)
        sample minibatch
        compute Q targets
        update Q-network
        sync target network periodically
        state = next_state
    decay epsilon
# 8. Testing
FOR each test image:
    action = predict with trained model
    compare action with true label
    update evaluation metrics
OUTPUT:
    Accuracy, Precision, Recall, F1 Score
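To make sections 4 and 6 of the pseudocode more concrete, here is a small runnable sketch of an environment with `reset()` and `step(action)`. It is an illustration under assumptions, not the Colab implementation: the class name `DigitThreeEnv`, the fixed `episode_length` that decides when `done` becomes true, and the contents of `info` are all hypothetical, since the pseudocode does not specify them.

```python
import random


class DigitThreeEnv:
    """Toy environment: each state is an image; the action says "is this a 3?"."""

    def __init__(self, images, labels, episode_length=10, seed=None):
        self.images = images          # any sequence of image-like objects
        self.labels = labels          # digit labels (0-9), same order as images
        self.episode_length = episode_length  # assumed: episode ends after N steps
        self.rng = random.Random(seed)
        self.index = 0
        self.steps = 0

    def reset(self):
        """Start a new episode and return a random image as the initial state."""
        self.steps = 0
        self.index = self.rng.randrange(len(self.images))
        return self.images[self.index]

    def step(self, action):
        """Compare the action (0 = NO, 1 = YES) to the ground truth."""
        is_three = 1 if self.labels[self.index] == 3 else 0
        reward = 1 if action == is_three else -1   # reward function from section 6
        self.steps += 1
        done = self.steps >= self.episode_length
        # The next state is simply another random image.
        self.index = self.rng.randrange(len(self.images))
        info = {"true_label": is_three}
        return self.images[self.index], reward, done, info
```

The four return values match the `next_state, reward, done, info` signature used in the pseudocode, so the training loop in section 7 could call this class directly.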
Google Colab Access

Want to try this?
Enter your email address below, and we’ll send you the direct link to the Google Colab project for detecting the digit 3 with Deep Reinforcement Learning.
The project is ready to use and easy to run.
Start learning and testing today.