This page was last edited on 24 April 2025
Structure of the tutorial
This tutorial has six parts:
→ PART 1: Overview of the tutorial
→ PART 2: Problem Definition ← you are here
→ PART 3: Markov Decision Process (MDP)
→ PART 4: Choosing the Algorithm (DQN)
→ PART 5: Environment + RL Model + Reward Function
→ PART 6: Training + Testing + Google Colab Access
What you will learn in part 2
Before writing any code, we must define the problem clearly. Not just what we want to solve, but how the agent should think, act, and improve.
This part of the tutorial focuses on two essential elements:
- Problem Description – What the agent must learn, how it behaves, and how it gets rewarded or penalized.
- Problem Definition – A complete thinking structure, based on established Reinforcement Learning (RL) theory.
You’ll get a full list of questions — and answers — that help transform abstract ideas into working logic.
This section is the blueprint. Without it, training becomes trial and error — but without the learning.
You’ll define:
→ the goal,
→ the environment,
→ the rewards,
→ the risks,
→ the metrics,
→ and how to break the problem into smart, testable steps.
This structure is not just for this application. It works for any Deep RL problem — in robotics, vision, control, finance, real estate, logistics, healthcare, games, and energy systems. One process. Many use cases.
It’s how Deep RL professionals think before they train.
Problem Description
The goal of this application is to use Deep Reinforcement Learning to teach an artificial brain how to recognize the digit 3 in images.
The agent starts blind. It sees an image, but it doesn’t know what the digit is.
It must learn — through rewards and punishment — to tell us if the digit from an image is 3 or not.
This application is not about building the best classifier. This is about teaching a brain to discover what 3 looks like. Just like a human would — try, fail, learn.
If the agent says “3” when the image contains a 3 → reward.
If the agent says “3” when the image does not contain a 3 → penalty.
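If the agent says “not 3” when the image contains a 3 → penalty.
If the agent says “not 3” when the image does not contain a 3 → reward.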
Simple idea. Hard learning process. But this is the core of intelligence: learning by interacting.
Problem Definition
To write this section, I use key concepts from Deep Reinforcement Learning: agent, environment, actions, rewards, feedback loops, state transitions, and performance metrics.
The questions in this Problem Definition pin down what problem we solve, how the agent interacts with the environment, and what success looks like.
Each answer is built using standard elements from RL theory and practice. Without clear answers, the training process becomes unclear, hard to debug, and impossible to scale.
The full structure includes:
→ 1. Problem Objective – What the agent must do and why
→ 2. Problem Complexity and Nature – How hard the problem is, and why RL fits
→ 3. Environment Dynamics – How the environment behaves
→ 4. Constraints – Time, hardware, resources
→ 5. Risk – What can go wrong and how we control that
→ 6. Problem Decomposition – How to split the problem into smaller steps
→ 7. Evaluation Metrics – How we know if the agent is improving
It’s a thinking tool. It turns abstract goals into concrete implementation steps.
1. Problem Objective
1.1 What is the long-term goal of the RL agent?
Response: to learn through experience and interaction how to detect the presence of digit 3 in an image. Whether the image contains only digit 3, other digits, or a mix — the agent must recognize when a 3 is present and when it’s not.
1.2 What task must the agent perform, and in what context?
Response: the agent receives an image that may contain one or more digits. It must decide if the digit 3 is present in that image or not.
Context: binary classification of presence (“YES” for 3, “NO” otherwise), based on visual input.
1.3 What decisions must the agent make, and how frequently?
Response: At every step (one image per step), the agent must choose:
- “YES” if the image contains a 3 (alone or with others)
- “NO” if the image does not contain any 3
One decision per image.
1.4 What metrics will be used to measure RL success?
Response:
- Accuracy: correct YES/NO predictions
- True Positives (TP): images with 3 correctly identified
- True Negatives (TN): images without 3 correctly rejected
- False Positives (FP): images without 3 wrongly classified as YES
- False Negatives (FN): images with 3 missed
- Cumulative reward across episodes
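The counts in this list can be computed directly from the agent's decisions and the true labels. Here is a minimal sketch in plain Python. The function names `confusion_counts` and `accuracy` are illustrative, not part of the tutorial's code, and decisions and labels are assumed to be booleans (True = YES / the image contains a 3).

```python
def confusion_counts(decisions, labels):
    """Count TP, TN, FP, FN.

    decisions: iterable of bool, True means the agent said YES.
    labels:    iterable of bool, True means the image contains a 3.
    """
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for said_yes, has_three in zip(decisions, labels):
        if said_yes and has_three:
            counts["TP"] += 1      # 3 correctly identified
        elif said_yes and not has_three:
            counts["FP"] += 1      # YES on an image without a 3
        elif not said_yes and has_three:
            counts["FN"] += 1      # missed 3
        else:
            counts["TN"] += 1      # image without 3 correctly rejected
    return counts


def accuracy(counts):
    """Share of correct YES/NO decisions; 0.0 if there are no decisions."""
    total = sum(counts.values())
    return (counts["TP"] + counts["TN"]) / total if total else 0.0
```

Cumulative reward is tracked separately, by summing the per-step rewards of each episode.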
1.5 How will we track the agent’s progress across learning phases (exploration, stabilization, long-term optimization)?
Response: We track reward evolution and accuracy over time.
We observe the agent in:
- Early exploration (random answers)
- Gradual stabilization (fewer errors)
- Late optimization (learning complex patterns and mixed digits)
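Raw per-step rewards are noisy, so the three phases are easier to see on a smoothed curve. A minimal sketch, assuming rewards are stored in a Python list. The helper name `moving_average` and the default window of 100 steps are assumptions chosen for illustration.

```python
def moving_average(values, window=100):
    """Trailing mean of the last `window` values at every step.

    Early steps average over fewer values, so the curve starts at step 0.
    """
    averages = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            # Drop the value that just left the window.
            running_sum -= values[i - window]
        averages.append(running_sum / min(i + 1, window))
    return averages
```

With rewards of +1 and -1, a smoothed value near 0 suggests random answers (early exploration), and a value climbing toward +1 suggests stabilization.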
1.6 What rewards and penalties will the agent receive for its actions?
Response:
- +1 if agent says “YES” and the image contains digit 3
- -1 if agent says “YES” and there is no 3
- -1 if agent says “NO” and the image does contain a 3
- +1 if agent says “NO” and the image does not contain 3
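The four rules above collapse into one line: +1 when the answer matches the truth, -1 otherwise. A minimal sketch, where the constants `YES` and `NO` and the function name `reward` are illustrative choices, not names taken from the tutorial's code.

```python
YES, NO = 1, 0  # illustrative encoding of the two actions


def reward(action, has_three):
    """Return +1 for a correct decision, -1 for a wrong one.

    action:    YES or NO
    has_three: True if the image contains the digit 3
    """
    said_yes = action == YES
    return 1 if said_yes == has_three else -1
```

The reward is symmetric: false positives and false negatives cost the same. If one error type should matter more, this is the place to change it.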
2. Problem Complexity and Nature
2.1 Is the action space discrete, continuous, or mixed?
Response: Discrete: {YES, NO}
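A two-action discrete space maps naturally to a network with two outputs, one value per action. A minimal sketch of that mapping. The `Action` enum, the index assignment (NO = 0, YES = 1), and `greedy_action` are assumptions made for illustration.

```python
from enum import IntEnum


class Action(IntEnum):
    NO = 0
    YES = 1


def greedy_action(q_values):
    """Pick the action whose estimated value is highest.

    q_values: sequence of two numbers, indexed by Action.
    Ties resolve to the lower index (NO).
    """
    best_index = max(range(len(q_values)), key=lambda i: q_values[i])
    return Action(best_index)
```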
2.2 Are there constraints (physical, logical, resource-based) that must be respected?
Response: Logical constraint: only one action per image. No physical constraints.
2.3 Is RL justified for this problem? Are there simpler alternatives (e.g., supervised learning, rule-based logic)?
Response: Yes. Supervised learning would solve the classification task more simply, but the goal here is not just to classify. It is to learn to recognize a concept (the digit 3) through trial and error, as part of a didactic Deep Reinforcement Learning pipeline.
2.4 Does the agent need to learn causal relationships or just correlations?
Response: Correlations between visual features and the concept of “3”. Over time, the agent learns that certain pixel patterns → reward.
3. Environment Dynamics
3.1 Is the environment static or does it change over time?
Response: Static. The agent sees different images at each step, but the way those images are generated and how rewards are given stays the same. The environment’s rules don’t change over time.
3.2 Are there stochastic or uncertain elements in the feedback?
Response: No. Reward feedback is deterministic and consistent.
3.3 How will the agent balance exploration and exploitation to learn effectively?
Response: Using ε-greedy strategy:
- High exploration at start
- Gradual shift to exploitation as training progresses and reward increases
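One common way to implement this shift is to decay ε on a fixed schedule over training steps. The sketch below uses a linear decay. The schedule shape, the values 1.0, 0.05 and 10,000, and the function names are all assumptions for illustration, not settings from this tutorial.

```python
import random


def epsilon_at(step, eps_start=1.0, eps_end=0.05, decay_steps=10_000):
    """Linearly decay epsilon from eps_start to eps_end, then hold it."""
    fraction = min(step / decay_steps, 1.0)
    return eps_start + fraction * (eps_end - eps_start)


def select_action(q_values, epsilon, rng=random):
    """Epsilon-greedy choice over a sequence of action values.

    With probability epsilon: pick a random action (explore).
    Otherwise: pick the action with the highest value (exploit).
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])
```

Keeping `eps_end` above zero leaves a small amount of exploration even late in training.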
3.4 Does the environment provide immediate or delayed feedback (rewards)?
Response: Immediate. Reward is given right after each decision.
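The answers in this section describe a very simple environment: fixed rules, deterministic reward, feedback right after each decision. A toy sketch of that loop follows. The class name `DigitThreeEnv`, the `(image, has_three)` sample format, and the `reset`/`step` return values are assumptions for illustration. This is not the environment built later in the tutorial, and it does not follow any particular library's interface.

```python
import random


class DigitThreeEnv:
    """Toy environment: one image per step, one YES/NO decision per image."""

    YES, NO = 1, 0

    def __init__(self, samples, seed=0):
        # samples: list of (image, has_three) pairs; image can be any object.
        self.samples = list(samples)
        self.rng = random.Random(seed)
        self.current = None

    def reset(self):
        """Draw a new image and return it as the observation."""
        self.current = self.rng.choice(self.samples)
        return self.current[0]

    def step(self, action):
        """Score the decision immediately, then move to the next image."""
        _, has_three = self.current
        correct = (action == self.YES) == has_three
        reward = 1 if correct else -1   # deterministic, same rule every step
        next_image = self.reset()
        return next_image, reward
```

Only the images vary from step to step. The sampling rule and the reward rule never change, which is what "static" means here.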
4. Constraints
4.1 Are there time limits for training or inference?
Response: No. Training is offline. Inference is real-time-capable.
4.2 What hardware/software resources are available?
Response: GPU (local or Colab), PyTorch or TensorFlow.
4.3 Does the RL agent need to act in real time or can it compute solutions with delay?
Response: Not during training. Real-time response is not needed while the agent learns. It can take its time to learn, optimize, and update its policy.
However, after training, inference should be fast — ideally real-time — so that the agent can classify images instantly when deployed.
4.4 What decision-making horizon is relevant? (e.g., per frame, per episode, long-term strategy)
Response: Per image (short horizon), with long-term learning over many episodes.
5. Risk
5.1 What risks arise if the agent makes incorrect decisions?
Response: None in this project. The application is educational and offline.
5.2 How do we ensure the agent learns safe and correct behavior?
Response: By defining a clear and consistent reward structure, and by regularly visualizing predictions against ground truth.
5.3 How do we validate the agent before deployment? (e.g., simulation, controlled testing, formal verification)
Response: Use a validation/test image set. Compare performance metrics over unseen images.
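A minimal sketch of that validation step: hold out part of the data before training, then score the trained policy only on the held-out part. The function names, the 20% split, and the `policy(image) -> bool` signature are assumptions for illustration.

```python
import random


def split_samples(samples, test_fraction=0.2, seed=0):
    """Shuffle once with a fixed seed and split into (train, test)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def evaluate(policy, samples):
    """Fraction of samples where policy(image) matches the true label.

    policy:  callable taking an image and returning True (YES) or False (NO)
    samples: list of (image, has_three) pairs the agent never trained on
    """
    if not samples:
        return 0.0
    correct = sum(1 for image, has_three in samples if policy(image) == has_three)
    return correct / len(samples)
```

The fixed seed keeps the split reproducible, so results from different training runs are measured on the same unseen images.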
6. Problem Decomposition
6.1 Can the problem be split into smaller tasks or progressive phases?
Response: Yes:
- Start with single-digit images (3 vs. non-3)
- Move to multi-digit images
- Add noise or distortions later
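These phases can be written down as a small curriculum table, with a rule for when to move on. A sketch under stated assumptions: the phase names, the dictionary keys, and the 0.95 accuracy threshold are invented for illustration and are not values from this tutorial.

```python
# Hypothetical curriculum: one entry per phase, in training order.
PHASES = [
    {"name": "single_digit_clean", "multi_digit": False, "noise": False},
    {"name": "multi_digit_clean", "multi_digit": True, "noise": False},
    {"name": "multi_digit_noisy", "multi_digit": True, "noise": True},
]


def next_phase(phase_index, recent_accuracy, threshold=0.95):
    """Advance one phase when recent accuracy reaches the threshold.

    Stays on the last phase once it is reached.
    """
    is_last = phase_index >= len(PHASES) - 1
    if recent_accuracy >= threshold and not is_last:
        return phase_index + 1
    return phase_index
```

Gating each phase on a measured score makes the progression testable: a phase either met its target or it did not.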
6.2 How can we define intermediate goals to guide the agent?
Response:
- Learn to recognize 3 in clean images
- Learn to reject images without 3
- Learn to detect 3 in noisy or cluttered images
6.3 What system components can be developed and tested independently?
Response:
- The image environment (image generator + label + reward)
- The agent architecture (neural net + policy)
- The training loop
- The evaluation and logging tools
7. Evaluation Metrics
7.1 What metrics will we use to evaluate performance?
Response:
- Accuracy
- Precision and recall for detecting digit 3
- False positive and false negative rates
- Episode reward trend
7.2 How do we ensure these metrics truly reflect the application’s final goal?
Response: Because the goal is to detect the presence of 3, precision and recall are more important than global accuracy. Tracking TP/FP/FN directly maps to the real objective.
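To see why accuracy alone can mislead, take a hypothetical test set of 100 images where only 10 contain a 3. An agent that always answers NO scores 90% accuracy and detects nothing. The sketch below computes the metrics from 7.1 from the four counts. The function name `detection_metrics` is illustrative, and returning 0.0 for an undefined ratio is a convention chosen here.

```python
def _ratio(numerator, denominator):
    """Safe division: returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def detection_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and error rates for detecting digit 3."""
    return {
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "precision": _ratio(tp, tp + fp),            # of all YES answers, how many were right
        "recall": _ratio(tp, tp + fn),               # of all real 3s, how many were found
        "false_positive_rate": _ratio(fp, fp + tn),
        "false_negative_rate": _ratio(fn, fn + tp),
    }


# The always-NO agent from the example above.
always_no = detection_metrics(tp=0, tn=90, fp=0, fn=10)
```

Here `always_no["accuracy"]` is 0.9 while `always_no["recall"]` is 0.0, which is exactly the gap that precision and recall expose.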
Next: Markov Decision Process (MDP)
Now that the problem is defined, we move to the core structure of every RL system: the Markov Decision Process.
This is where theory meets implementation.
You’ll map your problem into 6 elements:
→ states,
→ actions,
→ rewards,
→ transitions,
→ discount factor,
→ and policy.
This is the logic engine behind all Deep RL agents. If you skip this step or do it wrong, the agent will learn… the wrong thing.