This page was last edited on 24 April 2025
Structure of the tutorial
This tutorial has six parts:
→ PART 1: Overview of the tutorial
→ PART 3: Markov Decision Process (MDP) ← you are here
→ PART 4: Choosing the Algorithm (DQN)
→ PART 5: Environment + RL Model + Reward Function
→ PART 6: Training + Testing + Google Colab Access
What is a Markov Decision Process (MDP)?
In Deep Reinforcement Learning (Deep RL), everything starts with the Markov Decision Process (MDP). It defines how the agent sees the world, how it acts, and how it learns from outcomes.
An MDP is a mathematical framework. It breaks down the learning problem into 5 (sometimes 6) elements.
It helps us model the environment in a way that is clear, repeatable, and trainable.
We use the MDP framework to turn goals such as “learn to detect the number 3” into precise rules, actions, and feedback.
Without an MDP, the agent wouldn’t know what a state is, what actions it can take, or what reward means.
Every Deep RL algorithm is based on this idea: “The agent interacts with an environment, receives a reward, and learns what to do next.”
The MDP gives structure to that interaction. It defines the rules of the game.
Once we define the MDP clearly, we can:
- Build training environments
- Choose the right algorithm (like DQN or PPO)
- Track performance with metrics
- Adjust rewards, states, and actions for better learning
No MDP → no learning.
What each section of the MDP means
1. Objective
What’s the long-term goal of the agent?
What behavior are we trying to learn?
This is where we shape the objective in terms of maximizing reward over time.
It sets the direction for the entire training process.
2. Reward Function
The reward function defines how the agent knows if it did well or poorly.
Every decision is judged with +1, 0, or -1 (or any value we choose). The reward is the only feedback the agent sees.
Good reward design = good learning.
Bad reward = confusion and failure.
3. State Space
The state space defines what information the agent receives at each step.
In our case, a state is a raw image from MNIST.
It’s what the agent sees — and what it uses to make decisions.
4. Action Space
The action space defines all the possible moves of the agent.
In our application the action space is binary: YES (1) or NO (0).
Each action leads to a new reward and new state.
5. Discount Factor γ
The discount factor (gamma) tells the agent how much to value future rewards.
γ = 0 means the agent only cares about immediate results.
γ closer to 1 means the agent thinks long-term.
In Deep RL, this balance is critical.
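To make the role of γ concrete, the discounted return from step t is commonly written as below. The symbols G_t and r are standard RL notation, not taken from this tutorial; r_{t+k+1} is the reward received k+1 steps after t.

\[ G_t = \sum_{k=0}^{\infty} \gamma^{k} \, r_{t+k+1} = r_{t+1} + \gamma \, r_{t+2} + \gamma^{2} \, r_{t+3} + \dots \]

With γ = 0 only the first term survives. As γ approaches 1, later rewards count almost as much as the immediate one.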
6. Transition Function (optional)
In model-free RL, we don’t define the transition function, which describes how the world changes after each action.
We don’t know or care how the environment works internally — we just observe what happens.
If we used model-based RL, we’d need this. But in this application, we skip it.
Markov Decision Process (MDP) for teaching an artificial brain to discover what 3 looks like
1. Defining the Objective of the System
The goal is to teach an agent to recognize the digit 3 using trial and error.
The agent receives an image and chooses an action: YES (it’s a 3) or NO (it’s not).
The agent learns by maximizing cumulative reward over time. It must discover what 3 looks like, not be told.
Simple rules, complex learning.
Key concepts:
- Cumulative reward maximization – we guide learning by total reward per episode.
- Reward-driven behavior – the agent learns what leads to positive feedback.
- Objective shaping – we translate the task (detect 3) into a sequence of binary decisions.
2. Defining the Reward Function R
The agent gets feedback after each decision. Rewards are binary: +1 or -1. No partial scores.
Reward structure:
- +1 → YES and the image contains a 3 (true positive, TP)
- +1 → NO and the image does not contain a 3 (true negative, TN)
- -1 → YES but there is no 3 (false positive, FP)
- -1 → NO but the image does contain a 3 (false negative, FN)
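The reward structure above can be sketched as a small Python function. This is a minimal sketch, assuming the action is encoded as 1 for YES and 0 for NO. The names `reward`, `action`, and `is_three` are illustrative and do not come from the tutorial's code.

```python
def reward(action: int, is_three: bool) -> int:
    """Return +1 for a correct decision and -1 for a wrong one.

    action:   1 means YES ("it's a 3"), 0 means NO (assumed encoding).
    is_three: True when the image actually shows the digit 3.
    """
    said_yes = action == 1
    # Correct when the answer matches the truth (TP or TN), wrong otherwise (FP or FN).
    return 1 if said_yes == is_three else -1


# The four cases in order: TP, TN, FP, FN.
for action, is_three in [(1, True), (0, False), (1, False), (0, True)]:
    print(action, is_three, reward(action, is_three))
```

There is no partial score here: every decision maps to exactly one of the two values.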
The accuracy of the agent is calculated as:
\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]
Why do we track all four counts (TP, TN, FP, FN) and not just TP and TN?
Accuracy can be misleading in imbalanced datasets. Imagine a dataset with 1000 images, but only 50 contain the digit 3, and the remaining 950 do not.
If the agent always predicts “NO” (i.e., “this is not a 3”), then:
- TP = 0
- TN = 950
- FP = 0
- FN = 50
- Accuracy = (TP + TN) / Total = 950 / 1000 = 95%
This gives the illusion of high performance, yet the agent fails to recognize any digit 3.
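The numbers in this example can be checked with a few lines of Python. Recall, TP / (TP + FN), is not used in the text above. It is added here only as one common way to expose the problem that accuracy hides. The function names are illustrative.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all decisions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def recall(tp: int, fn: int) -> float:
    """Fraction of the real 3s that the agent found (0.0 if there are none)."""
    positives = tp + fn
    return tp / positives if positives > 0 else 0.0


# An agent that always answers NO on 1000 images, 50 of which are a 3.
tp, tn, fp, fn = 0, 950, 0, 50

print(f"accuracy = {accuracy(tp, tn, fp, fn):.2%}")  # 95% of decisions are correct
print(f"recall   = {recall(tp, fn):.2%}")            # but none of the 3s is found
```

High accuracy and zero recall together are the signature of an agent that has learned to ignore the digit 3 entirely.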
Key concepts:
- Reward shaping – immediate feedback after every image.
- Penalties – discourage false decisions.
- Dense reward – the agent receives reward at every step. No delay.
- No sparse episodes – feedback comes instantly, not at episode end.
3. Defining the State Space S
Each state is a raw image from the MNIST dataset. We use the original grayscale image, 28×28 pixels, 1 channel. No feature extraction. No flattening. No manual processing.
Images are normalized to the range [0.0, 1.0]. We pass raw pixels directly into the Convolutional Neural Network (CNN).
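A minimal sketch of this normalization with NumPy is shown below. It assumes the raw pixels arrive as unsigned 8-bit values (0 to 255) and that the CNN expects a channel-first layout of shape (1, 28, 28). Both are assumptions, as is the helper name `to_state`. The image here is random data, not a real MNIST digit.

```python
import numpy as np


def to_state(image_uint8: np.ndarray) -> np.ndarray:
    """Turn a 28x28 uint8 image into a float32 state in [0.0, 1.0].

    The image is kept 2-D (no flattening); only a channel axis is added,
    giving shape (1, 28, 28). Channel-first layout is an assumption.
    """
    if image_uint8.shape != (28, 28):
        raise ValueError(f"expected a 28x28 image, got {image_uint8.shape}")
    state = image_uint8.astype(np.float32) / 255.0
    return state[np.newaxis, :, :]


# Stand-in for one MNIST image: random pixel values in 0..255.
rng = np.random.default_rng(0)
fake_image = rng.integers(0, 256, size=(28, 28), dtype=np.uint8)

state = to_state(fake_image)
print(state.shape, state.dtype)
```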
MNIST dataset details:
- Dataset: MNIST – Modified National Institute of Standards and Technology
- Creators: Yann LeCun, Corinna Cortes, Christopher J.C. Burges
- License: CC0 1.0 Public Domain – free for commercial and academic use
- Source: http://yann.lecun.com/exdb/mnist/
Key concepts:
- Raw input – no handcrafted features.
- State abstraction – learned by CNN layers, not predefined.
- Normalization – helps with stable learning and faster convergence.
4. Defining the Action Space A
The agent can choose one of two actions: A = {0: NO, 1: YES}
One action per image. Each action represents a full decision. The action space is discrete and fixed.
Key concepts:
- Action discretization – binary decision problem.
- Action granularity – coarse-grained, but sufficient.
- No action clipping needed – all actions are valid and bounded.
5. Defining the Discount Factor γ
The environment gives immediate rewards. But we still want the agent to think long-term. Use a moderate discount factor: γ = 0.9
This value encourages the agent to maximize reward over the whole sequence of images in an episode, not just per image.
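To see what γ does to a sequence of per-image rewards, here is a small sketch that computes the discounted return r_0 + γ·r_1 + γ²·r_2 + ... for one episode. The function name and the sample rewards are made up for illustration.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float = 0.9) -> float:
    """Sum of rewards where the k-th future reward is weighted by gamma**k."""
    total = 0.0
    # Walk backwards so each step is: reward now + gamma * (everything after).
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical episode: correct, correct, wrong, correct.
episode_rewards = [1, 1, -1, 1]

print(discounted_return(episode_rewards, gamma=0.9))  # roughly 1.819
print(discounted_return(episode_rewards, gamma=0.0))  # only the first reward counts
```

With γ = 0.9 the mistake on the third image still lowers the return seen from the first step. With γ = 0 it is invisible.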
Key concepts:
- Temporal credit assignment – even simple environments benefit from temporal structure.
- Long-term stability – helps the agent avoid overfitting to local patterns.
- Episode learning – reward builds up across many steps.
6. (Optional) Transition Function P
This is a model-free RL setup. We don’t define or learn transition probabilities. The next state is just the next image from the dataset. The agent does not influence the environment – no control over image sequence.
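The idea that the next state is just the next image can be sketched as a tiny environment. This is not the tutorial's environment; the class `DigitThreeEnv`, its methods, and the placeholder "images" are all hypothetical. The point to notice is that `step` picks the next state by position in the dataset, regardless of the action.

```python
from typing import Any, List, Optional, Tuple


class DigitThreeEnv:
    """Hypothetical minimal environment that shows images in a fixed order."""

    def __init__(self, images: List[Any], labels: List[int]) -> None:
        if not images or len(images) != len(labels):
            raise ValueError("need a non-empty dataset with one label per image")
        self.images = images
        self.labels = labels
        self.index = 0

    def reset(self) -> Any:
        """Start a new episode at the first image."""
        self.index = 0
        return self.images[self.index]

    def step(self, action: int) -> Tuple[Optional[Any], int, bool]:
        """Score the action (1 = YES, 0 = NO), then move to the next image."""
        is_three = self.labels[self.index] == 3
        reward = 1 if (action == 1) == is_three else -1
        # The transition ignores the action: the next state is the next image.
        self.index += 1
        done = self.index >= len(self.images)
        next_state = None if done else self.images[self.index]
        return next_state, reward, done


# Strings stand in for real 28x28 images.
env = DigitThreeEnv(images=["img_a", "img_b", "img_c"], labels=[3, 5, 3])
state = env.reset()
done = False
while not done:
    state, reward, done = env.step(1)  # a naive agent that always answers YES
    print(state, reward, done)
```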
Key concepts:
- Transition probabilities P(s’|s,a) – ignored.
- Approximate dynamics – not needed.
- Model-based planning – not used here.
Next: Choosing the Algorithm (DQN)
Now that your problem and MDP are defined, it’s time to choose how the agent will learn.
This is where Deep Reinforcement Learning becomes practical.
There are many algorithms — DQN, PPO, A3C, SAC — each with pros and cons.
But not all of them fit every task.
In Part 4, you’ll see why DQN is the best choice for this application.
You’ll learn:
→ what DQN is,
→ how it works,
→ why it fits this binary classification problem,
→ and what to watch out for during training.