Markov Decision Process (MDP) – Application 1

This page was last edited on 24 April 2025

This tutorial has six parts:

 PART 1: Overview of the tutorial 

 PART 2: Problem Definition

 PART 3: Markov Decision Process (MDP) (you are here)

 PART 4: Choosing the Algorithm (DQN)

 PART 5: Environment + RL Model + Reward Function

 PART 6: Training + Testing + Google Colab Access


In Deep Reinforcement Learning (Deep RL), everything starts with the Markov Decision Process (MDP). It defines how the agent sees the world, how it acts, and how it learns from outcomes.

An MDP is a mathematical framework. It breaks the learning problem down into five elements, plus an optional sixth.

It helps us model the environment in a way that is clear, repeatable, and trainable.

We use the MDP framework to turn goals such as “learn to detect the number 3” into precise rules, actions, and feedback.

Without an MDP, the agent wouldn’t know what a state is, what actions it can take, or what reward means.

Every Deep RL algorithm is based on this idea: “The agent interacts with an environment, receives a reward, and learns what to do next.”

The MDP gives structure to that interaction. It defines the rules of the game.

Once we define the MDP clearly, we can:

  • Build training environments
  • Choose the right algorithm (like DQN or PPO)
  • Track performance with metrics
  • Adjust rewards, states, and actions for better learning

No MDP → no learning.

What each section of the MDP means

1. Objective

What’s the long-term goal of the agent?
What behavior are we trying to learn?
This is where we shape the objective in terms of maximizing reward over time.
It sets the direction for the entire training process.

2. Reward Function

The reward function defines how the agent knows if it did well or poorly.
Every decision is judged with +1, 0, or -1 (or any value we choose). The reward is the only feedback the agent sees.
Good reward design = good learning.
Bad reward = confusion and failure.

3. State Space

The state space defines what information the agent receives at each step.
In our case, a state is a raw image from MNIST.
It’s what the agent sees — and what it uses to make decisions.

4. Action Space

The action space defines all the possible moves of the agent.
In our application the action space is binary: YES (1) or NO (0).
Each action leads to a reward and a new state.

5. Discount Factor γ

The discount factor (gamma) tells the agent how much to value future rewards.
γ = 0 means the agent only cares about immediate results.
γ closer to 1 means the agent thinks long-term.
In Deep RL, this balance is critical.
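
The role of γ can be made concrete with the standard definition of the discounted return. The symbols \(G_t\) (return from step t) and \(r_{t+k+1}\) (reward received k+1 steps later) are notation introduced here for illustration, not taken from the rest of this tutorial:

\[ G_t = r_{t+1} + \gamma\, r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k\, r_{t+k+1} \]

With γ = 0 the sum collapses to \(G_t = r_{t+1}\), the immediate reward only. As γ approaches 1, later rewards count almost as much as the first one.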

6. Transition Function (optional)

The transition function describes how the world changes after each action. In model-free RL, we don’t define it.
We don’t know or care how the environment works internally — we just observe what happens.
If we used model-based RL, we’d need this. But in this application, we skip it.


1. Defining the Objective of the System

The goal is to teach an agent to recognize the digit 3 using trial and error.
The agent receives an image and chooses an action: YES (it’s a 3) or NO (it’s not).

The agent learns by maximizing cumulative reward over time. It must discover what a 3 looks like, not be told.
Simple rules, complex learning.

Key concepts:

  • Cumulative reward maximization – we guide learning by total reward per episode.
  • Reward-driven behavior – the agent learns what leads to positive feedback.
  • Objective shaping – we translate the task (detect 3) into a sequence of binary decisions.

2. Defining the Reward Function R

The agent gets feedback after each decision. Rewards are binary: +1 or -1. No partial scores.

Reward structure:

  • +1 → YES and the image contains a 3 (true positive (TP))
  • +1 → NO and the image does not contain a 3 (true negative (TN))
  • -1 → YES but there is no 3 (false positive (FP))
  • -1 → NO but the image does contain a 3 (false negative (FN))

    \[ \boxed{ \displaystyle \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} } \]
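
The reward structure and the accuracy formula above can be sketched in a few lines of Python. This is a minimal illustration, not the tutorial’s actual implementation: the names `reward_and_outcome`, `accuracy`, and the sample `decisions` list are hypothetical.

```python
from collections import Counter

TARGET_DIGIT = 3  # the digit the agent must detect
NO, YES = 0, 1    # action encoding used in this tutorial


def reward_and_outcome(action: int, label: int) -> tuple:
    """Return (reward, outcome) for one decision on an image whose true digit is `label`."""
    is_target = (label == TARGET_DIGIT)
    if action == YES:
        return (1, "TP") if is_target else (-1, "FP")
    return (-1, "FN") if is_target else (1, "TN")


def accuracy(counts: Counter) -> float:
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    total = counts["TP"] + counts["TN"] + counts["FP"] + counts["FN"]
    return (counts["TP"] + counts["TN"]) / total if total else 0.0


# Hypothetical decisions as (action, true digit) pairs, one of each outcome.
decisions = [(YES, 3), (NO, 7), (YES, 5), (NO, 3)]
counts = Counter(reward_and_outcome(a, y)[1] for a, y in decisions)
total_reward = sum(reward_and_outcome(a, y)[0] for a, y in decisions)
print(dict(counts), total_reward, accuracy(counts))
```

Returning the outcome label together with the reward keeps the confusion counts and the reward signal in sync, since both come from the same comparison.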


Key concepts:

  • Reward shaping – immediate feedback after every image.
  • Penalties – discourage false decisions.
  • Dense reward – the agent receives reward at every step. No delay.
  • No sparse rewards – feedback comes instantly, not at episode end.
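
Because every correct decision earns +1 and every wrong one earns -1, the average reward is tied directly to the accuracy defined above. A short derivation, where \(N\) (number of decisions) and \(\bar{R}\) (mean reward per decision) are notation introduced here:

\[ \bar{R} = \frac{(TP + TN) - (FP + FN)}{N} = \frac{2\,(TP + TN) - N}{N} = 2 \cdot \text{Accuracy} - 1, \qquad N = TP + TN + FP + FN \]

For example, an accuracy of 0.9 corresponds to a mean reward of 0.8, and random guessing at 0.5 accuracy gives a mean reward of 0.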

3. Defining the State Space S

Each state is a raw image from the MNIST dataset. We use the original grayscale image, 28×28 pixels, 1 channel. No feature extraction. No flattening. No manual processing.

Pixel values are normalized to the range [0.0, 1.0]. We pass raw pixels directly into the Convolutional Neural Network (CNN).

MNIST dataset details:

  • Dataset: MNIST – Modified National Institute of Standards and Technology
  • Creators: Yann LeCun, Corinna Cortes, Christopher J.C. Burges
  • License: CC0 1.0 Public Domain – free for commercial and academic use
  • Source: http://yann.lecun.com/exdb/mnist/

Key concepts:

  • Raw input – no handcrafted features.
  • State abstraction – learned by CNN layers, not predefined.
  • Normalization – helps with stable learning and faster convergence.
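
The state preparation described above might look like the sketch below. It assumes NumPy, 8-bit input pixels (0 to 255), and a channel-first layout of shape (1, 28, 28). The helper name `to_state` is hypothetical, and the image is random stand-in data rather than a real MNIST sample.

```python
import numpy as np


def to_state(image_uint8: np.ndarray) -> np.ndarray:
    """Convert a 28x28 uint8 image (0..255) into a (1, 28, 28) float32 state in [0, 1]."""
    if image_uint8.shape != (28, 28):
        raise ValueError(f"expected a 28x28 image, got {image_uint8.shape}")
    state = image_uint8.astype(np.float32) / 255.0  # scale pixels to [0.0, 1.0]
    return state[np.newaxis, :, :]                  # add the channel axis, no flattening


# Synthetic stand-in for one MNIST image (random pixels, not real data).
rng = np.random.default_rng(0)
fake_image = rng.integers(0, 256, size=(28, 28), dtype=np.uint8)
state = to_state(fake_image)
print(state.shape, state.dtype)
```

Keeping the 2D layout, instead of flattening to 784 values, preserves the spatial structure that convolutional layers rely on.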

4. Defining the Action Space A

The agent can choose one of two actions: A = {0: NO, 1: YES}

One action per image. Each action represents a full decision. The action space is discrete and fixed.

Key concepts:

  • Action discretization – binary decision problem.
  • Action granularity – coarse-grained, but sufficient.
  • No action clipping needed – all actions are valid and bounded.
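
As an illustration, the binary action space can be written as a small enumeration. The sketch assumes a value-based agent (such as the DQN covered in Part 4) that produces one estimated value per action. `Action` and `greedy_action` are hypothetical names.

```python
from enum import IntEnum
from typing import Sequence


class Action(IntEnum):
    NO = 0   # "this image is not a 3"
    YES = 1  # "this image is a 3"


def greedy_action(q_values: Sequence[float]) -> Action:
    """Pick the action with the highest estimated value (one value per action)."""
    if len(q_values) != len(Action):
        raise ValueError("expected exactly one value per action")
    best_index = max(range(len(q_values)), key=lambda i: q_values[i])
    return Action(best_index)


# Hypothetical value estimates: [value of NO, value of YES].
print(greedy_action([0.2, 0.7]).name)
```

Since both indices map to valid actions, no clipping or rescaling step is required between the network output and the environment.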

5. Defining the Discount Factor γ

The environment gives immediate rewards, but we still want the agent to think long-term. We use a moderate discount factor: γ = 0.9

This value encourages the agent to maximize reward across the steps of an episode, not just per image.

Key concepts:

  • Temporal credit assignment – even simple environments benefit from temporal structure.
  • Long-term stability – helps the agent avoid overfitting to local patterns.
  • Episode learning – reward builds up across many steps.
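
To see what γ = 0.9 does to a sequence of rewards, the discounted return of one episode can be computed as below. This is a minimal sketch: `discounted_return` is a hypothetical helper and the reward list is made up for illustration.

```python
def discounted_return(rewards, gamma=0.9):
    """Compute r_1 + gamma * r_2 + gamma^2 * r_3 + ... for one episode."""
    g = 0.0
    for r in reversed(rewards):  # work backwards: G_t = r_{t+1} + gamma * G_{t+1}
        g = r + gamma * g
    return g


# Hypothetical episode: three correct decisions and one mistake (third step).
episode_rewards = [1, 1, -1, 1]
print(discounted_return(episode_rewards, gamma=0.9))  # 1 + 0.9 - 0.81 + 0.729
print(discounted_return(episode_rewards, gamma=0.0))  # only the first reward counts
```

Later rewards are weighted by increasing powers of γ, so with γ = 0.9 the fourth reward contributes 0.729 of its value, while with γ = 0 it contributes nothing.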

6. (Optional) Transition Function P

This is a model-free RL setup. We don’t define or learn transition probabilities. The next state is just the next image from the dataset. The agent does not influence the environment: it has no control over the image sequence.

Key concepts:

  • Transition probabilities P(s’|s,a) – ignored.
  • Approximate dynamics – not needed.
  • Model-based planning – not used here.
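
A toy environment makes this concrete: the next state is simply the next image, whatever the agent answered. The class `DigitEnv`, its `reset`/`step` interface, the fixed episode length, and the string placeholders standing in for images are all assumptions made for this sketch, not the environment built in Part 5.

```python
TARGET_DIGIT = 3
NO, YES = 0, 1


class DigitEnv:
    """Hypothetical environment: shows one image per step and scores a YES/NO answer."""

    def __init__(self, images, labels, episode_length=3):
        self.images = images
        self.labels = labels
        self.episode_length = episode_length
        self.index = 0  # position in the dataset
        self.steps = 0  # steps taken in the current episode

    def reset(self):
        self.steps = 0
        return self.images[self.index]

    def step(self, action):
        is_three = self.labels[self.index] == TARGET_DIGIT
        reward = 1 if (action == YES) == is_three else -1
        # Transition: move to the next image regardless of the chosen action.
        self.index = (self.index + 1) % len(self.images)
        self.steps += 1
        done = self.steps >= self.episode_length
        return self.images[self.index], reward, done


# Placeholder "images" (strings) with their true digits.
env = DigitEnv(images=["img_a", "img_b", "img_c"], labels=[3, 7, 3])
state = env.reset()
trace = [env.step(action) for action in (YES, YES, NO)]
print(state, trace)
```

The `action` argument only affects the reward. It never appears in the line that selects the next image, which is why no transition model P(s’|s,a) has to be defined or learned here.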

Now that your problem and MDP are defined, it’s time to choose how the agent will learn.

This is where Deep Reinforcement Learning becomes practical.

There are many algorithms — DQN, PPO, A3C, SAC — each with pros and cons.

But not all of them fit every task.

In Part 4, you’ll see why DQN is the best choice for this application.

You’ll learn:

→ what DQN is,
→ how it works,
→ why it fits this binary classification problem,
→ and what to watch out for during training.


Part 4: Choosing the Algorithm