This page was last edited on 12 November 2025
Imagine America in the 1950s:
- Computers had just appeared. The military and industry wanted methods for making optimal decisions under conditions of uncertainty.
- Richard Bellman had invented dynamic programming, a technique for breaking a large problem down into smaller pieces.
- Ronald Howard comes along and says: “Well, if we combine the idea of the Markov process (where the future depends only on the present) with dynamic programming, we get a general framework for decisions.”
Thus was born the Markov Decision Process (MDP) – a simple model, but with universal applicability: from robots and games to medicine and economics.

What is an MDP?
The MDP is a story you tell the AI agent about reality, but it’s not the reality. Reality is continuous, noisy, high-dimensional, and full of unknowns.
In theory, an MDP is a mathematical framework used to describe decision-making problems where outcomes are partly under the agent’s control and partly random. In reinforcement learning, it defines how an agent interacts with the environment over time to learn optimal behavior.
In other words, an MDP models an environment where an agent makes decisions step by step. At each time step, the agent observes the current state, takes an action, receives a reward, and transitions to a new state. The process continues as the agent learns to maximize long-term rewards.
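The observe, act, reward, transition loop can be sketched in a few lines of Python. This is only an illustration: `CorridorEnv`, its five-cell layout, and the two policies are hypothetical and not part of any library.

```python
import random


class CorridorEnv:
    """Hypothetical 1-D corridor: cells 0..4, goal at cell 4."""

    def __init__(self):
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (left) or +1 (right); position is clamped to the corridor
        self.state = min(4, max(0, self.state + action))
        done = self.state == 4
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def run_episode(env, policy, max_steps=100):
    """Observe the state, act, receive a reward, move to the next state; repeat."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)
        state, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward


def always_right(state):
    return 1


_rng = random.Random(0)


def random_policy(state):
    return _rng.choice([-1, 1])
```

Calling `run_episode(CorridorEnv(), always_right)` walks straight to the goal and collects the single reward; swapping in `random_policy` shows the same loop with a worse decision rule.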
Formal Definition
An MDP is defined by a 5-tuple:
(S, A, P, R, γ)
Where:
- S – Set of possible states
- A – Set of possible actions
- P – Transition probability function (defines how actions move the agent from one state to another)
- R – Reward function
- γ – Discount factor (controls how much future rewards matter)
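For a small, finite problem the 5-tuple maps almost directly onto a data structure. The sketch below is one possible encoding, assuming dictionaries for P and R; the class name `FiniteMDP` and the battery example are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FiniteMDP:
    states: tuple        # S
    actions: tuple       # A
    transitions: dict    # P: (s, a) -> {s_next: probability}
    rewards: dict        # R: (s, a) -> float
    gamma: float         # discount factor

    def validate(self):
        """Check that every P(. | s, a) sums to 1 and gamma lies in [0, 1]."""
        for s in self.states:
            for a in self.actions:
                dist = self.transitions[(s, a)]
                if abs(sum(dist.values()) - 1.0) > 1e-9:
                    return False
        return 0.0 <= self.gamma <= 1.0


# Hypothetical two-state example: a robot whose battery is "high" or "low".
battery_mdp = FiniteMDP(
    states=("high", "low"),
    actions=("work", "charge"),
    transitions={
        ("high", "work"): {"high": 0.7, "low": 0.3},
        ("high", "charge"): {"high": 1.0},
        ("low", "work"): {"low": 1.0},
        ("low", "charge"): {"high": 0.9, "low": 0.1},
    },
    rewards={
        ("high", "work"): 1.0,
        ("high", "charge"): 0.0,
        ("low", "work"): -1.0,
        ("low", "charge"): 0.0,
    },
    gamma=0.9,
)
```

Tables like these only work when S and A are small and discrete. Most robotics problems need function approximation instead, but the five ingredients stay the same.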
The Markov Property
The key assumption in MDPs is the Markov property:
The future depends only on the current state and action — not on the full history.
This means the agent doesn’t need to remember the past. Just the current state is enough to make an optimal decision. In other words, the decision rule is memoryless.
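Written as a formula, the property says that conditioning on the whole history gives the same next-state distribution as conditioning on the current state and action alone. The time-indexed symbols s_t and a_t are notation assumed here for the state and action at step t:

```latex
P(s_{t+1} \mid s_t, a_t) = P(s_{t+1} \mid s_0, a_0, s_1, a_1, \dots, s_t, a_t)
```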
What does this mean practically?
In robotics, if we know a robot’s current position and velocity, we can predict its next state after applying a motor command. We don’t need to know where it was 10 seconds ago. That’s the essence of the Markov property.
ANALOGY
Think of playing chess blindfolded, but your assistant tells you the full board every turn. You don’t need to remember previous positions — just the current board is enough. That’s a Markov decision setup.
Types of MDPs
Deterministic vs. Stochastic
- Deterministic: The next state is fully predictable from the current state and action.
- Stochastic: The next state is random, based on a probability distribution (more realistic in robotics).
Finite vs. Infinite Horizon
- Finite: The task ends after a fixed number of steps (e.g., a robot must reach a goal in 100 steps).
- Infinite: The task continues forever — we focus on long-term behavior.
Episodic vs. Continuing
- Episodic: Tasks have natural resets (e.g., a robot finishes one delivery, then starts another).
- Continuing: The agent never resets, and learning happens continuously (e.g., robot patrolling an area).
Components of an MDP
An MDP is defined by five core components: state space, action space, transition probabilities, reward function, and discount factor. Each plays a distinct role in how an agent learns and interacts with the environment.
1. State Space (S)
The state space determines whether learning is possible at all. The state represents everything the agent needs to make decisions.
State design is not about completeness, it’s about sufficiency. We need to give the agent just enough information to make good decisions, but not so much that it drowns.
In robotics, this often includes:
- Position (x, y, z)
- Orientation (e.g., yaw, pitch, roll)
- Sensor readings (camera, LiDAR, IMU, encoders, etc.)
- Environment info (obstacles, map, targets)
A state can be low-dimensional (e.g., 2D position) or high-dimensional (e.g., full image input).
How you represent the state matters. Poor state design = poor learning.
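As a rough illustration of what state design looks like in code, the sketch below flattens a few robot quantities into a fixed-length feature list. The `RobotState` class, its fields, and the 10 m range cap are assumptions chosen for the example, not a recommended layout.

```python
import math
from dataclasses import dataclass


@dataclass
class RobotState:
    x: float
    y: float
    yaw: float      # heading in radians
    lidar: list     # range readings in metres

    def to_vector(self, max_range=10.0):
        """Flatten the state into a fixed-length feature list."""
        # Encode yaw as (cos, sin) so that -pi and +pi map to the same point.
        heading = [math.cos(self.yaw), math.sin(self.yaw)]
        # Clip ranges to max_range and scale them to [0, 1].
        ranges = [min(r, max_range) / max_range for r in self.lidar]
        return [self.x, self.y] + heading + ranges


state = RobotState(x=1.0, y=2.0, yaw=0.0, lidar=[0.5, 20.0, 5.0])
vector = state.to_vector()
```

Choices like the (cos, sin) heading encoding and the clipped ranges are exactly the kind of representation decisions that affect how easily an agent learns.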
2. Action Space (A)
Action space design is a negotiation between what’s “natural” for the problem and what’s “feasible” for the algorithm.
Actions are the control commands the agent can issue.
- Discrete actions – move forward, turn left, pick object
- Continuous actions – set motor speed to 0.5, turn 12.4 degrees, apply torque
In robotics:
- For a mobile robot: actions = linear and angular velocities
- For a robotic arm: actions = joint angles or torques
The action space defines what the agent can do – and shapes the learning algorithm used.
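The two kinds of action space can be sketched side by side for a mobile robot. The action names, the velocity values, and the limits below are hypothetical.

```python
from enum import Enum


class DiscreteAction(Enum):
    FORWARD = 0
    TURN_LEFT = 1
    TURN_RIGHT = 2


# Hypothetical mapping from each discrete action to (linear m/s, angular rad/s).
VELOCITY_COMMANDS = {
    DiscreteAction.FORWARD: (0.5, 0.0),
    DiscreteAction.TURN_LEFT: (0.0, 1.0),
    DiscreteAction.TURN_RIGHT: (0.0, -1.0),
}


def clip_continuous_action(linear, angular, max_linear=1.0, max_angular=2.0):
    """Continuous action: any velocity pair, clipped to the robot's limits."""
    linear = max(-max_linear, min(max_linear, linear))
    angular = max(-max_angular, min(max_angular, angular))
    return linear, angular
```

With the discrete version the agent picks one of three options at each step; with the continuous version it outputs two real numbers.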
3. Transition Probabilities (P)
Most RL work isn’t “learning in a clean MDP,” it’s patching around the cracks in the transition dynamics you can’t model correctly.
The environment’s response to an action is:
P(s’ | s, a) – the probability of reaching next state s’ from current state s after taking action a.
- In deterministic systems: next state is fixed.
- In real-world robotics: transitions are often stochastic due to noise, delays, or external forces.
Transition models are critical in planning, prediction, and simulation.
Some RL methods learn them; others assume we don’t have access to them (model-free).
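When the model is known and small, P(s’ | s, a) can be stored as a table and sampled from directly. The slippery-floor numbers below are invented for illustration.

```python
import random

# Hypothetical P(s' | s, a) for a robot on a slippery floor.
TRANSITIONS = {
    ("cell_0", "forward"): {"cell_1": 0.8, "cell_0": 0.2},  # 20% chance of slipping
    ("cell_1", "forward"): {"cell_2": 0.8, "cell_1": 0.2},
    ("cell_2", "forward"): {"cell_2": 1.0},                 # deterministic: wall ahead
}


def sample_next_state(state, action, rng):
    """Draw s' from the distribution P(. | s, a)."""
    dist = TRANSITIONS[(state, action)]
    next_states = list(dist.keys())
    probabilities = list(dist.values())
    return rng.choices(next_states, weights=probabilities, k=1)[0]


rng = random.Random(42)
samples = [sample_next_state("cell_0", "forward", rng) for _ in range(1000)]
fraction_moved = samples.count("cell_1") / len(samples)
```

Here `fraction_moved` should land near 0.8. A model-free method never sees the `TRANSITIONS` table; it only sees the sampled outcomes.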
4. Reward Function (R)
Every reward turns your MDP into a biased mirror of reality, and your agent is living inside that mirror, not the world you thought you described.
The reward function defines how the agent is scored.
R(s, a) or R(s, a, s’) — scalar value received after taking action a in state s.
Designing rewards is an art:
- Sparse reward – only gives reward at the goal (harder to learn).
- Dense reward – gives feedback frequently (easier to learn, but the agent may end up optimizing the feedback signal instead of the real goal).
- Reward shaping – adds artificial feedback to guide the agent faster.
- Hard-coded reward – simple rules (e.g., -1 for crash, +1 for success).
The reward tells the agent what to learn, so its definition directly impacts learning success. But when defining rewards, you must remember that an agent doesn’t learn “the world”, it learns the world through the lens of your reward.
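To make the sparse versus dense distinction concrete, here is a minimal sketch for a navigation task. The goal position, the success radius, and the choice of "distance reduced" as the dense signal are all assumptions.

```python
import math

GOAL = (5.0, 5.0)     # hypothetical goal position
GOAL_RADIUS = 0.5     # hypothetical success threshold in metres


def distance_to_goal(position):
    return math.hypot(GOAL[0] - position[0], GOAL[1] - position[1])


def sparse_reward(next_position):
    """Reward only when the goal is reached."""
    return 1.0 if distance_to_goal(next_position) <= GOAL_RADIUS else 0.0


def dense_reward(position, next_position):
    """Reward equal to the reduction in distance to the goal on this step."""
    return distance_to_goal(position) - distance_to_goal(next_position)
```

The sparse version says nothing until the robot arrives. The dense version gives a signal at every step, but it also rewards moving straight at the goal even when an obstacle is in the way.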
5. Discount Factor (γ)
The discount factor controls how much future rewards matter.
- γ = 0 → only cares about immediate reward
- γ close to 1 → values long-term reward (used for planning ahead)
In robotics, a well-tuned γ can help an agent balance short-term reactions with long-term strategies (e.g., avoiding risky moves that seem good short-term).
You can think of γ as the agent’s “patience.”
The closer it is to 1, the more far-sighted the behavior.
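The effect of γ is easiest to see by computing the discounted return, r_0 + γ·r_1 + γ²·r_2 + ..., for one sequence of rewards. The reward sequence below is a made-up example.

```python
def discounted_return(rewards, gamma):
    """Compute r_0 + gamma * r_1 + gamma^2 * r_2 + ... by folding from the end."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical episode: nothing for three steps, then a reward of 10.
rewards = [0.0, 0.0, 0.0, 10.0]
myopic = discounted_return(rewards, gamma=0.0)    # the delayed reward is invisible
patient = discounted_return(rewards, gamma=0.9)   # 10 * 0.9**3, about 7.29
```

With γ = 0 the agent sees no value in this trajectory at all; with γ = 0.9 the delayed reward still counts for most of its face value.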
Example: Defining an MDP for Robotics
Let’s break down how to formulate an MDP for real-world robotic tasks. We define the state, action, and reward, and then use this structure to create a trainable environment.
MDP for a Mobile Robot
- State (S):
Position (x, y), orientation (θ), and sensor data (e.g., LiDAR, IMU, GPS).
This tells the agent where it is and what it sees.
- Action (A):
Wheel velocities (left and right).
These control movement and direction.
- Reward (R):
+1 for getting closer to the goal
-1 for collision
0 if no progress
The goal is to learn how to reach a target without crashing.
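A stripped-down version of this MDP can be sketched as a single transition function. It assumes a noise-free differential-drive model and leaves the sensor data out of the state; the wheel base, time step, goal, and obstacle values are placeholders.

```python
import math

WHEEL_BASE = 0.3                # hypothetical distance between wheels, metres
DT = 0.1                        # hypothetical control period, seconds
GOAL = (2.0, 0.0)               # hypothetical target position
ROBOT_RADIUS = 0.2              # hypothetical robot footprint, metres
OBSTACLES = [(1.0, 1.0, 0.3)]   # hypothetical circular obstacles: (x, y, radius)


def collided(state):
    x, y, _ = state
    return any(
        math.hypot(x - ox, y - oy) <= radius + ROBOT_RADIUS
        for ox, oy, radius in OBSTACLES
    )


def reward(state, next_state):
    """-1 for collision, +1 for getting closer to the goal, 0 otherwise."""
    if collided(next_state):
        return -1.0
    before = math.hypot(GOAL[0] - state[0], GOAL[1] - state[1])
    after = math.hypot(GOAL[0] - next_state[0], GOAL[1] - next_state[1])
    return 1.0 if after < before else 0.0


def step(state, action):
    """One transition: state = (x, y, theta), action = (v_left, v_right)."""
    x, y, theta = state
    v_left, v_right = action
    v = (v_left + v_right) / 2.0               # forward speed
    omega = (v_right - v_left) / WHEEL_BASE    # turn rate
    next_state = (
        x + v * math.cos(theta) * DT,
        y + v * math.sin(theta) * DT,
        theta + omega * DT,
    )
    return next_state, reward(state, next_state)
```

For example, `step((0.0, 0.0, 0.0), (1.0, 1.0))` drives the robot 0.1 m toward the goal and returns a reward of +1. A real robot would add noise to the motion and sensor readings to the state.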
MDP for a Robotic Arm
- State (S):
Joint angles, end-effector position, and optionally camera or force feedback.
- Action (A):
Joint torques or desired angles.
- Reward (R):
+1 if object is successfully grasped
-1 if object is missed or dropped
0 otherwise
The task could be grasping, stacking, or placing an object.
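The arm’s reward rule is simple enough to write down directly. The sketch below assumes some external check reports the result of each step as one of four outcomes; the `GraspOutcome` names and the termination rule are hypothetical.

```python
from enum import Enum


class GraspOutcome(Enum):
    IN_PROGRESS = "in_progress"
    GRASPED = "grasped"
    MISSED = "missed"
    DROPPED = "dropped"


def grasp_reward(outcome):
    """+1 for a successful grasp, -1 for a miss or drop, 0 otherwise."""
    if outcome is GraspOutcome.GRASPED:
        return 1.0
    if outcome in (GraspOutcome.MISSED, GraspOutcome.DROPPED):
        return -1.0
    return 0.0


def is_terminal(outcome):
    """Assumption: every outcome except IN_PROGRESS ends the episode."""
    return outcome is not GraspOutcome.IN_PROGRESS
```

Because the reward is 0 on every step until the attempt ends, this is a sparse reward in the sense described earlier.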
References:
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press.
- Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley-Interscience.
- Wikipedia contributors. (2024). Markov Decision Process. In Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/wiki/Markov_decision_process
- Zhang, A., Vinyals, O., Munos, R., & Precup, D. (2020). A Deeper Look at Discounting in Reinforcement Learning.