This page was last edited on 17 September 2025
An agent is the core decision-maker in Reinforcement Learning (RL).
It observes the environment, takes actions, and learns from the outcomes.
The goal of the agent is to maximize the total reward over time.
In short: Agent = learner + decision-maker.
Why do we use an agent in Deep RL?
In Deep RL, the agent learns optimal behaviors in complex environments. It learns through trial and error, guided by reward signals rather than labeled examples.
The agent replaces hardcoded logic with a decision rule learned by a deep neural network. This lets the same approach scale and adapt to real-world problems.
Equation
The learning agent in Deep RL uses the Bellman equation to learn action values:
\[ Q(s, a) = r + \gamma \cdot \max_{a'} Q(s', a') \]
Where:
- Q(s,a): the value of taking action a in state s
- r: immediate reward
- γ: discount factor between 0 and 1 (how much future rewards count)
- s′: next state
- a′: an action available in the next state
This is how the agent updates its knowledge.
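The update can be sketched for a small lookup table of Q-values. This is a minimal illustration, not a deep RL implementation: the function name `q_update`, the state and action labels, and the learning rate `alpha` are assumptions added here. With `alpha = 1.0` the new value equals the right-hand side of the equation above.

```python
from collections import defaultdict


def q_update(Q, s, a, r, s_next, actions, gamma=0.9, alpha=1.0):
    """Move Q[(s, a)] toward the target r + gamma * max_a' Q[(s_next, a')]."""
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    target = r + gamma * best_next
    # alpha is an assumed learning rate; alpha = 1.0 replaces the old value
    # with the target, which matches the equation exactly.
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]


# Hypothetical two-state example: unseen entries default to 0.0.
Q = defaultdict(float)
actions = ["left", "right"]
Q[("s1", "right")] = 2.0

new_value = q_update(Q, "s0", "right", r=1.0, s_next="s1", actions=actions)
print(round(new_value, 3))  # 1.0 + 0.9 * 2.0 = 2.8
```

Deep RL replaces the table with a neural network, but the target being learned has the same form.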
ANALOGY
Imagine a robot vacuum.
It learns where dirt collects and which areas to clean first.
At first, it bumps into walls (random actions).
Later, it learns which paths give the cleanest floors (best rewards).
That robot is the agent—it learns by interacting with the world.
HISTORY
The concept of an agent dates back to the 1950s in cybernetics. In RL, it was formalized by Richard Sutton and Andrew Barto in the 1980s.
One of the earliest widely known successes was TD-Gammon (1992), an RL agent that learned to play backgammon at an expert level.
Since then, agents have powered AlphaGo, robotics, and self-driving cars.
Steps for implementing an Agent
A Deep RL agent typically follows these steps:
- Observe the environment (state).
- Decide what action to take (policy).
- Act in the environment.
- Receive reward and observe the new state.
- Learn from the transition (state, action, reward, new state) using the Bellman update or a gradient step.
- Repeat.
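The steps above can be sketched as a loop. The example below is a tabular stand-in for a deep agent, under stated assumptions: the five-cell corridor environment, the helper names (`step`, `choose_action`, `train`, `greedy_policy`), and the settings (learning rate 0.5, ε = 0.1, 500 episodes, 50-step cap) are all hypothetical choices made for illustration.

```python
import random
from collections import defaultdict

N_STATES = 5        # hypothetical corridor: cells 0..4, cell 4 is the goal
ACTIONS = (-1, +1)  # step left or step right


def step(state, action):
    """Toy environment: returns (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def choose_action(Q, state, epsilon, rng):
    """Epsilon-greedy policy; ties between equal Q-values are broken at random."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(Q[(state, a)] for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if Q[(state, a)] == best])


def train(episodes=500, gamma=0.9, alpha=0.5, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)
    for _ in range(episodes):
        state = 0                                           # observe
        for _ in range(50):                                 # cap episode length
            action = choose_action(Q, state, epsilon, rng)  # decide
            next_state, reward, done = step(state, action)  # act, receive reward
            # Learn: a terminal state contributes no future value.
            future = 0.0 if done else max(Q[(next_state, a)] for a in ACTIONS)
            Q[(state, action)] += alpha * (reward + gamma * future - Q[(state, action)])
            state = next_state
            if done:
                break                                       # repeat with a new episode
    return Q


def greedy_policy(Q):
    """Best known action for each non-goal cell."""
    return [max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)]


Q = train()
print(greedy_policy(Q))
```

After training, the greedy policy should step right (`+1`) in every cell, since that is the shortest route to the rewarded goal.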
Agent inputs and outputs
Inputs:
- Current state s
- Past experiences (in some cases)
- Hyperparameters (learning rate, γ, ε, etc.)
Outputs:
- Action a
- Updated policy or value function
- Learned model parameters
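One way to picture these inputs and outputs is as a small interface. The sketch below is an assumption-laden illustration rather than a standard API: the class names `Hyperparameters` and `TabularAgent`, the method names `act` and `learn`, and the default values are all hypothetical, and a dictionary stands in for the learned model parameters.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Hyperparameters:
    learning_rate: float = 0.5  # assumed default
    gamma: float = 0.9          # discount factor
    epsilon: float = 0.1        # exploration rate


@dataclass
class TabularAgent:
    actions: tuple
    hp: Hyperparameters = field(default_factory=Hyperparameters)
    q_table: dict = field(default_factory=dict)  # learned parameters
    rng: random.Random = field(default_factory=lambda: random.Random(0))

    def act(self, state):
        """Input: current state s. Output: action a (epsilon-greedy)."""
        if self.rng.random() < self.hp.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.q_table.get((state, a), 0.0))

    def learn(self, state, action, reward, next_state):
        """Input: one transition. Output: the updated Q-value."""
        best_next = max(self.q_table.get((next_state, a), 0.0) for a in self.actions)
        old = self.q_table.get((state, action), 0.0)
        new = old + self.hp.learning_rate * (reward + self.hp.gamma * best_next - old)
        self.q_table[(state, action)] = new
        return new


agent = TabularAgent(actions=("left", "right"))
action = agent.act("s0")
updated = agent.learn("s0", action, reward=1.0, next_state="s1")
print(action, updated)
```

Here `act` consumes the state and hyperparameters and produces an action, while `learn` consumes a past experience and produces an updated value function.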
EXAMPLE: How an Agent Learns Better Q-values
We assume that:
- γ=0.9
- Learning rate = 1.0 (for simplicity)
- Initial Q-values: Q(s,a) = 0
- Rewards per step: 1.5, 2.0, 2.5, 3.0, 3.2
- In each iteration, the Q-value from the previous iteration stands in for max Q(s′, a′)
Bellman update:
\[ Q(s, a) = r + \gamma \cdot \max_{a'} Q(s', a') \]
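The learning rate does not appear in this update because it is set to 1.0. As a hedged aside, the usual tabular Q-learning rule with an explicit learning rate (written here as α, a symbol the formula above does not use) is:

\[ Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \cdot \max_{a'} Q(s', a') - Q(s, a) \right] \]

Setting α = 1 cancels the two Q(s, a) terms and leaves the update shown above:

\[ Q(s, a) \leftarrow r + \gamma \cdot \max_{a'} Q(s', a') \]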
ITERATION 1
- r = 1.5, Q = 0
- Q(s,a) = 1.5 + 0.9 * 0 = 1.5
ITERATION 2
- r = 2.0, Q = 1.5
- Q(s,a) = 2.0 + 0.9 * 1.5 = 3.35
ITERATION 3
- r = 2.5, Q = 3.35
- Q(s,a) = 2.5 + 0.9 * 3.35 = 5.515
ITERATION 4
- r = 3.0, Q = 5.515
- Q(s,a) = 3.0 + 0.9 * 5.515 = 7.9635
ITERATION 5
- r = 3.2, Q = 7.9635
- Q(s,a) = 3.2 + 0.9 * 7.9635 = 10.36715

We can see in the above graphic how the agent's estimate improves. With each interaction, the Q-value grows, from 1.5 after the first iteration to 10.36715 after the fifth. The graph shows this steady increase in the estimated discounted reward.
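The five iterations can be reproduced with a short loop. This is a minimal sketch of the arithmetic only: the variable names are illustrative, and as in the example, the previous Q-value is used in place of max Q(s′, a′).

```python
gamma = 0.9
rewards = [1.5, 2.0, 2.5, 3.0, 3.2]

q = 0.0  # initial Q-value
history = []
for r in rewards:
    # With learning rate 1.0 the new estimate equals the target.
    q = r + gamma * q
    history.append(round(q, 5))  # round away floating-point noise

print(history)  # [1.5, 3.35, 5.515, 7.9635, 10.36715]
```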
References:
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press.
- Silver, D. et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529, 484–489.