Meet the Learners: RL Agents

This page was last edited on 17 September 2025

An agent is the core decision-maker in Reinforcement Learning (RL).

It observes the environment, takes actions, and learns from the outcomes.

The goal of the agent is to maximize the total reward over time.

In short: Agent = learner + decision-maker.

Why do we use an agent in Deep RL?

In Deep RL, the agent learns good behavior in complex environments. It learns through trial and error, guided by reward signals rather than labeled examples.

The agent replaces hardcoded logic with a policy or value function represented by a deep neural network. This lets it scale to large state spaces, such as images or raw sensor readings, and adapt to real-world problems.

Equation

A value-based agent in Deep RL uses the Bellman equation for the optimal action-value function as its learning target:

    \[ Q(s, a) = r + \gamma \max_{a'} Q(s', a') \]

Where:

  • Q(s,a): the value of taking action a in state s
  • r: immediate reward
  • γ: discount factor, between 0 and 1 (how much future rewards count)
  • s′: next state
  • a′: an action available in the next state

This is how the agent updates its knowledge: it treats the right-hand side as a target and moves its current estimate of Q(s,a) toward it.
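
The update above can be sketched for a small lookup table. This is a minimal illustration rather than a definitive implementation: the state names, the action set, the step size alpha, and the helper names bellman_target and q_update are all assumptions made for the example.

```python
from collections import defaultdict


def bellman_target(q, reward, next_state, actions, gamma):
    """Return r + gamma * max over a' of Q(s', a') for a dict-based Q-table."""
    best_next = max(q[(next_state, a)] for a in actions)
    return reward + gamma * best_next


def q_update(q, state, action, reward, next_state, actions, gamma=0.9, alpha=1.0):
    """Move Q(s, a) toward the Bellman target by step size alpha."""
    target = bellman_target(q, reward, next_state, actions, gamma)
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]


q = defaultdict(float)        # every Q(s, a) starts at 0
actions = ["left", "right"]   # hypothetical action set
q[("s1", "right")] = 2.0      # pretend the agent already values this pair

new_value = q_update(q, "s0", "right", reward=1.0, next_state="s1", actions=actions)
print(new_value)              # 1.0 + 0.9 * 2.0, so roughly 2.8
```

With alpha = 1.0 the old estimate is overwritten by the target. A smaller alpha blends the old estimate with the target, which is the usual choice when rewards are noisy.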

ANALOGY

Imagine a robot vacuum.
It learns where dirt collects and which areas to clean first.
At first, it bumps into walls (random actions).
Later, it learns which paths give the cleanest floors (best rewards).
That robot is the agent—it learns by interacting with the world.

HISTORY

The concept of an agent dates back to the 1950s in cybernetics. In RL, it was formalized largely through the work of Richard Sutton and Andrew Barto in the 1980s.

One of the first widely known successes was TD-Gammon (1992), an RL agent developed by Gerald Tesauro that learned to play backgammon at expert level.

Since then, agents have powered AlphaGo, robotics, and self-driving cars.

Steps for implementing an Agent

A Deep RL agent typically follows these steps:

  1. Observe the environment (state).
  2. Decide what action to take (policy).
  3. Act in the environment.
  4. Receive reward and observe the new state.
  5. Learn from the transition using the Bellman update or gradients.
  6. Repeat.
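
The six steps above can be sketched as a loop. The example below is a tabular stand-in, not a deep network, and everything in it is an assumption made for illustration: the CorridorEnv toy environment, its reward of 1.0 at the right end, and the hyperparameter values.

```python
import random
from collections import defaultdict


class CorridorEnv:
    """Hypothetical toy environment: positions 0..4, reward 1.0 on reaching 4."""

    def __init__(self, length=5):
        self.length = length
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        # action 0 moves left, action 1 moves right; walls clamp the position
        move = 1 if action == 1 else -1
        self.pos = min(max(self.pos + move, 0), self.length - 1)
        done = self.pos == self.length - 1
        reward = 1.0 if done else 0.0
        return self.pos, reward, done


def train(episodes=200, gamma=0.9, alpha=0.5, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    env = CorridorEnv()
    actions = [0, 1]
    q = defaultdict(float)
    for _ in range(episodes):
        state = env.reset()                                # 1. observe
        done = False
        while not done:
            if rng.random() < epsilon:                     # 2. decide (explore)
                action = rng.choice(actions)
            else:                                          # 2. decide (exploit)
                best = max(q[(state, a)] for a in actions)
                action = rng.choice([a for a in actions if q[(state, a)] == best])
            next_state, reward, done = env.step(action)    # 3-4. act, get reward
            best_next = 0.0 if done else max(q[(next_state, a)] for a in actions)
            target = reward + gamma * best_next            # 5. learn
            q[(state, action)] += alpha * (target - q[(state, action)])
            state = next_state                             # 6. repeat
    return q


q = train()
greedy = [max([0, 1], key=lambda a: q[(s, a)]) for s in range(4)]
print(greedy)
```

After training, the greedy action in every non-terminal state should be 1 (move right), since that is the only way to reach the reward. A deep RL agent keeps the same loop but replaces the table q with a neural network trained by gradient descent.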

Agent inputs and outputs

Inputs:

  • Current state s
  • Past experiences (in some cases)
  • Hyperparameters (learning rate, γ, ε, etc.)

Outputs:

  • Action a
  • Updated policy or value function
  • Learned model parameters
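
One way to picture these inputs and outputs is as a small class. The sketch below is an assumption-laden illustration: the names Hyperparameters and TabularAgent, the default values, and the use of a dictionary as the "learned parameters" are all hypothetical choices for the example.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Hyperparameters:
    learning_rate: float = 0.1
    gamma: float = 0.9      # discount factor
    epsilon: float = 0.1    # exploration rate


@dataclass
class TabularAgent:
    actions: list
    params: Hyperparameters = field(default_factory=Hyperparameters)
    q: dict = field(default_factory=dict)        # learned parameters: Q(s, a)
    memory: list = field(default_factory=list)   # past experiences

    def act(self, state, rng=random):
        """Input: current state. Output: an action (epsilon-greedy)."""
        if rng.random() < self.params.epsilon:
            return rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.q.get((state, a), 0.0))

    def learn(self, state, action, reward, next_state):
        """Input: one transition. Output: the updated value of Q(s, a)."""
        self.memory.append((state, action, reward, next_state))
        best_next = max(self.q.get((next_state, a), 0.0) for a in self.actions)
        old = self.q.get((state, action), 0.0)
        target = reward + self.params.gamma * best_next
        self.q[(state, action)] = old + self.params.learning_rate * (target - old)
        return self.q[(state, action)]


agent = TabularAgent(actions=["left", "right"])
updated = agent.learn("s0", "right", reward=1.0, next_state="s1")
action = agent.act("s0", rng=random.Random(0))
print(action, updated)
```

The act method maps the state input to the action output, while learn turns a transition into an updated value function. The memory list is only filled here, not replayed, so it stands in for the "past experiences" input without implementing experience replay.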

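For a worked example, we assume that:

  • γ=0.9
  • Learning rate = 1.0 (for simplicity)
  • Initial Q-values: Q(s,a) = 0
  • Rewards per step: 1.5, 2.0, 2.5, 3.0, 3.2
  • The next-state value max Q(s′,a′) in each iteration equals the Q-value computed in the previous iteration (shown as Q below)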

Bellman update:

    \[ Q(s, a) = r + \gamma \max_{a'} Q(s', a') \]

ITERATION 1

  • r = 1.5, Q = 0
  • Q(s,a) = 1.5 + 0.9 * 0 = 1.5

ITERATION 2

  • r = 2.0, Q = 1.5
  • Q(s,a) = 2.0 + 0.9 * 1.5 = 3.35

ITERATION 3

  • r = 2.5, Q = 3.35
  • Q(s,a) = 2.5 + 0.9 * 3.35 = 5.515

ITERATION 4

  • r = 3.0, Q = 5.515
  • Q(s,a) = 3.0 + 0.9 * 5.515 = 7.9635

ITERATION 5

  • r = 3.2, Q = 7.9635
  • Q(s,a) = 3.2 + 0.9 * 7.9635 = 10.36715
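
The five iterations above can be checked with a few lines of code. This sketch assumes, as the arithmetic does, that the previous Q-value stands in for the next-state maximum and that the learning rate is 1.0. The variable names are illustrative.

```python
gamma = 0.9
rewards = [1.5, 2.0, 2.5, 3.0, 3.2]

q = 0.0        # initial Q-value
history = []
for step, r in enumerate(rewards, start=1):
    # Learning rate 1.0: the new estimate is the Bellman target itself,
    # with the previous estimate standing in for max over a' of Q(s', a').
    q = r + gamma * q
    history.append(q)
    print(f"iteration {step}: Q = {q:.5f}")
```

Up to floating-point rounding, the printed values match the hand calculation: 1.5, 3.35, 5.515, 7.9635, and 10.36715.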

Figure: Agent Performance Over 5 Iterations

The figure above shows how the agent improves. With each interaction it updates its Q-value, and the estimate rises steadily from 1.5 after the first iteration to about 10.37 after the fifth.

