Model-Free agent

This page was last edited on 23 March 2026

Model-free learning is when an agent learns to act without knowing how the environment works. It doesn’t build a model of the world. It learns only from experience: it takes actions and receives rewards. That’s why we call it model-free reinforcement learning.

Why use model-free learning in Deep RL?

Because sometimes modeling the environment is too hard or impossible. Environments can be complex, dynamic, or unknown. Model-free methods just need reward signals from real interaction. They don’t need to simulate transitions. That makes them flexible and widely applicable.

ANALOGY

Imagine a dark room. We don’t know where anything is. We bump into things and slowly learn where to go and what to avoid.

No map. No model. Just experience. That’s model-free learning.

HISTORY

Model-free learning started in the 1980s. Q-Learning (Watkins, 1989) was one of the first. It became popular because it works even when the environment is unknown.

It was first used in simple games and robotics. Later it was scaled up with deep learning. That’s how we got Deep Q-Networks (DQN) from DeepMind (2013–2015).

How does a Model-Free agent learn?

From experience:

No “mental” simulations.
Just trial and error. Like a baby learning to walk.
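
The trial-and-error loop above can be sketched with tabular Q-learning. This is only a minimal illustration, not a reference implementation: the five-cell corridor environment, the `step` and `train` names, and the hyperparameter values are all assumptions made for the example. The agent calls `step` to act, but never reads or copies its rules.

```python
import random

# Hypothetical toy environment: a corridor of 5 cells.
# The agent starts in cell 0 and the goal is cell 4.
N_STATES = 5
ACTIONS = (0, 1)  # 0 = move left, 1 = move right

def step(state, action):
    """Environment dynamics. The agent only sees the returned values."""
    if action == 0:
        next_state = max(0, state - 1)
    else:
        next_state = min(N_STATES - 1, state + 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done

def train(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Learn Q(s, a) purely from sampled transitions (s, a, r, s')."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy: explore sometimes, and break ties at random.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.choice(ACTIONS)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state, reward, done = step(state, action)
            # Q-learning target: no bootstrapping past a terminal state.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q

q_table = train()
# Greedy action in every non-terminal cell (1 means "move right").
greedy_policy = [0 if left > right else 1 for left, right in q_table[:-1]]
```

After training, the greedy policy moves right in every cell, even though the agent was never told how the corridor works.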

Main categories of Model-Free algorithms

  • Value-based: DQN, Double DQN, Dueling DQN
  • Policy-based: REINFORCE, PPO (when no model is used)
  • Actor-Critic: A2C, when no model is involved

Key benefits

  • Works well in real-world systems
  • No need to model the environment
  • Simple to implement
  • Great when dynamics are unknown or too complex to model

Downsides

  • Needs a lot of data (sample inefficient)
  • May be unstable or slow to converge
  • Can’t simulate future actions (no planning)
  • Harder to transfer learning to new tasks

When to use a Model-Free algorithm

  • When the environment model is missing or unreliable
  • When using real robots or physical devices
  • When all data comes from actual interactions
  • When you want to keep things simple and practical

Popular Model-Free algorithms

  • DQN – Discrete actions, but may overestimate Q-values
  • Double/Dueling DQN – Better stability; Double DQN reduces the overestimation of Q-values
  • REINFORCE – Direct policy learning, but high variance
  • PPO – Robust and widely used in modern RL
  • A2C – Combines policy + value without a model
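
For reference, the standard forms behind three of the entries above are sketched here. The notation is an assumption of this sketch and does not come from the page: θ for the online network parameters, θ⁻ for the target network parameters, γ for the discount factor, and G_t for the return from step t.

```latex
% DQN target: the target network both selects and evaluates the next
% action, which tends to overestimate Q-values.
y^{\text{DQN}} = r + \gamma \max_{a'} Q(s', a'; \theta^{-})

% Double DQN target: the online network selects the action,
% the target network evaluates it.
y^{\text{Double}} = r + \gamma \, Q\bigl(s', \arg\max_{a'} Q(s', a'; \theta);\, \theta^{-}\bigr)

% REINFORCE: policy gradient estimated from sampled returns,
% which is the source of its high variance.
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\Bigl[\sum_t G_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Bigr]
```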

Technical facts you should know

  • The agent doesn’t learn the transition function
  • The agent doesn’t learn the reward function
  • Each action is real, no imagination
  • Learning happens only from actual episodes
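
One way to see these facts is to compare the Bellman optimality equation with the Q-learning update. The symbols P (transition function), R (reward function), α (learning rate), and γ (discount factor) are the usual textbook notation and are assumed here, not taken from this page.

```latex
% With a model: the expectation is computed from P and R.
Q^{*}(s, a) = \sum_{s'} P(s' \mid s, a)\,\bigl[R(s, a, s') + \gamma \max_{a'} Q^{*}(s', a')\bigr]

% Model-free (Q-learning): one real transition (s, a, r, s') stands in
% for the expectation, so neither P nor R appears.
Q(s, a) \leftarrow Q(s, a) + \alpha\,\bigl[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\bigr]
```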

What are the inputs and outputs of a model-free agent?

Inputs:

  • State (s)
  • Sometimes reward (r) and next state (s’)
  • Policy-based: gradients and log-probabilities

Outputs:

  • Action (a) — what to do in the state
  • Value (Q(s, a) or V(s)) — used for learning
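
As a rough sketch of these inputs and outputs, the two functions below show how a value-based agent and a policy-based agent might turn per-state numbers into an action. The function names and the example numbers are hypothetical; a real agent would get the Q-values or logits from a learned table or network.

```python
import math
import random

def value_based_act(q_values):
    """Value-based output: the action with the highest Q(s, a)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

def policy_based_act(logits, rng):
    """Policy-based output: a sampled action and its log-probability."""
    # Softmax with the max subtracted for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    action = rng.choices(range(len(probs)), weights=probs)[0]
    return action, math.log(probs[action])

rng = random.Random(0)
# Made-up Q-values for one state with three actions.
greedy_action = value_based_act([0.2, 1.5, -0.3])
# Two equally likely actions, so the log-probability is log(0.5).
sampled_action, log_prob = policy_based_act([0.0, 0.0], rng)
```

The log-probability returned by the policy-based version is what a method such as REINFORCE differentiates during learning.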
