This page was last edited on 12 November 2025
On-Policy Learning evaluates and improves the same policy that generates the agent's actions.
Off-Policy Learning improves one policy (the target policy) using data collected by a different one (the behavior policy).
In short:
- On-policy: Learn from what you do.
- Off-policy: Learn from what others (or past versions of yourself) did.
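The difference is easiest to see in the tabular update rules of SARSA (on-policy) and Q-learning (off-policy). The sketch below is a minimal illustration, not a full agent: the state names, the Q-values, and the constants `ALPHA` and `GAMMA` are made up for the example.

```python
from collections import defaultdict

ALPHA, GAMMA = 0.5, 0.9  # illustrative learning rate and discount factor


def sarsa_update(q, s, a, r, s_next, a_next):
    """On-policy: bootstrap from the action the agent actually takes next."""
    target = r + GAMMA * q[(s_next, a_next)]
    q[(s, a)] += ALPHA * (target - q[(s, a)])


def q_learning_update(q, s, a, r, s_next, actions):
    """Off-policy: bootstrap from the greedy action, whatever was taken."""
    target = r + GAMMA * max(q[(s_next, b)] for b in actions)
    q[(s, a)] += ALPHA * (target - q[(s, a)])


actions = ["left", "right"]
q_on = defaultdict(float)   # table updated with SARSA
q_off = defaultdict(float)  # table updated with Q-learning
for q in (q_on, q_off):
    q[("s1", "left")] = 1.0   # made-up starting estimates
    q[("s1", "right")] = 3.0

# Same transition for both learners; the agent then explores with "left" in s1.
sarsa_update(q_on, "s0", "right", 1.0, "s1", "left")
q_learning_update(q_off, "s0", "right", 1.0, "s1", actions)

print(q_on[("s0", "right")], q_off[("s0", "right")])
```

With these numbers SARSA moves its estimate to about 0.95 because it follows the exploratory action, while Q-learning moves to about 1.85 because it assumes the greedy action will be taken next.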
Why do we use On-Policy vs Off-Policy Learning in Deep RL?
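The choice determines which experience an agent can learn from and how it explores.
- On-policy learning tends to be more stable, but it is often sample-inefficient because data collected under an old policy is discarded after each update.
- Off-policy learning is more sample-efficient because it can reuse past experience or experience gathered by other agents.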
Choosing between them depends on:
- data availability;
- task complexity;
- and computation limits.
ANALOGY
On-policy is like a chef who learns only by tasting their own cooking.
Off-policy is like a chef who learns by watching others cook and improving recipes based on those observations.
HISTORY
The distinction started early in RL theory.
- Q-learning (1989): First major off-policy algorithm.
- SARSA (introduced 1994, named in 1996): Popular on-policy method.
- Deep RL extended these ideas:
  - DQN (2015) = off-policy
  - A3C/PPO (2016–2017) = on-policy
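The split reflects a trade-off under limited compute and data: off-policy methods reuse stored experience, while on-policy methods need fresh data after every policy update.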
On-Policy vs Off-Policy implementation steps
On-Policy (e.g. PPO)
- Collect trajectories using the current policy.
- Calculate returns and advantages from those trajectories.
- Update the policy with gradient ascent on the clipped surrogate objective.
- Repeat.
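The return, advantage, and objective calculations in the steps above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not a reference PPO implementation: the rewards, the critic estimates in `values`, and the action probabilities are invented. Advantages are simply returns minus a value baseline. The code only evaluates the clipped objective, whereas a real implementation would differentiate it with an autodiff library.

```python
import math


def discounted_returns(rewards, gamma=0.99):
    """Return-to-go G_t = r_t + gamma * G_{t+1} for one finished episode."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]


def clipped_surrogate(new_logp, old_logp, advantages, clip_eps=0.2):
    """Mean PPO clipped objective; the policy update tries to maximize it."""
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(lp_new - lp_old)  # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


rewards = [1.0, 0.0, 2.0]  # one short trajectory from the current policy
values = [2.0, 1.5, 1.0]   # hypothetical critic estimates V(s_t)

returns = discounted_returns(rewards, gamma=0.5)
advantages = [g - v for g, v in zip(returns, values)]

# Log-probabilities of the taken actions before and after a candidate update.
old_logp = [math.log(0.5)] * 3
new_logp = [math.log(0.5), math.log(0.25), math.log(0.75)]

objective = clipped_surrogate(new_logp, old_logp, advantages)
print(returns, advantages, objective)
```

The clipping keeps each probability ratio's contribution within `1 ± clip_eps`, which limits how far a single update can move the policy away from the one that collected the data.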
Off-Policy (e.g. DQN)
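- Store transitions (state, action, reward, next state) in a replay buffer.
- Sample a random mini-batch from the buffer.
- Compute target Q-values, typically with a separate target network.
- Update the Q-network by minimizing the loss between predicted and target Q-values.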
- Repeat.
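The buffer, sampling, and target steps above can be sketched without a deep learning library. The example below is a minimal illustration: `ReplayBuffer`, `td_targets`, and `toy_target_q` are hypothetical names, and `toy_target_q` is a stand-in function where a real DQN would use a neural target network.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of transitions; the oldest entries are dropped first."""

    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement breaks temporal correlation.
        return rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


def td_targets(batch, target_q, gamma=0.99):
    """y = r if the episode ended, else r + gamma * max_a' Q_target(s', a')."""
    targets = []
    for _, _, reward, next_state, done in batch:
        bootstrap = 0.0 if done else gamma * max(target_q(next_state))
        targets.append(reward + bootstrap)
    return targets


def toy_target_q(state):
    # Hypothetical stand-in for a target network: Q-values for two actions.
    return [0.1 * state, 0.2 * state]


buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.add(state=t, action=t % 2, reward=1.0, next_state=t + 1, done=(t == 4))

batch = buffer.sample(2, rng=random.Random(0))
targets = td_targets(batch, toy_target_q, gamma=0.9)
print(len(buffer), batch, targets)
```

Because the transitions in the buffer may come from older versions of the policy, this loop is off-policy: the greedy `max` in the target does not depend on which policy collected the data.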
References:
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press.
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347. (PPO, on-policy)