This page was last edited on 13 June 2025
Proximal Policy Optimization (PPO) is a policy gradient algorithm. It trains an agent to make better decisions by updating the policy in small steps, using a clipped objective function that limits how far each update can move the policy. Simple, stable, and powerful.
Why Choose PPO?
Relatively easy to tune, works across a wide range of environments, and trains stably. More stable than vanilla policy gradients, and easier to implement than TRPO because it needs only first-order optimization.
What Type of Learning Is It?
Reinforcement learning, specifically a policy-based (actor-critic) method, usually implemented with neural networks.
Model-Free or Model-Based?
Model-Free.
PPO doesn’t try to learn a model of the environment’s dynamics. It just interacts with the environment and learns what to do from the rewards it observes.
What Is It Trying to Compute?
A stochastic policy π(a|s), which tells the agent the probability of taking action a in state s. It also learns a value function V(s).
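To make π(a|s) concrete, here is a minimal sketch that assumes a discrete action space and a softmax over raw scores (logits). In a real agent the logits would come from the policy network; the function names `softmax` and `sample_action` and the example logits are illustrative, not part of any library.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into action probabilities pi(a|s)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(logits, rng=random):
    """Sample an action index from pi(a|s); also return its log-probability."""
    probs = softmax(logits)
    action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return action, math.log(probs[action])


# Hypothetical logits for one state with three possible actions.
logits = [2.0, 1.0, 0.1]
probs = softmax(logits)
action, log_prob = sample_action(logits, random.Random(0))
```

The log-probability is stored alongside the action because PPO later needs it as the "old policy" term when it compares the updated policy against the one that collected the data.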
Training Loop
- Collect experience by running the current policy.
- Compute advantages using GAE.
- Update policy using clipped loss.
- Repeat until convergence.
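The advantage step in the loop above can be sketched as follows. This is a plain-Python illustration of Generalized Advantage Estimation over a single rollout, not a reference implementation; the function name `compute_gae`, the argument layout, and the default values `gamma=0.99` and `lam=0.95` are assumptions chosen for the example.

```python
def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.

    rewards[t], values[t] and dones[t] describe step t, where dones[t] is
    True if the episode ended at that step. last_value is the value
    estimate for the state reached after the final step.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    # Walk backwards so each step can reuse the accumulated future advantage.
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    # Value-function targets are the advantages plus the value baseline.
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


# Hypothetical three-step episode that terminates on the last step.
rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5]
dones = [False, False, True]
advantages, returns = compute_gae(rewards, values, dones, last_value=0.0)
```

Setting `lam=1.0` reduces the estimate to the discounted return minus the value baseline, while `lam=0.0` reduces it to the one-step TD error; values in between trade variance against bias.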
On-Policy or Off-Policy?
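On-Policy. PPO learns only from data generated by the current policy. Each batch of rollouts is reused for a few epochs of minibatch updates and then discarded; there is no replay buffer of old experience.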
Exploration vs. Exploitation?
PPO balances both using:
- Entropy bonus → keeps exploration alive.
- Advantage estimation → improves exploitation.
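One way to picture the entropy bonus is as an extra term subtracted from the loss, so that more random (higher-entropy) policies are slightly favored. The sketch below assumes a discrete action space; the helper names and the coefficients `vf_coef=0.5` and `ent_coef=0.01` are illustrative assumptions, not prescribed values.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a categorical action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def total_loss(policy_loss, value_loss, probs_batch, vf_coef=0.5, ent_coef=0.01):
    """Combine the loss terms; the entropy term lowers the loss for more random policies."""
    mean_entropy = sum(entropy(p) for p in probs_batch) / len(probs_batch)
    return policy_loss + vf_coef * value_loss - ent_coef * mean_entropy


# Hypothetical batches: one uniform policy, one fully deterministic policy.
uniform = [[0.25, 0.25, 0.25, 0.25]]
deterministic = [[1.0, 0.0, 0.0, 0.0]]
loss_uniform = total_loss(1.0, 1.0, uniform)
loss_deterministic = total_loss(1.0, 1.0, deterministic)
```

With the other terms equal, the uniform policy gets the lower loss, which is what keeps the policy from collapsing to a single action too early.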
When Does It Converge?
In practice, training is considered converged when the average return plateaus and the KL divergence between the old and new policy stays low. How long that takes depends on the reward signal and environment complexity.
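The KL divergence between the old and new policy can be estimated from the log-probabilities that PPO already stores. The sketch below uses the simple sample estimate (mean of old minus new log-probability over actions drawn from the old policy); the function names and the `target_kl=0.015` threshold are assumptions for illustration.

```python
def approx_kl(old_log_probs, new_log_probs):
    """Sample estimate of KL(pi_old || pi_new) from actions drawn under pi_old."""
    diffs = [old - new for old, new in zip(old_log_probs, new_log_probs)]
    return sum(diffs) / len(diffs)


def should_stop_early(old_log_probs, new_log_probs, target_kl=0.015):
    """Signal that the update epochs should stop because the policy moved too far."""
    return approx_kl(old_log_probs, new_log_probs) > target_kl


# Hypothetical log-probabilities of the same actions before and after an update.
old_lp = [-1.0, -0.5, -2.0]
new_lp = [-1.1, -0.6, -2.1]
kl = approx_kl(old_lp, new_lp)
stop = should_stop_early(old_lp, new_lp)
```

This estimate is noisy and can even come out negative on a small batch, so it is best read as a rough monitoring signal rather than an exact distance.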
Where Does It Struggle?
- Sparse rewards
- Partial observability
- Real-time or hardware-limited systems, where collecting fresh on-policy samples for every update is expensive
What Problems Is It Good For?
- Continuous control
- Robotics
- Tasks where stability matters more than sample efficiency
Common Traps and Mistakes
- Poorly chosen clipping value ϵ (too high = unstable updates, too low = slow learning)
- Improper advantage normalization
- Ignoring the KL divergence between the old and new policy during updates
- Too few epochs or too small a batch size
- Forgetting to reset the environment when an episode ends
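For the advantage normalization item, a common pattern is to rescale each batch of advantages to zero mean and unit standard deviation before computing the policy loss. The sketch below is one plain-Python way to do that; the function name and the `eps` value are assumptions.

```python
import math


def normalize_advantages(advantages, eps=1e-8):
    """Rescale a batch of advantages to zero mean and (roughly) unit std."""
    n = len(advantages)
    mean = sum(advantages) / n
    variance = sum((a - mean) ** 2 for a in advantages) / n
    std = math.sqrt(variance)
    # eps guards against division by zero when all advantages are equal.
    return [(a - mean) / (std + eps) for a in advantages]


# Hypothetical raw advantages from one batch.
normalized = normalize_advantages([1.0, 2.0, 3.0, 4.0])
```

Normalizing per batch (rather than per trajectory or globally) keeps the scale of the policy gradient roughly constant across updates, which is one reason it is usually applied right before the loss is computed.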
PPO Clipped Objective Equation
\[
L^{\text{CLIP}}(\theta) = \mathbb{E}_t \left[ \min \left( r_t(\theta) \cdot \hat{A}_t, \; \text{clip}\left(r_t(\theta), 1 - \epsilon, 1 + \epsilon\right) \cdot \hat{A}_t \right) \right]
\]
Where:
- θ – The current parameters of the policy network (the weights we want to update).
- π_θ(a_t | s_t) – The probability (under the current policy) of taking action a_t in state s_t.
- π_θold(a_t | s_t) – The probability (under the old policy) of taking the same action a_t in the same state s_t, before the update.
- r_t(θ) – The probability ratio between the new and old policy, r_t(θ) = π_θ(a_t | s_t) / π_θold(a_t | s_t). It measures how much the new policy changed; r_t = 1 means no change for that action.
- Â_t – The advantage estimate at time step t. It tells us if the action was better (> 0) or worse (< 0) than expected.
- ϵ – The clipping threshold, usually 0.1 or 0.2. It limits how far the policy is allowed to change.
- clip(r_t, 1 − ϵ, 1 + ϵ) – Clips the ratio r_t to the range [1 − ϵ, 1 + ϵ]. This prevents overly large updates that might destabilize learning.
- min(⋅,⋅) – PPO takes the minimum of the unclipped and clipped terms, so the objective is a pessimistic bound: the policy gains nothing from pushing r_t beyond the clipping range, but is still penalized when a change makes things worse.
- E_t[⋅] – Expectation over time steps (or batch samples). In practice, we average the loss across the batch.
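The definitions above translate almost directly into code. The following is a minimal plain-Python sketch of the clipped surrogate for a batch of samples, assuming log-probabilities are available for both policies; the function name `clipped_surrogate` and its argument names are illustrative. It returns the objective to be maximized, so a gradient-descent optimizer would minimize its negative.

```python
import math


def clipped_surrogate(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    """Batch average of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    terms = []
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        # Probability ratio r_t, computed from log-probabilities for stability.
        ratio = math.exp(new_lp - old_lp)
        clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
        terms.append(min(ratio * adv, clipped_ratio * adv))
    return sum(terms) / len(terms)


# Hypothetical single-sample cases: the ratio is 1.5 in both.
gain_capped = clipped_surrogate([math.log(1.5)], [0.0], [1.0])    # good action
loss_uncapped = clipped_surrogate([math.log(1.5)], [0.0], [-1.0])  # bad action
```

In the first case the clipped term is the smaller one, so the objective stops rewarding further increases in the ratio. In the second case the unclipped term is the smaller one, so making a bad action more likely is penalized in full.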