Proximal Policy Optimization (PPO)
Clipped surrogate objective, generalized advantage estimation (GAE), and PPO with stable-baselines3.
Why PPO?
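PPO is a widely used default RL algorithm: stable, relatively easy to tune, and more sample-efficient than vanilla policy gradients because it reuses each batch of experience for several gradient steps. It addresses a core problem of policy-gradient methods (a single update that is too large can destroy the policy) by limiting how much the policy can change per update.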
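As a rough illustration of how that limit works, here is a minimal sketch of the per-sample clipped surrogate objective in plain Python. The function name `clipped_surrogate` and the default `clip_eps=0.2` are assumptions made for this example, not code from the course. Here `ratio` stands for the probability of the taken action under the new policy divided by its probability under the old policy.

```python
def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """Per-sample PPO clipped objective (a value to maximize).

    ratio:     pi_new(a|s) / pi_old(a|s) for the sampled action
    advantage: advantage estimate for that action
    clip_eps:  half-width of the trust interval around ratio = 1 (assumed 0.2)
    """
    # Restrict the ratio to [1 - clip_eps, 1 + clip_eps].
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum makes the objective a pessimistic bound, so the
    # policy gains nothing by moving the ratio beyond the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)


# Good action (positive advantage): the reward for raising its probability
# stops growing once the ratio passes 1 + clip_eps.
capped = clipped_surrogate(1.5, advantage=2.0)
uncapped = clipped_surrogate(1.1, advantage=2.0)

# Bad action (negative advantage): the objective is flat once the ratio
# drops below 1 - clip_eps, so there is no incentive to push it further.
floor = clipped_surrogate(0.5, advantage=-1.0)

print(capped, uncapped, floor)
```

In a full training loop this quantity would be averaged over a batch and maximized with respect to the policy parameters; the sketch only shows the clipping logic on single numbers.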
The Danger of Big Updates
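A single overly large policy update can collapse performance. Because policy-gradient methods such as PPO are on-policy, the damaged policy also collects the next batch of training data, so recovery is hard. PPO keeps each update proximal (close) to the current policy to reduce this risk.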
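To make "proximal" concrete, the sketch below compares a small and a large change to a toy three-action softmax policy and checks whether the resulting probability ratios stay inside a clip range. The policy, the logit values, the helper names, and the range of 0.2 are all hypothetical choices for illustration.

```python
import math


def softmax(logits):
    """Convert logits to action probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def probability_ratios(old_logits, new_logits):
    """Ratio pi_new(a) / pi_old(a) for every action."""
    old_probs = softmax(old_logits)
    new_probs = softmax(new_logits)
    return [n / o for n, o in zip(new_probs, old_probs)]


def within_clip_range(ratios, clip_eps=0.2):
    """True if every ratio lies in [1 - clip_eps, 1 + clip_eps]."""
    return all(1.0 - clip_eps <= r <= 1.0 + clip_eps for r in ratios)


old_logits = [0.0, 0.0, 0.0]    # uniform starting policy
small_step = [0.1, 0.0, -0.1]   # gentle update
large_step = [3.0, 0.0, -3.0]   # aggressive update

small_ratios = probability_ratios(old_logits, small_step)
large_ratios = probability_ratios(old_logits, large_step)

print("small step stays proximal:", within_clip_range(small_ratios))
print("large step stays proximal:", within_clip_range(large_ratios))
```

The gentle update keeps every ratio near 1, while the aggressive one makes one action almost certain and nearly eliminates another, which is the kind of jump the clip range is meant to discourage.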