Learn AI with Python · Lesson

Proximal Policy Optimization (PPO)

Clipped surrogate objective, GAE advantage estimation, PPO with stable-baselines3.

Why PPO?

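PPO is a widely used default among policy-gradient RL algorithms. It is stable and comparatively easy to tune, and because it reuses each batch of collected experience for several gradient steps, it is more sample-efficient than basic policy-gradient methods such as REINFORCE. It addresses a core problem of policy gradients: an update that is too large can destroy the policy. PPO limits how much the policy can change per update.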
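One way to make "limiting how much the policy can change" concrete is the clipped surrogate objective. The sketch below is a minimal pure-Python illustration, not a training loop: the function name `clipped_surrogate`, its argument names, and the default `clip_eps=0.2` (a commonly used value) are assumptions chosen for this example.

```python
import math

def clipped_surrogate(new_logp, old_logp, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective over a batch (a quantity to maximize).

    new_logp / old_logp: log-probabilities of the taken actions under the
    updated policy and under the policy that collected the data.
    advantages: advantage estimates for those actions.
    clip_eps: assumed clip range; 0.2 is a common choice, not a requirement.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        # Probability ratio between the new and the old policy.
        ratio = math.exp(lp_new - lp_old)
        # Keep the ratio inside [1 - clip_eps, 1 + clip_eps].
        clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(advantages)

# Hypothetical batch of three samples, for illustration only.
old = [math.log(0.5), math.log(0.25), math.log(0.4)]
new = [math.log(0.9), math.log(0.25), math.log(0.1)]
adv = [1.0, 2.0, -1.0]
print(clipped_surrogate(new, old, adv))
```

In a real implementation the same expression would be computed on tensors and negated to form a loss, so that a gradient-based optimizer can minimize it.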

The Danger of Big Updates

A single overly large policy update can collapse performance. Because PPO is on-policy, the collapsed policy then collects the data for the next update, so recovery is hard. PPO keeps each update proximal (close) to the current policy to stay safe.
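To see why clipping keeps an update close to the current policy, it can help to sweep the probability ratio and compare the plain objective with the clipped one. This is a small sketch under assumed values: the helper `ppo_term`, the ratios tested, and `clip_eps=0.2` are illustrative choices, not part of any library.

```python
def ppo_term(ratio, advantage, clip_eps=0.2):
    """Per-sample clipped objective for a given new/old probability ratio."""
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)

# Hypothetical ratios: 1.0 means the policy is unchanged for that action.
ratios = [0.5, 0.8, 1.0, 1.2, 1.5, 2.0]

for advantage in (1.0, -1.0):
    print(f"advantage = {advantage:+.1f}")
    for r in ratios:
        plain = r * advantage            # unclipped policy-gradient surrogate
        clipped = ppo_term(r, advantage)  # PPO's clipped version
        print(f"  ratio={r:.1f}  plain={plain:+.2f}  clipped={clipped:+.2f}")
```

For a positive advantage the clipped value stops growing once the ratio passes `1 + clip_eps`, and for a negative advantage it stops improving once the ratio drops below `1 - clip_eps`. Past those points the objective is flat, so there is no gradient pushing the policy further in that direction within a single update.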
