Policy Gradient Methods: REINFORCE
Policy gradient theorem, REINFORCE algorithm, baseline subtraction, variance reduction.
Value-Based vs Policy-Based RL
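Value-based methods learn the value of each action and then act greedily with respect to those estimates. Policy gradient methods instead learn the policy directly: a function that outputs action probabilities. Because the policy is itself a probability distribution, this approach represents stochastic policies naturally and extends to continuous action spaces, where taking a greedy maximum over all actions is impractical.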
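To make the contrast concrete, here is a minimal sketch. The Q-values and action probabilities are made-up numbers for a single hypothetical state, not outputs of any trained model.

```python
import torch

torch.manual_seed(0)

# Value-based: hypothetical action-value estimates Q(s, a) for one state.
q_values = torch.tensor([1.0, 2.5, 0.3])
greedy_action = torch.argmax(q_values).item()  # always the highest-valued action

# Policy-based: hypothetical action probabilities pi(a|s) for the same state.
action_probs = torch.tensor([0.2, 0.7, 0.1])
sampled_actions = torch.multinomial(action_probs, num_samples=1000, replacement=True)

# Count how often each action was drawn; every action gets some share.
counts = torch.bincount(sampled_actions, minlength=3)
```

The greedy rule returns the same action every time, while sampling from the policy spreads choices roughly in proportion to the probabilities, which gives the agent exploration without a separate mechanism.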
The Policy pi(a|s)
A parameterized policy pi(a|s, theta) maps a state s to a probability distribution over actions a. The parameters theta are typically the weights of a neural network. Training adjusts theta so that actions that earn more reward become more probable.
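import torch
import torch.nn as nn

class Policy(nn.Module):
    """Maps an observation to a probability distribution over discrete actions."""
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        # One hidden layer; the final layer outputs one logit per action.
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, n_actions)
        )

    def forward(self, s):
        # Softmax over the last dimension turns logits into action probabilities.
        return torch.softmax(self.net(s), dim=-1)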
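As a rough illustration of how such a policy might be used, the sketch below samples an action and applies one REINFORCE-style update. The observation size (4), the action count (2), the random stand-in state, the learning rate, and the return G = 1.0 are all assumptions chosen for illustration. The class is repeated only so the snippet runs on its own.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class Policy(nn.Module):
    # Same architecture as the Policy class above, repeated for self-containment.
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, s):
        return torch.softmax(self.net(s), dim=-1)

torch.manual_seed(0)
policy = Policy(obs_dim=4, n_actions=2)      # sizes are illustrative assumptions
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

state = torch.randn(4)                       # stand-in for an environment observation
probs = policy(state)                        # shape (2,), entries sum to 1
dist = Categorical(probs=probs)
action = dist.sample()                       # stochastic action choice
log_prob = dist.log_prob(action)             # log pi(action | state, theta)

G = 1.0                                      # hypothetical return that followed `action`
loss = -log_prob * G                         # REINFORCE surrogate loss for one step

optimizer.zero_grad()
loss.backward()                              # gradient of the loss with respect to theta
optimizer.step()

# Probability of the sampled action after the update.
new_prob = policy(state)[action].item()
```

Minimizing -log_prob * G pushes theta toward making the sampled action more likely when G is positive and less likely when G is negative. A full implementation would accumulate this loss over every step of one or more episodes using the actual returns.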