Learn AI with Python · Lesson

Policy Gradient Methods: REINFORCE

Policy gradient theorem, REINFORCE algorithm, baseline subtraction, variance reduction.

Value-Based vs Policy-Based RL

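Value-based methods such as Q-learning learn the value of each action and then act greedily with respect to those values. Policy gradient methods instead learn the policy directly: a function that outputs action probabilities. Because the policy is itself a probability distribution, this approach extends naturally to continuous action spaces and to stochastic policies.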
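To make the contrast concrete, here is a minimal sketch in plain Python. The numbers in q_values and action_probs are made up for illustration, and greedy_action and sample_action are hypothetical helper names, not part of any library.

```python
import random

# Hypothetical numbers for a single state with three possible actions.
q_values = [0.2, 1.5, 0.9]       # value-based: estimated return of each action
action_probs = [0.1, 0.6, 0.3]   # policy-based: pi(a|s), sums to 1


def greedy_action(q):
    # Value-based control: always pick the action with the highest estimate.
    return max(range(len(q)), key=lambda a: q[a])


def sample_action(probs, rng):
    # Policy-based control: draw an action in proportion to its probability.
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(0)
greedy = greedy_action(q_values)
samples = [sample_action(action_probs, rng) for _ in range(1000)]
```

The greedy rule returns the same action every time for a given state, while the sampled rule can return any action with nonzero probability, so exploration is built into the policy itself.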

The Policy pi(a|s)

A parameterized policy pi(a|s, theta) maps a state s to a probability distribution over actions a. The parameters theta are, in this lesson, the weights of a neural network. Training adjusts theta so that actions followed by higher reward become more probable.
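Written out, "adjusting theta to favor rewarding actions" usually means gradient ascent on the expected return. The standard REINFORCE form is sketched below. The symbols J, G_t, r_k, T and b are notation assumed here rather than taken from the lesson text, and discounting is left out for simplicity.

```latex
% Objective: expected total reward of an episode \tau generated by the policy
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T-1} r_t \right]

% REINFORCE estimator: weight each log-probability gradient by the return that followed
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T-1}
      \nabla_\theta \log \pi(a_t \mid s_t, \theta)\, G_t \right],
  \qquad G_t = \sum_{k=t}^{T-1} r_k

% Baseline subtraction: any b(s_t) that does not depend on a_t leaves the expectation unchanged
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T-1}
      \nabla_\theta \log \pi(a_t \mid s_t, \theta)\, \bigl( G_t - b(s_t) \bigr) \right]
```

A well-chosen baseline, such as an estimate of the state's value, typically reduces the variance of the gradient estimate without changing its expected value.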

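import torch
import torch.nn as nn

class Policy(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        # Small MLP: observation vector -> one logit per discrete action.
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, n_actions)
        )

    def forward(self, s):
        # Softmax over the last dimension turns logits into pi(a|s, theta).
        return torch.softmax(self.net(s), dim=-1)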
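One way this policy might be used in a single REINFORCE update is sketched below. The Policy class is repeated from the lesson so the block runs on its own. Everything else is an assumption made for illustration: the helper names discounted_returns and reinforce_loss, the discount gamma, the mean-return baseline, the Adam optimizer settings, and the random three-step "episode" that stands in for a real environment.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical


class Policy(nn.Module):
    # Same architecture as the lesson's Policy, repeated for self-containment.
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, s):
        return torch.softmax(self.net(s), dim=-1)


def discounted_returns(rewards, gamma=0.99):
    # G_t = r_t + gamma * G_{t+1}, computed backwards over one episode.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return torch.tensor(returns, dtype=torch.float32)


def reinforce_loss(log_probs, rewards, gamma=0.99):
    returns = discounted_returns(rewards, gamma)
    # Crude baseline (assumption): subtract this episode's mean return.
    advantages = returns - returns.mean()
    # Minimizing this loss follows the REINFORCE gradient estimate uphill.
    return -(torch.stack(log_probs) * advantages).sum()


torch.manual_seed(0)
policy = Policy(obs_dim=4, n_actions=2)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

# Hypothetical 3-step episode: random observations and made-up rewards.
observations = torch.randn(3, 4)
rewards = [1.0, 0.0, 2.0]

log_probs = []
for obs in observations:
    dist = Categorical(probs=policy(obs))     # pi(.|s, theta)
    action = dist.sample()                    # stochastic action choice
    log_probs.append(dist.log_prob(action))   # log pi(a|s, theta), keeps the graph

loss = reinforce_loss(log_probs, rewards)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In a real training loop the observations and rewards would come from stepping an environment with the sampled actions. A learned value function is a common replacement for the mean-return baseline used here.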
