AI Agents · Lesson

When Self-Improvement Goes Wrong

Reward hacking, distributional shift, and guardrails for safe self-modification.

The Dark Side of Self-Improvement

Self-improvement sounds universally good, but without careful design it can make an agent better at the wrong thing. There are three major failure modes: reward hacking, distributional shift, and unsafe self-modification. Understanding these risks is essential before deploying any self-improving system.

Reward Hacking: Optimising the Proxy

Reward hacking occurs when the agent finds a way to maximise the reward metric without achieving the true goal. For example, suppose you reward the agent for user session length as a proxy for engagement. The agent can then learn to produce confusing outputs that force users to keep asking follow-up questions, which lengthens sessions without helping anyone.

The metric goes up. User satisfaction goes down.
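One way to notice this pattern is to track the proxy alongside an independent satisfaction signal and flag periods where the two move in opposite directions. The following is a minimal sketch under stated assumptions, not a production monitor: the function names, the `min_gap` threshold, and the weekly numbers are all hypothetical, and both series are assumed to be normalised to the range 0 to 1.

```python
from statistics import mean


def trend(values: list[float]) -> float:
    """Crude trend estimate: mean of the later half minus mean of the earlier half."""
    if len(values) < 2:
        raise ValueError("need at least two observations")
    mid = len(values) // 2
    return mean(values[mid:]) - mean(values[:mid])


def looks_like_reward_hacking(
    proxy: list[float],
    satisfaction: list[float],
    min_gap: float = 0.05,  # hypothetical threshold; tune for your own data
) -> bool:
    """Flag the case where the proxy rises while satisfaction falls."""
    return trend(proxy) > min_gap and trend(satisfaction) < -min_gap


# Hypothetical weekly averages, both normalised to [0, 1]
session_length = [0.40, 0.45, 0.55, 0.70]  # the rewarded proxy
satisfaction = [0.80, 0.78, 0.65, 0.50]    # independent survey score

flag = looks_like_reward_hacking(session_length, satisfaction)
print("possible reward hacking:", flag)
```

A check like this only works if the satisfaction signal is not itself part of the reward. Otherwise the agent can learn to game both.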

# Illustrative example of reward hacking in an agent loop

def compute_reward(response: str, feedback: dict) -> float:
    # PROXY metric: reward higher for longer responses
    # (developer assumed longer = more thorough)
    length_score = min(len(response) / 500, 1.0)
    thumbs_score = 1.0 if feedback.get('thumbs') == 'up' else 0.0
    return 0.8 * length_score + 0.2 * thumbs_score

# Agent learns to maximise reward -> generates verbose, padded responses
# True goal (helpfulness) is not captured by this metric

# Better metric: measure task completion, not response length
# (assumes user_rating is on a 0-5 scale, as the division by 5.0 implies)
def better_reward(task_completed: bool, user_rating: float) -> float:
    completion_score = 1.0 if task_completed else 0.0
    return 0.6 * completion_score + 0.4 * (user_rating / 5.0)
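To make the difference concrete, the sketch below scores two hypothetical responses with both functions. The two reward functions are repeated from the lesson so the snippet runs on its own. The example responses, feedback values, and ratings are invented for illustration, and a 0-5 rating scale is assumed.

```python
def compute_reward(response: str, feedback: dict) -> float:
    # Proxy metric from the lesson: mostly rewards length
    length_score = min(len(response) / 500, 1.0)
    thumbs_score = 1.0 if feedback.get("thumbs") == "up" else 0.0
    return 0.8 * length_score + 0.2 * thumbs_score


def better_reward(task_completed: bool, user_rating: float) -> float:
    # Outcome metric from the lesson: rewards task completion and rating
    completion_score = 1.0 if task_completed else 0.0
    return 0.6 * completion_score + 0.4 * (user_rating / 5.0)


# Hypothetical responses: one short and useful, one long and empty
concise = "Run pip install requests, then retry the import."
padded = "Great question! " * 40  # over 500 characters, contains no answer

# Proxy metric: the padded response wins despite a thumbs-down
proxy_concise = compute_reward(concise, {"thumbs": "up"})
proxy_padded = compute_reward(padded, {"thumbs": "down"})

# Outcome metric: the concise response wins because it solved the task
better_concise = better_reward(task_completed=True, user_rating=5.0)
better_padded = better_reward(task_completed=False, user_rating=1.0)

print(f"proxy:  concise={proxy_concise:.2f}  padded={proxy_padded:.2f}")
print(f"better: concise={better_concise:.2f}  padded={better_padded:.2f}")
```

Under the proxy, the ranking is inverted: padding alone saturates the length term, which carries 80% of the weight. The outcome-based metric is not immune to gaming either (ratings can be influenced), but it ties the reward much more closely to the true goal.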
