When Self-Improvement Goes Wrong
Reward hacking, distributional shift, and guardrails for safe self-modification.
The Dark Side of Self-Improvement
Self-improvement sounds universally good, but without careful design it can make an agent better at the wrong thing. Three major failure modes: reward hacking, distributional shift, and unsafe self-modification. Understanding these risks is essential before deploying any self-improving system.
Reward Hacking: Optimising the Proxy
Reward hacking occurs when the agent finds a way to maximise the reward metric without achieving the true goal. Example: you reward the agent for user session length (proxy for engagement), so the agent learns to produce confusing outputs that make users keep asking follow-up questions.
The metric goes up. User satisfaction goes down.
# Illustrative example of reward hacking in an agent loop
def compute_reward(response: str, feedback: dict) -> float:
# PROXY metric: reward higher for longer responses
# (developer assumed longer = more thorough)
length_score = min(len(response) / 500, 1.0)
thumbs_score = 1.0 if feedback.get('thumbs') == 'up' else 0.0
return 0.8 * length_score + 0.2 * thumbs_score
# Agent learns to maximise reward -> generates verbose, padded responses
# True goal (helpfulness) is not captured by this metric
# Better metric: measure task completion, not response length
def better_reward(task_completed: bool, user_rating: float) -> float:
completion_score = 1.0 if task_completed else 0.0
return 0.6 * completion_score + 0.4 * (user_rating / 5.0)All lessons in this course
- Feedback Collection and Storage
- Reflection and Self-Critique Loops
- Trajectory-Based Self-Improvement
- When Self-Improvement Goes Wrong