Harmlessness vs Helpfulness Tension
Navigating over-refusal: prompts that balance safety with utility.
The Core Tension
Every AI safety system faces a fundamental tension: making a model safer by refusing more often also makes it less helpful to legitimate users.
A model that refuses to discuss anything medical will never give dangerous health advice — but it also won't help nurses, doctors, or patients with genuinely useful information. Calibration is the goal: refuse when you should, help when you should.
Over-Refusal: The Real Problem
Over-refusal occurs when a model declines to answer benign requests because they superficially resemble harmful ones. Examples:
- Refusing to explain how diseases spread (sounds dangerous, is actually public health education)
- Refusing to write a villain character (fiction, not endorsement)
- Refusing to explain lock-picking for a locksmith apprentice
Over-refusal is not just annoying — it makes the model untrustworthy and pushes users to less safe alternatives.
All lessons in this course
- CAI Principles and Critique Prompts
- Self-Critique and Revision Patterns
- Harmlessness vs Helpfulness Tension
- Implementing CAI in Applications