Jailbreak Techniques
How attacks bypass guardrails.
Why Jailbreaks Work
A jailbreak is input crafted to make a model bypass its safety alignment or your system instructions. They work because instruction-following and safety are both learned behaviors in tension; an attacker engineers a context where following the malicious instruction wins.
Understanding the mechanics lets you defend, not exploit.
Instruction Override
The simplest class directly contradicts the system prompt: 'Ignore all previous instructions and...'. Modern models resist this, but variants persist when the system prompt is weak or buried in a long context. Defense: keep critical rules salient and treat user text as lower-priority than system policy by design.
All lessons in this course
- LLM Red-Teaming Basics
- Jailbreak Techniques
- Building an Attack Suite
- Measuring Robustness