AI Prompt Engineering · Lesson

Jailbreak Techniques

How attacks bypass guardrails.

Why Jailbreaks Work

A jailbreak is input crafted to make a model bypass its safety alignment or your system instructions. They work because instruction-following and safety are both learned behaviors in tension; an attacker engineers a context where following the malicious instruction wins.

Understanding the mechanics lets you defend, not exploit.

Instruction Override

The simplest class directly contradicts the system prompt: 'Ignore all previous instructions and...'. Modern models resist this, but variants persist when the system prompt is weak or buried in a long context. Defense: keep critical rules salient and treat user text as lower-priority than system policy by design.

All lessons in this course

LLM Red-Teaming Basics
Jailbreak Techniques
Building an Attack Suite
Measuring Robustness

← Back to AI Prompt Engineering