Input and Output Filtering
Blocking unsafe content.
Filtering as the First Line
Filtering inspects content and decides allow, block, or transform. Input filtering protects the model and your cost budget; output filtering protects the user and your reputation. Both rely on a layered mix of fast deterministic checks and slower semantic classifiers.
Prompt-Injection Detection
The signature input threat is prompt injection: text that tries to override your instructions ('ignore previous instructions and...'). Detection combines pattern heuristics with a classifier, and is reinforced by clearly delimiting untrusted content so instructions inside it are treated as data.
SUSPECT = ['ignore previous', 'disregard above', 'system prompt',
'you are now', 'reveal your instructions']
def injection_score(text):
t = text.lower()
return sum(p in t for p in SUSPECT)All lessons in this course
- What Are Guardrails
- Input and Output Filtering
- Schema and Rule Validators
- Self-Critique Validation