Output Filtering (Llama Guard, NeMo)
Run a smaller guard model over outputs to catch toxicity, PII leaks, and policy violations before they ship.
Why Filter Outputs?
Even with safe inputs, models can output:
- Personal data leaks
- Hate speech / harassment
- Self-harm content
- Tool calls that violate user intent
An output filter is your last line of defense before the user sees anything.
Llama Guard
Meta's safety classifier — open weights, very fast:
from transformers import AutoTokenizer, AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-Guard-3-8B')
# Outputs 'safe' or 'unsafe' with category.All lessons in this course
- Prompt Injection Defences
- Output Filtering (Llama Guard, NeMo)
- Sandbox Execution for Code Agents
- Access Control on Tools