Output Filtering (Llama Guard, NeMo)

Run a smaller guard model over outputs to catch toxicity, PII leaks, and policy violations before they ship.

Why Filter Outputs?

Even with safe inputs, models can output:

Personal data leaks
Hate speech / harassment
Self-harm content
Tool calls that violate user intent

An output filter is your last line of defense before the user sees anything.
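One way to picture the filter is as a gate that sits between the model and the user. The sketch below is illustrative only: `Verdict`, `email_check`, `filter_output`, and the fallback message are hypothetical names invented for this example, and the regex email check is a stand-in for a real guard model, not a complete PII detector.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Verdict:
    allowed: bool
    reason: Optional[str] = None


# Hypothetical stand-in check: treats any email address as a PII leak.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def email_check(text: str) -> Verdict:
    if EMAIL_RE.search(text):
        return Verdict(False, "pii:email")
    return Verdict(True)


def filter_output(
    text: str,
    checks: List[Callable[[str], Verdict]],
    fallback: str = "Sorry, I can't share that response.",
) -> Tuple[str, Verdict]:
    """Run every check on the model output; return the fallback on the first failure."""
    for check in checks:
        verdict = check(text)
        if not verdict.allowed:
            return fallback, verdict
    return text, Verdict(True)
```

Because each check is just a callable, a guard-model call can be added to the `checks` list next to cheap pattern checks without changing the gate itself.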

Llama Guard

Meta's safety classifier, released with open weights and small enough to run as a separate check next to your main model:

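from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = 'meta-llama/Llama-Guard-3-8B'  # gated repo: accept Meta's license on Hugging Face first
tokenizer = AutoTokenizer.from_pretrained(model_id)  # needed to format the conversation for the guard
model = AutoModelForCausalLM.from_pretrained(model_id)
# The model generates 'safe', or 'unsafe' followed by a category code such as S1.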
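The guard's verdict comes back as generated text, so your application has to turn it into a decision. Here is a minimal sketch of that step. It assumes the text is either `safe`, or `unsafe` followed by a line of comma-separated category codes. `parse_guard_verdict` is a hypothetical helper name, not part of any library, and treating unrecognised output as unsafe is a design choice made for this example.

```python
from typing import List, Tuple


def parse_guard_verdict(raw: str) -> Tuple[bool, List[str]]:
    """Return (is_safe, categories) from the guard model's generated text."""
    lines = [ln.strip() for ln in raw.strip().splitlines() if ln.strip()]
    if not lines:
        # Empty output: fail closed and treat it as unsafe.
        return False, []
    if lines[0].lower() == "safe":
        return True, []
    # Anything other than an explicit 'safe' is treated as unsafe.
    categories: List[str] = []
    if len(lines) > 1:
        categories = [c.strip() for c in lines[1].split(",") if c.strip()]
    return False, categories
```

Failing closed means a truncated or malformed generation blocks the response instead of letting it through unchecked.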
