Output Filtering (Llama Guard, NeMo)

Run a smaller guard model over outputs to catch toxicity, PII leaks, and policy violations before they ship.

Why Filter Outputs?

Even with safe inputs, models can output:

  • Personal data leaks
  • Hate speech / harassment
  • Self-harm content
  • Tool calls that violate user intent

An output filter is your last line of defense before the user sees anything.
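As a rough illustration of where such a filter sits, the sketch below runs a list of checks over the model's text and returns a fallback message if any check fires. All names here (`Verdict`, `check_email_leak`, `filter_output`, `deliver`) are hypothetical, and the e-mail regex is only a stand-in for a real PII or guard-model check.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

# A check takes the model output and returns a list of reasons to block it.
# An empty list means the check passed.
Check = Callable[[str], List[str]]


@dataclass
class Verdict:
    allowed: bool
    reasons: List[str]


# Deliberately simple pattern; a production PII detector would be broader.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def check_email_leak(text: str) -> List[str]:
    """Flag outputs that contain something shaped like an e-mail address."""
    return ["pii:email"] if EMAIL_RE.search(text) else []


def filter_output(text: str, checks: List[Check]) -> Verdict:
    """Run every check and collect all reasons, so logs show each violation."""
    reasons = [reason for check in checks for reason in check(text)]
    return Verdict(allowed=not reasons, reasons=reasons)


def deliver(text: str, checks: List[Check],
            fallback: str = "Sorry, I can't share that response.") -> str:
    """Return the model output only if every check passed."""
    verdict = filter_output(text, checks)
    return text if verdict.allowed else fallback


if __name__ == "__main__":
    print(deliver("The meeting is at 3pm.", [check_email_leak]))
    print(deliver("Contact bob@example.com for access.", [check_email_leak]))
```

A guard model such as Llama Guard could be plugged in as one more `Check` alongside cheap pattern-based checks like this one.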

Llama Guard

Meta's safety classifier, released with open weights. It is a Llama model fine-tuned to label a conversation as safe or unsafe:

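from transformers import AutoTokenizer, AutoModelForCausalLM
model_id = 'meta-llama/Llama-Guard-3-8B'  # gated repo: accept Meta's license on Hugging Face first
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
# The model generates 'safe', or 'unsafe' followed by the violated category codes.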
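Because the verdict comes back as generated text, it has to be parsed before the application can act on it. The sketch below is one way to do that. It assumes the decoded output is either `safe`, or `unsafe` followed by a line of comma-separated category codes such as `S1,S10`. `GuardResult` and `parse_guard_output` are hypothetical names, and anything the parser does not recognize is treated as unsafe.

```python
from typing import List, NamedTuple


class GuardResult(NamedTuple):
    safe: bool
    categories: List[str]


def parse_guard_output(raw: str) -> GuardResult:
    """Parse a guard model's decoded text into a structured verdict.

    Fails closed: empty or unrecognized output counts as unsafe.
    """
    # Drop surrounding whitespace and blank lines the model may emit.
    lines = [ln.strip() for ln in raw.strip().splitlines() if ln.strip()]
    if not lines:
        return GuardResult(safe=False, categories=[])

    label = lines[0].lower()
    if label == "safe":
        return GuardResult(safe=True, categories=[])
    if label == "unsafe":
        # The second line, when present, lists the violated category codes.
        codes = lines[1].split(",") if len(lines) > 1 else []
        return GuardResult(safe=False,
                           categories=[c.strip() for c in codes if c.strip()])

    # Unknown label: do not guess, block the output.
    return GuardResult(safe=False, categories=[])


if __name__ == "__main__":
    print(parse_guard_output("\n\nsafe"))
    print(parse_guard_output("unsafe\nS1,S10"))
```

Failing closed is a design choice: a truncated or malformed verdict blocks the response rather than letting it through unchecked.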
