Why Testing Agents Is Different
Non-determinism, LLM cost, and why standard unit tests fall short.
Testing Software vs. Testing Agents
Traditional software is deterministic: give it the same input, get the same output. Unit tests rely on this property to assert exact expected values.
AI agents break this assumption. The same prompt can produce different outputs each run, making standard testing approaches insufficient on their own.
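The contrast can be sketched without calling a model at all. In the minimal example below, `add` is an ordinary deterministic function, while `fake_agent` is a hypothetical stand-in for an agent that uses `random.choice` to imitate sampling; neither name comes from a real library. An exact-value assertion suits the first, whereas the second can only be checked against a property of its output.

```python
import random

PLANETS = {
    "Mercury", "Venus", "Earth", "Mars",
    "Jupiter", "Saturn", "Uranus", "Neptune",
}


def add(a, b):
    """Deterministic: the same inputs always give the same result."""
    return a + b


def fake_agent(prompt):
    """Hypothetical stand-in for an LLM-backed agent (not a real API call).

    It ignores the prompt and picks a planet at random to imitate sampling.
    """
    return random.choice(sorted(PLANETS))


# Deterministic code: an exact-value assertion is reliable.
assert add(2, 3) == 5

# Non-deterministic code: asserting one exact string would fail on some runs,
# so the check targets a property that every acceptable answer shares.
answer = fake_agent("Name a planet.")
assert answer in PLANETS
```

Checking membership in a set of acceptable answers is only one possible property; which property is appropriate depends on what the agent is supposed to do.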
Non-Determinism: Same Input, Different Output
LLMs are probabilistic by nature. The temperature parameter controls how much randomness is used when sampling: higher values produce more varied outputs. Even at temperature=0, outputs can vary across model versions or infrastructure changes.
This means an agent test that passes today may fail tomorrow with no code change.
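The flakiness described above can be made visible offline by running the same exact-match test many times. The sketch below is an illustration only: `simulated_model`, `exact_match_test`, and `run_repeatedly` are hypothetical helpers, and the random stub merely imitates a sampled reply instead of calling a real model.

```python
import random
from collections import Counter


def simulated_model(prompt, rng):
    """Hypothetical stub imitating a sampled LLM reply; not a real API call."""
    return rng.choice(["Mars", "Jupiter", "Saturn"])


def exact_match_test(rng):
    """A brittle test: it passes only when the reply is exactly 'Mars'."""
    return simulated_model("Name a planet.", rng) == "Mars"


def run_repeatedly(test, runs, seed=0):
    """Run the same test `runs` times and count passes (True) and failures (False)."""
    rng = random.Random(seed)  # seeded so the experiment itself is repeatable
    return Counter(test(rng) for _ in range(runs))


results = run_repeatedly(exact_match_test, runs=100)
print(f"passed: {results[True]}, failed: {results[False]}")
```

The test code never changes between runs, yet some runs pass and others fail. Counting outcomes over many runs gives a pass rate, which is often more informative than a single pass/fail verdict for non-deterministic code.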
import openai

client = openai.OpenAI(api_key='YOUR_API_KEY')  # placeholder; substitute your own key

# Same prompt, potentially different outputs each run
for i in range(3):
    response = client.chat.completions.create(
        model='gpt-4o-mini',
        messages=[{'role': 'user', 'content': 'Name a planet.'}],
        temperature=0.9,  # High randomness
    )
    print(f'Run {i+1}: {response.choices[0].message.content}')

# Example output (will vary between runs):
# Run 1: Mars
# Run 2: Jupiter
# Run 3: Saturn

All lessons in this course
- Why Testing Agents Is Different
- Mocking LLM Calls in Tests
- Assertion-Based Agent Testing
- Integration Tests for Agent Pipelines