Root Cause Analysis for Agent Failures
Systematic failure taxonomy: model error, tool error, data error, logic error.
Failure Taxonomy for Agents
Agent failures fall into four categories:
- Model error: The LLM calls the wrong tool, passes incorrect arguments, or generates malformed output
- Tool error: An external API fails or returns unexpected data
- Data error: Bad input (malformed, missing fields, unexpected types)
- Logic error: Correct steps but in the wrong sequence or with wrong assumptions
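One way to make the taxonomy actionable is to tag every failure with a category when it is caught. The sketch below is a minimal illustration, not a prescribed implementation: the `FailureCategory` enum, the two exception classes, the `classify_failure` helper, and the `stage` labels are all hypothetical names introduced here for the example.

```python
import json
from enum import Enum


class FailureCategory(Enum):
    MODEL = "model_error"
    TOOL = "tool_error"
    DATA = "data_error"
    LOGIC = "logic_error"


class ToolCallError(Exception):
    """Hypothetical: an external API failed or returned unexpected data."""


class InputValidationError(Exception):
    """Hypothetical: the agent's input was malformed or incomplete."""


def classify_failure(exc: Exception, stage: str) -> FailureCategory:
    """Map an exception and the stage where it occurred to a category.

    `stage` is an assumed label: 'input', 'model_output', 'tool_call', or 'plan'.
    """
    if isinstance(exc, InputValidationError) or stage == "input":
        return FailureCategory.DATA
    if isinstance(exc, ToolCallError) or stage == "tool_call":
        return FailureCategory.TOOL
    if isinstance(exc, json.JSONDecodeError) or stage == "model_output":
        return FailureCategory.MODEL
    # Nothing failed at the input, tool, or model level, so the steps
    # themselves were sequenced or reasoned about incorrectly.
    return FailureCategory.LOGIC


print(classify_failure(ToolCallError("503 from API"), "tool_call"))
```

Tagging failures this way lets you count them per category over many runs, which shows whether to spend effort on prompts, tool reliability, input validation, or the agent's control flow.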
Model Errors: Wrong Tool Call
Model errors happen when the LLM selects the wrong tool, passes incorrect arguments, or generates malformed JSON. These are often caused by unclear tool descriptions or ambiguous prompts.
import json

import openai

# The key is elided here; in practice, load it from an environment variable.
client = openai.OpenAI(api_key='sk-...')

# Required arguments for each tool the agent exposes
EXPECTED_TOOLS = {
    'search_web': ['query'],
    'send_email': ['to', 'subject', 'body'],
    'create_task': ['title'],
}

def detect_model_errors(response) -> list:
    """Return the model errors found in a chat completion's tool calls."""
    errors = []
    message = response.choices[0].message
    if message.tool_calls:
        for tc in message.tool_calls:
            tool_name = tc.function.name
            # Arguments arrive as a JSON string; a parse failure is a model error
            try:
                args = json.loads(tc.function.arguments)
            except json.JSONDecodeError as e:
                errors.append({
                    'type': 'model_error',
                    'subtype': 'malformed_tool_args',
                    'tool': tool_name,
                    'raw_args': tc.function.arguments,
                    'parse_error': str(e)
                })
                continue
            # Validate required arguments
            required = EXPECTED_TOOLS.get(tool_name, [])
            missing = [r for r in required if r not in args]
            if missing:
                errors.append({
                    'type': 'model_error',
                    'subtype': 'missing_required_args',
                    'tool': tool_name,
                    'missing': missing
                })
    return errors

print('Model error detection function defined')
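The function above covers malformed and missing arguments. The other model error named earlier, selecting the wrong tool, is harder to detect automatically, but one narrow case can be checked cheaply: the model calling a tool that is not registered at all. The following is a minimal sketch under that assumption. `KNOWN_TOOLS`, `detect_unknown_tools`, and `fake_response` are hypothetical names, and the stand-in response only mimics the attribute shape (`choices[0].message.tool_calls`) used in the example above, so no API call is made.

```python
from types import SimpleNamespace

# Hypothetical registry; mirrors the tool names used in the example above.
KNOWN_TOOLS = {"search_web", "send_email", "create_task"}


def detect_unknown_tools(response, known_tools=KNOWN_TOOLS) -> list:
    """Flag tool calls whose name is not in the set of registered tools."""
    errors = []
    message = response.choices[0].message
    # tool_calls may be None when the model answered in plain text
    for tc in message.tool_calls or []:
        name = tc.function.name
        if name not in known_tools:
            errors.append({
                "type": "model_error",
                "subtype": "unknown_tool",
                "tool": name,
            })
    return errors


def fake_response(*tool_names):
    """Build a stand-in with the same attribute shape as a chat completion."""
    calls = [
        SimpleNamespace(function=SimpleNamespace(name=n, arguments="{}"))
        for n in tool_names
    ]
    message = SimpleNamespace(tool_calls=calls)
    return SimpleNamespace(choices=[SimpleNamespace(message=message)])


errors = detect_unknown_tools(fake_response("search_web", "delete_database"))
print(errors)
```

This check will not catch a registered tool used for the wrong purpose. That case usually needs a comparison against the expected tool for a given test input, or a manual review of the trace.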