Claude Architect · Lesson

Local Recovery vs Escalation

Retry transient faults; escalate the non-recoverable.

Two Ways a Step Can Fail

Inside an agentic system, a single tool call or subagent step can fail for very different reasons. The architect's job is to classify the failure before reacting.

Transient fault — a momentary, self-correcting problem: a network blip, a rate limit, a brief timeout. Retrying the exact same call may just work.
Non-recoverable failure — a structural problem: invalid credentials, a missing permission, a malformed request, or a business rule violation. Retrying changes nothing.

The core rule of this lesson: recover transient faults locally, escalate the non-recoverable.

Local Recovery: Keep It in the Subagent

In a hub-and-spoke multi-agent system, the coordinator delegates work to subagents. When a subagent hits a transient fault, it should try to fix it where it happened — without bubbling noise up to the coordinator.

This keeps the coordinator focused on orchestration instead of low-level retries, and it preserves the coordinator's context budget. Local recovery is the first line of defense.

def run_tool_with_local_recovery(tool, args, max_attempts=3):
    for attempt in range(max_attempts):
        result = tool(**args)
        if not result.get("isError"):
            return result
        # Only retry faults the result says are retryable
        if result.get("isRetryable") and result.get("errorCategory") == "transient":
            continue
        break  # validation / business / permission -> stop, escalate
    return result  # hand the structured error upward

All lessons in this course

Clear Escalation Triggers
Anti-Pattern: Sentiment & Confidence Scores
Structured Error Context
Local Recovery vs Escalation

← Back to Claude Architect