Local Recovery vs Escalation
Retry transient faults; escalate the non-recoverable.
Two Ways a Step Can Fail
Inside an agentic system, a single tool call or subagent step can fail for very different reasons. The architect's job is to classify the failure before reacting.
- Transient fault — a momentary, self-correcting problem: a network blip, a rate limit, a brief timeout. Retrying the exact same call may just work.
- Non-recoverable failure — a structural problem: invalid credentials, a missing permission, a malformed request, or a business rule violation. Retrying changes nothing.
The core rule of this lesson: recover transient faults locally, escalate the non-recoverable.
Local Recovery: Keep It in the Subagent
In a hub-and-spoke multi-agent system, the coordinator delegates work to subagents. When a subagent hits a transient fault, it should try to fix it where it happened — without bubbling noise up to the coordinator.
This keeps the coordinator focused on orchestration instead of low-level retries, and it preserves the coordinator's context budget. Local recovery is the first line of defense.
def run_tool_with_local_recovery(tool, args, max_attempts=3):
for attempt in range(max_attempts):
result = tool(**args)
if not result.get("isError"):
return result
# Only retry faults the result says are retryable
if result.get("isRetryable") and result.get("errorCategory") == "transient":
continue
break # validation / business / permission -> stop, escalate
return result # hand the structured error upwardAll lessons in this course
- Clear Escalation Triggers
- Anti-Pattern: Sentiment & Confidence Scores
- Structured Error Context
- Local Recovery vs Escalation