Checkpointing and Task Resumption
Persist agent state at each completed step so a long-running task can be resumed from the last successful checkpoint rather than restarted from scratch after a failure.
The Problem with Long-Running Agents
An agent executing a 50-step research task might run for 30 minutes. If it fails at step 47 due to an API timeout or server restart, restarting from scratch wastes all the work done and costs tokens. Checkpointing persists the agent's state at each completed step so the task can resume from the last successful point rather than the beginning. This is essential for any agent task longer than a few minutes.
What Agent State to Persist
An agent's state consists of: the task definition, the completed steps with their tool calls and observations, the current step index, any accumulated results (files written, data collected), and metadata like start time and total token usage. Persist all of this after each step completes. State must be serializable — prefer JSON over Python objects for portability.
from dataclasses import dataclass, field
from typing import List, Any, Optional
@dataclass
class AgentStep:
step_index: int
thought: str
tool_name: str
tool_args: dict
observation: str
tokens_used: int
completed_at: str
@dataclass
class AgentCheckpoint:
task_id: str
task_description: str
status: str # 'running', 'completed', 'failed'
current_step: int
completed_steps: List[AgentStep] = field(default_factory=list)
accumulated_results: dict = field(default_factory=dict)
total_tokens: int = 0
final_answer: Optional[str] = NoneAll lessons in this course
- Classifying Agent Failure Modes
- Self-Correction and Reflective Prompting
- Checkpointing and Task Resumption
- Human-in-the-Loop Escalation