AI Engineering Academy · Lesson

Checkpointing and Task Resumption

Persist agent state at each completed step so a long-running task can be resumed from the last successful checkpoint rather than restarted from scratch after a failure.

The Problem with Long-Running Agents

An agent executing a 50-step research task might run for 30 minutes. If it fails at step 47 because of an API timeout or a server restart, starting over discards the 46 completed steps and pays again for the tokens they consumed. Checkpointing persists the agent's state after each completed step so the task can resume from the last successful step instead of the beginning. This matters for any agent task that runs longer than a few minutes.
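The resume behavior described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name run_with_checkpoints, the checkpoint file layout (next_step, results), and the idea of modeling each step as a zero-argument callable are all assumptions made for the example.

```python
import json
import os

def run_with_checkpoints(steps, path):
    """Run callables in order, skipping steps already recorded at `path`.

    `steps` is a list of zero-argument callables returning JSON-serializable
    values (a hypothetical stand-in for real agent steps).
    """
    # Load prior progress if a checkpoint exists; otherwise start at step 0.
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            state = json.load(f)
    else:
        state = {"next_step": 0, "results": []}

    for i in range(state["next_step"], len(steps)):
        # If this raises, the checkpoint on disk still reflects step i - 1.
        result = steps[i]()
        state["results"].append(result)
        state["next_step"] = i + 1

        # Write to a temporary file and rename it into place, so a crash
        # in the middle of the write cannot leave a truncated checkpoint.
        tmp_path = path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as f:
            json.dump(state, f)
        os.replace(tmp_path, path)

    return state["results"]
```

Calling the function again with the same path after a failure re-runs only the step that failed and the steps after it. Steps with side effects (sending email, writing to a database) would still need to be safe to repeat, since a crash can occur after the side effect but before the checkpoint is written.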

What Agent State to Persist

An agent's state consists of the task definition, the completed steps with their tool calls and observations, the current step index, any accumulated results (files written, data collected), and metadata such as start time and total token usage. Persist all of this after each step completes. State must be serializable; prefer JSON over serialized Python objects for portability.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class AgentStep:
    """One completed step: what the agent decided, called, and observed."""
    step_index: int
    thought: str
    tool_name: str
    tool_args: dict      # arguments passed to the tool; keep JSON-serializable
    observation: str     # tool output the agent saw
    tokens_used: int
    completed_at: str    # timestamp stored as a string

@dataclass
class AgentCheckpoint:
    """Everything needed to resume a task from its last completed step."""
    task_id: str
    task_description: str
    status: str  # 'running', 'completed', 'failed'
    current_step: int
    completed_steps: List[AgentStep] = field(default_factory=list)
    accumulated_results: dict = field(default_factory=dict)
    total_tokens: int = 0
    final_answer: Optional[str] = None
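
One way to persist these dataclasses as JSON is sketched below. The helper names save_checkpoint and load_checkpoint and the one-file-per-task layout are assumptions for illustration; the lesson does not prescribe a storage backend. The dataclass definitions are repeated so the block runs on its own.

```python
import json
import os
from dataclasses import asdict, dataclass, field
from typing import List, Optional

# Same shapes as the dataclasses above, repeated so this sketch is self-contained.
@dataclass
class AgentStep:
    step_index: int
    thought: str
    tool_name: str
    tool_args: dict
    observation: str
    tokens_used: int
    completed_at: str

@dataclass
class AgentCheckpoint:
    task_id: str
    task_description: str
    status: str
    current_step: int
    completed_steps: List[AgentStep] = field(default_factory=list)
    accumulated_results: dict = field(default_factory=dict)
    total_tokens: int = 0
    final_answer: Optional[str] = None

def save_checkpoint(checkpoint: AgentCheckpoint, path: str) -> None:
    """Write the checkpoint as JSON (hypothetical helper)."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        # asdict() recursively converts nested AgentStep objects to dicts.
        json.dump(asdict(checkpoint), f)
    # Rename into place so readers never see a half-written file.
    os.replace(tmp_path, path)

def load_checkpoint(path: str) -> AgentCheckpoint:
    """Read a checkpoint written by save_checkpoint (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        raw = json.load(f)
    # JSON has no notion of dataclasses, so rebuild the nested steps by hand.
    steps = [AgentStep(**step) for step in raw.pop("completed_steps")]
    return AgentCheckpoint(completed_steps=steps, **raw)
```

On restart, an agent loop could call load_checkpoint and continue from current_step. This only works if tool_args and accumulated_results hold JSON-serializable values; anything else (open file handles, custom objects) would need to be converted before saving.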
