Building Multi-Agent Reasoning Systems with Python: A Deep Dive into Collaborative LLM Architectures
Multi-agent systems have moved from academic papers to production code faster than most trends in AI. The core insight is simple but powerful: instead of asking one LLM to do everything, decompose complex tasks into specialized roles that collaborate — each with focused context, distinct prompts, and clear handoff protocols.
Recent work (TSQAgent, Code-on-Graph, NovelAPIBench) suggests that role-specialized agents can outperform monolithic prompts. This tutorial shows you how to build a production-grade multi-agent reasoning system from scratch in Python.
Why Multi-Agent? The Limits of Single-Prompt Reasoning
A single LLM prompt trying to plan, execute, verify, and critique its own output suffers from several well-documented problems:
- Context pollution: As the conversation grows, signal-to-noise ratio drops. Critical instructions get diluted.
- Self-verification blind spots: An LLM rarely catches its own errors when reviewing its own reasoning chain.
- Prompt bloat: Packing system prompts for planning, execution, formatting, and validation creates a fragile mega-prompt.
- No parallelization: Everything runs sequentially in one context window.
Multi-agent architectures solve these by giving each agent a narrow, well-defined responsibility and its own context window.
Architecture: The Perceiver–Executor–Reviewer Pattern
The pattern we implement today uses three specialized roles:
- Perceiver (Planner): Analyzes the input, identifies required sub-tasks, and creates an execution plan.
- Executor (Worker): Carries out each sub-task using appropriate tools or API calls.
- Reviewer (Critic): Validates outputs, flags errors, and triggers re-execution when needed.
This mirrors the architecture in recent papers where agentic systems use Perceiver, Inspector, and Adjudicator roles for quality assessment tasks.
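To make the handoffs concrete before any LLM is involved, here is a minimal sketch of the three-role loop that uses plain Python functions as stand-ins for the agents. The functions `plan`, `execute`, `review`, and `run_pipeline`, along with their toy logic, are hypothetical placeholders and not part of the implementation that follows. Only the control flow (plan, execute, review, retry on rejection) is the point.

```python
def plan(problem: str) -> list:
    # Stand-in Perceiver: one sub-task per sentence of the problem
    return [part.strip() for part in problem.split(".") if part.strip()]


def execute(task: str, feedback: str = "") -> str:
    # Stand-in Executor: a real agent would call an LLM or a tool here
    suffix = " (revised)" if feedback else ""
    return f"result for '{task}'{suffix}"


def review(task: str, result: str) -> dict:
    # Stand-in Reviewer: reject the first draft of any task that mentions "check"
    if "check" in task.lower() and "(revised)" not in result:
        return {"status": "rejected", "suggestion": "show your evidence"}
    return {"status": "approved"}


def run_pipeline(problem: str, max_retries: int = 2) -> list:
    outcomes = []
    for task in plan(problem):
        result = execute(task)
        verdict = review(task, result)
        retries = 0
        # Hand reviewer feedback back to the executor until approved or out of budget
        while verdict["status"] == "rejected" and retries < max_retries:
            result = execute(task, feedback=verdict["suggestion"])
            verdict = review(task, result)
            retries += 1
        outcomes.append({"task": task, "result": result,
                         "status": verdict["status"], "retries": retries})
    return outcomes


outcomes = run_pipeline("Load the data. Check for outliers")
for outcome in outcomes:
    print(outcome)
```

Swapping each stand-in for an LLM-backed agent gives the shape of the system built in the steps below.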
Implementation
Step 1: Define the Agent Base Class
from dataclasses import dataclass, field
from typing import Optional

import openai


@dataclass
class AgentMessage:
    role: str  # "perceiver", "executor", or "reviewer"
    content: str
    metadata: dict = field(default_factory=dict)


@dataclass
class AgentState:
    messages: list = field(default_factory=list)
    task_plan: Optional[list] = None
    results: list = field(default_factory=list)
    review_flags: list = field(default_factory=list)
    iterations: int = 0
    max_iterations: int = 3  # retry budget for a rejected task


class Agent:
    def __init__(self, name: str, role: str, system_prompt: str, model: str = "gpt-4"):
        self.name = name
        self.role = role
        self.system_prompt = system_prompt
        self.model = model
        # The client reads the API key from the OPENAI_API_KEY environment variable
        self.client = openai.OpenAI()

    def act(self, context: str, history: Optional[list] = None) -> str:
        """Send the system prompt, optional history, and context; return the reply text."""
        messages = [{"role": "system", "content": self.system_prompt}]
        if history:
            messages.extend(history)
        messages.append({"role": "user", "content": context})
        response = self.client.chat.completions.create(
            model=self.model,
            messages=messages,
            # The reviewer runs at a lower temperature for more consistent verdicts
            temperature=0.3 if self.role == "reviewer" else 0.7,
        )
        # message.content can be None, so fall back to an empty string
        return response.choices[0].message.content or ""
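The `history` parameter is what lets an agent carry earlier turns into a call. As an illustration, the message assembly inside `act` can be pulled out into a pure function and inspected without touching the API. `build_messages` is a hypothetical helper name, not part of the class above, and the sample turns are made up.

```python
from typing import Optional


def build_messages(system_prompt: str, context: str,
                   history: Optional[list] = None) -> list:
    """Assemble a chat payload: system prompt, prior turns, then the new user turn."""
    messages = [{"role": "system", "content": system_prompt}]
    if history:
        messages.extend(history)
    messages.append({"role": "user", "content": context})
    return messages


# Illustrative prior turns in the chat-message format
history = [
    {"role": "user", "content": "Task: list the sensors"},
    {"role": "assistant", "content": "Sensors: A, B, C"},
]
payload = build_messages("You execute tasks precisely.", "Task: check sensor B", history)
for message in payload:
    print(message["role"], "|", message["content"])
```

Because each agent builds its own payload, nothing from another agent's conversation leaks in unless the orchestrator passes it explicitly.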
Step 2: Implement the Three Roles
PERCEIVER_PROMPT = """You are a task decomposition expert. Given a complex problem, break it down into sequential or parallel sub-tasks.
Output a JSON array of tasks. Each task has:
- "id": unique integer
- "description": what needs to be done
- "type": "research", "computation", "synthesis", or "validation"
- "depends_on": list of task IDs that must complete first
Be specific. Tasks should be independently executable.
Return ONLY valid JSON, no markdown formatting."""
EXECUTOR_PROMPT = """You are a specialist that executes tasks precisely. Given a task description, produce a detailed result.
- For research tasks: gather and summarize relevant information
- For computation tasks: show step-by-step calculations
- For synthesis tasks: combine multiple inputs into a coherent output
- For validation tasks: check correctness against specified criteria
Be thorough and cite any assumptions made."""
REVIEWER_PROMPT = """You are a rigorous reviewer. Given a task and its result, evaluate:
1. Completeness: Does the result fully address the task?
2. Correctness: Are there logical errors or factual mistakes?
3. Clarity: Is the result well-structured and understandable?
If the result passes all checks, respond with: {"status": "approved"}
If it fails, respond with: {"status": "rejected", "issues": ["issue 1", "issue 2"], "suggestion": "how to fix"}
Be strict. False positives are better than missed errors."""
perceiver = Agent("Planner", "perceiver", PERCEIVER_PROMPT)
executor = Agent("Worker", "executor", EXECUTOR_PROMPT)
reviewer = Agent("Critic", "reviewer", REVIEWER_PROMPT)
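Since the orchestrator trusts whatever the Planner returns, it can help to check the plan against the schema described in `PERCEIVER_PROMPT` before executing it. The following is a sketch under that assumption. `validate_plan` and `ALLOWED_TYPES` are hypothetical names, and the checks cover only the four fields listed in the prompt.

```python
import json

ALLOWED_TYPES = {"research", "computation", "synthesis", "validation"}
REQUIRED_FIELDS = {"id", "description", "type", "depends_on"}


def validate_plan(plan_json: str) -> list:
    """Parse a Planner reply and return the task list, or raise ValueError."""
    tasks = json.loads(plan_json)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(tasks, list) or not tasks:
        raise ValueError("plan must be a non-empty JSON array")
    ids = set()
    for task in tasks:
        if not isinstance(task, dict):
            raise ValueError("each task must be a JSON object")
        missing = REQUIRED_FIELDS - task.keys()
        if missing:
            raise ValueError(f"task is missing fields: {sorted(missing)}")
        if task["type"] not in ALLOWED_TYPES:
            raise ValueError(f"unknown task type: {task['type']!r}")
        if not isinstance(task["depends_on"], list):
            raise ValueError("depends_on must be a list of task ids")
        if task["id"] in ids:
            raise ValueError(f"duplicate task id: {task['id']}")
        ids.add(task["id"])
    # Second pass: every dependency must point at a task that exists in the plan
    for task in tasks:
        unknown = set(task["depends_on"]) - ids
        if unknown:
            raise ValueError(f"task {task['id']} depends on unknown ids: {sorted(unknown)}")
    return tasks


sample = """[
  {"id": 1, "description": "Count missing values", "type": "computation", "depends_on": []},
  {"id": 2, "description": "Summarize findings", "type": "synthesis", "depends_on": [1]}
]"""
plan = validate_plan(sample)
print([task["id"] for task in plan])
```

A failed validation could be fed back to the Planner as a retry prompt instead of falling straight through to a single-task plan.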
Step 3: The Orchestrator
import json
from concurrent.futures import ThreadPoolExecutor, as_completed


class MultiAgentOrchestrator:
    def __init__(self, perceiver: Agent, executor: Agent, reviewer: Agent):
        self.perceiver = perceiver
        self.executor = executor
        self.reviewer = reviewer
        self.state = AgentState()

    def run(self, problem: str) -> dict:
        # Phase 1: Perception -- decompose the problem
        plan_json = self.perceiver.act(problem)
        try:
            tasks = json.loads(plan_json)
            if not isinstance(tasks, list):
                raise ValueError("plan is not a JSON array")
        except (ValueError, TypeError):  # JSONDecodeError subclasses ValueError
            # Fallback: treat the whole problem as a single task
            tasks = [{"id": 1, "description": problem, "type": "synthesis", "depends_on": []}]
        self.state.task_plan = tasks
        print(f"[Orchestrator] Plan: {len(tasks)} tasks identified")

        # Phase 2: Execution with dependency resolution
        completed = {}
        pending = list(tasks)
        while pending:
            # A task is ready once every task it depends on has completed
            ready = [t for t in pending
                     if all(d in completed for d in t.get("depends_on", []))]
            if not ready:
                # Cyclic or unknown dependencies: stop instead of looping forever
                print(f"[Orchestrator] {len(pending)} tasks skipped (unresolved dependencies)")
                break
            for task in ready:
                self._run_task(task, completed)
            pending = [t for t in pending if t["id"] not in completed]

        # Phase 4: Synthesis
        final = self._synthesize(completed)
        return {"plan": tasks, "results": self.state.results, "final_output": final}

    def _run_task(self, task: dict, completed: dict) -> None:
        # Build context from dependencies
        dep_context = ""
        for dep_id in task.get("depends_on", []):
            dep_context += f"\n--- Dependency Task {dep_id} Result ---\n{completed[dep_id]}\n"
        context = f"Task: {task['description']}\nType: {task.get('type', 'synthesis')}{dep_context}"
        result = self.executor.act(context)

        # Phase 3: Review, retrying with reviewer feedback while the result is rejected
        review_result = self._review_task(task, result)
        for _ in range(self.state.max_iterations):
            if review_result.get("status") != "rejected":
                break
            feedback = review_result.get("suggestion", "Improve the result")
            # Reuse the original context so dependency results survive the retry
            retry_context = (
                f"{context}\n\nPrevious attempt was rejected.\n"
                f"Reviewer feedback: {feedback}\n\nPlease redo the task."
            )
            result = self.executor.act(retry_context)
            review_result = self._review_task(task, result)
            self.state.iterations += 1

        approved = review_result.get("status") == "approved"
        if not approved:
            self.state.review_flags.append({"task_id": task["id"], "review": review_result})
        completed[task["id"]] = result
        self.state.results.append({
            "task_id": task["id"],
            "result": result,
            "review": review_result,
        })
        print(f"[Orchestrator] Task {task['id']} {'approved' if approved else 'completed with flags'}")

    def _review_task(self, task, result) -> dict:
        review_text = self.reviewer.act(
            f"Task: {task['description']}\n\nResult:\n{result}"
        )
        try:
            review = json.loads(review_text)
        except (ValueError, TypeError):
            review = None
        if not isinstance(review, dict):
            # An unparseable review is flagged, not silently treated as approval
            return {"status": "unparsed", "issues": ["reviewer did not return a JSON object"]}
        return review

    def _synthesize(self, completed: dict) -> str:
        all_results = "\n\n".join(
            f"### Task {tid}\n{result}" for tid, result in completed.items()
        )
        return self.executor.act(
            f"Synthesize these task results into a cohesive final answer:\n\n{all_results}"
        )
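Models sometimes wrap JSON in a markdown code fence or add surrounding prose even when told not to, which makes a bare `json.loads` fail and sends the orchestrator into its fallback paths. One way to be more forgiving, sketched here with a hypothetical helper named `extract_json`, is to try the raw reply first and then fall back to the outermost bracketed span. This is a heuristic, not a full parser.

```python
import json
from typing import Any, Optional


def extract_json(reply: str) -> Optional[Any]:
    """Return the first JSON value found in an LLM reply, or None if there is none."""
    text = reply.strip()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Fall back to the outermost {...} or [...] span, whichever starts first
    candidates = []
    for open_ch, close_ch in (("{", "}"), ("[", "]")):
        start, end = text.find(open_ch), text.rfind(close_ch)
        if start != -1 and end > start:
            candidates.append((start, end))
    for start, end in sorted(candidates):
        try:
            return json.loads(text[start:end + 1])
        except json.JSONDecodeError:
            continue
    return None


fence = "`" * 3  # built at runtime so the example contains no literal fence
fenced_reply = fence + 'json\n{"status": "approved"}\n' + fence
print(extract_json(fenced_reply))
print(extract_json('Here is the plan: [{"id": 1}] Hope it helps.'))
print(extract_json("no json here"))
```

A caller would still need to check that the returned value has the expected type (a list for plans, a dict for reviews) before using it.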
Step 4: Using the System
orchestrator = MultiAgentOrchestrator(perceiver, executor, reviewer)
result = orchestrator.run(
    "Analyze the time series data quality of our IoT sensor dataset. "
    "Check for missing values, outliers, seasonal patterns, and sensor drift. "
    "Provide a quality score and remediation recommendations."
)
print(result["final_output"])
Advanced: Parallel Execution with ThreadPoolExecutor
For independent tasks, you can parallelize execution:
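# Add this as a method on MultiAgentOrchestrator
def execute_parallel_tasks(self, tasks: list, context: str = "") -> dict:
    """Run tasks that do not depend on each other concurrently."""
    results = {}
    with ThreadPoolExecutor(max_workers=3) as pool:
        # Map each future back to its task so results can be keyed by task id
        futures = {
            pool.submit(self.executor.act, f"{context}\nTask: {task['description']}"): task
            for task in tasks
        }
        for future in as_completed(futures):
            task = futures[future]
            # future.result() re-raises any exception raised in the worker thread
            results[task["id"]] = future.result()
    return results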
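Parallel execution is only safe for tasks whose dependencies have already finished. One way to find such groups, sketched here on the assumption that tasks follow the Planner's `id` and `depends_on` fields, is to peel the plan into waves: every task in a wave depends only on earlier waves, so a whole wave can go to the thread pool at once. `dependency_waves`, `run_in_waves`, and the stand-in worker `fake_execute` are hypothetical names. A real worker would call the executor agent.

```python
from concurrent.futures import ThreadPoolExecutor


def dependency_waves(tasks: list) -> list:
    """Group tasks into waves; each wave depends only on tasks in earlier waves."""
    done = set()
    pending = list(tasks)
    waves = []
    while pending:
        wave = [t for t in pending if set(t.get("depends_on", [])) <= done]
        if not wave:
            raise ValueError("cyclic or unknown dependencies in plan")
        waves.append(wave)
        done.update(t["id"] for t in wave)
        pending = [t for t in pending if t["id"] not in done]
    return waves


def fake_execute(task: dict) -> str:
    # Stand-in for an LLM call
    return f"done: {task['description']}"


def run_in_waves(tasks: list, max_workers: int = 3) -> dict:
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for wave in dependency_waves(tasks):
            # pool.map preserves input order, so zip pairs each task with its output
            for task, output in zip(wave, pool.map(fake_execute, wave)):
                results[task["id"]] = output
    return results


tasks = [
    {"id": 1, "description": "missing values", "depends_on": []},
    {"id": 2, "description": "outliers", "depends_on": []},
    {"id": 3, "description": "quality score", "depends_on": [1, 2]},
]
waves = dependency_waves(tasks)
print([[t["id"] for t in wave] for wave in waves])
print(run_in_waves(tasks))
```

Threads suit this workload because the time is spent waiting on network calls, not on Python computation.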
Key Design Principles
- Role isolation: Each agent has a single responsibility. The Planner never executes; the Critic never generates.
- Structured communication: Agents communicate through JSON schemas and explicit handoffs, not free-form chat.
- Iterative refinement: The review loop allows correction without restarting the entire pipeline.
- Dependency-aware scheduling: A task runs only after every task it depends on has completed, and it receives their results as context.
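For the dependency-aware scheduling principle, Python 3.9+ ships `graphlib.TopologicalSorter` in the standard library. It produces an order in which every dependency precedes its dependents, and it raises `graphlib.CycleError` on circular dependencies. The sketch below assumes tasks shaped like the Planner's output. `execution_order` is a hypothetical helper name.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(tasks: list) -> list:
    """Return task ids in an order where every dependency precedes its dependents."""
    # TopologicalSorter expects a mapping of node -> set of predecessors
    graph = {task["id"]: set(task.get("depends_on", [])) for task in tasks}
    return list(TopologicalSorter(graph).static_order())


tasks = [
    {"id": 3, "depends_on": [1, 2]},
    {"id": 1, "depends_on": []},
    {"id": 2, "depends_on": [1]},
]
order = execution_order(tasks)
print(order)

try:
    execution_order([{"id": 1, "depends_on": [2]}, {"id": 2, "depends_on": [1]}])
except CycleError as err:
    # The second element of args holds the nodes that form the cycle
    print("cycle detected:", err.args[1])
```

Detecting a cycle up front lets the orchestrator reject a bad plan before spending any calls on execution.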
When to Use Multi-Agent vs. Single Prompt
Use multi-agent when:
- The task has 3+ distinct phases with different reasoning modes
- You need verifiable outputs with review cycles
- Parallelization provides meaningful speedup
- Different parts of the task benefit from different temperatures or other model settings
Stick with single-prompt when the task is straightforward, latency-sensitive, or the overhead of agent coordination outweighs the quality gains.
What's Next
From here, you can extend the system with tool use (giving agents access to code execution, web search, or database queries), persistent memory between sessions, or dynamic role assignment where the Planner decides how many agents are needed based on task complexity.
The research trajectory points one way: agentic systems are increasingly replacing monolithic prompts in production. The question is less whether to adopt them than how to design them well.