Building Multi-Agent Reasoning Systems with Python: A Deep Dive into Collaborative LLM Architectures

Multi-agent systems have moved quickly from academic papers to production code. The core insight is simple: instead of asking one LLM to do everything, decompose a complex task into specialized roles that collaborate, each with focused context, a distinct prompt, and a clear handoff protocol.

Recent research (TSQAgent, Code-on-Graph, NovelAPIBench) suggests that role-specialized agents often outperform monolithic prompts on complex, multi-step tasks. This tutorial shows you how to build the core of a multi-agent reasoning system from scratch in Python.

Why Multi-Agent? The Limits of Single-Prompt Reasoning

A single LLM prompt trying to plan, execute, verify, and critique its own output suffers from several well-documented problems:

  • Context pollution: As the conversation grows, signal-to-noise ratio drops. Critical instructions get diluted.
  • Self-verification blind spots: An LLM rarely catches its own errors when reviewing its own reasoning chain.
  • Prompt bloat: Packing system prompts for planning, execution, formatting, and validation creates a fragile mega-prompt.
  • No parallelization: Everything runs sequentially in one context window.

Multi-agent architectures address these problems by giving each agent a narrow, well-defined responsibility and its own context window.

Architecture: The Perceiver–Executor–Reviewer Pattern

The pattern we implement today uses three specialized roles:

  • Perceiver (Planner): Analyzes the input, identifies required sub-tasks, and creates an execution plan.
  • Executor (Worker): Carries out each sub-task using appropriate tools or API calls.
  • Reviewer (Critic): Validates outputs, flags errors, and triggers re-execution when needed.

This mirrors the architecture in recent papers where agentic systems use Perceiver, Inspector, and Adjudicator roles for quality assessment tasks.

Implementation

Step 1: Define the Agent Base Class

from dataclasses import dataclass, field
from typing import Optional
import openai
import json

@dataclass
class AgentMessage:
    role: str  # "perceiver", "executor", "reviewer"
    content: str
    metadata: dict = field(default_factory=dict)

@dataclass
class AgentState:
    messages: list = field(default_factory=list)
    task_plan: Optional[list] = None
    results: list = field(default_factory=list)
    review_flags: list = field(default_factory=list)
    iterations: int = 0
    max_iterations: int = 3

class Agent:
    def __init__(self, name: str, role: str, system_prompt: str, model: str = "gpt-4"):
        self.name = name
        self.role = role
        self.system_prompt = system_prompt
        self.model = model
        # OpenAI() reads the API key from the OPENAI_API_KEY environment variable (openai>=1.0)
        self.client = openai.OpenAI()

    def act(self, context: str, history: Optional[list] = None) -> str:
        messages = [{"role": "system", "content": self.system_prompt}]
        if history:
            messages.extend(history)
        messages.append({"role": "user", "content": context})

        response = self.client.chat.completions.create(
            model=self.model,
            messages=messages,
            temperature=0.3 if self.role == "reviewer" else 0.7,
        )
        return response.choices[0].message.content
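
To exercise orchestration logic without network calls or API costs, a stand-in agent can help. The sketch below is an assumption rather than part of the design above: StubAgent and its responses parameter are hypothetical names. It exposes the same act(context, history) signature as Agent, so it can be passed wherever an Agent is expected.

```python
from typing import Optional


class StubAgent:
    """Hypothetical offline stand-in that mimics Agent.act() with canned replies."""

    def __init__(self, name: str, role: str, responses: list):
        if not responses:
            raise ValueError("responses must contain at least one reply")
        self.name = name
        self.role = role
        self._responses = list(responses)  # copy so the caller's list is untouched
        self.calls = []  # every context string passed to act(), in order

    def act(self, context: str, history: Optional[list] = None) -> str:
        self.calls.append(context)
        if len(self._responses) > 1:
            # Hand out replies in order while more than one remains
            return self._responses.pop(0)
        # The final reply is repeated for any further calls
        return self._responses[0]


# A reviewer that rejects the first result and approves everything after it
stub_reviewer = StubAgent(
    "Critic", "reviewer", ['{"status": "rejected"}', '{"status": "approved"}']
)
first = stub_reviewer.act("Task: add 2 and 2\n\nResult:\n5")
second = stub_reviewer.act("Task: add 2 and 2\n\nResult:\n4")
third = stub_reviewer.act("Task: add 2 and 2\n\nResult:\n4")
```

Because the stub records every context it receives in calls, a test can also assert on what each role was actually asked.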

Step 2: Implement the Three Roles

PERCEIVER_PROMPT = """You are a task decomposition expert. Given a complex problem, break it down into sequential or parallel sub-tasks.
Output a JSON array of tasks. Each task has:
- "id": unique integer
- "description": what needs to be done
- "type": "research", "computation", "synthesis", or "validation"
- "depends_on": list of task IDs that must complete first

Be specific. Tasks should be independently executable.
Return ONLY valid JSON, no markdown formatting."""

EXECUTOR_PROMPT = """You are a specialist that executes tasks precisely. Given a task description, produce a detailed result.
- For research tasks: gather and summarize relevant information
- For computation tasks: show step-by-step calculations
- For synthesis tasks: combine multiple inputs into a coherent output
- For validation tasks: check correctness against specified criteria

Be thorough and cite any assumptions made."""

REVIEWER_PROMPT = """You are a rigorous reviewer. Given a task and its result, evaluate:
1. Completeness: Does the result fully address the task?
2. Correctness: Are there logical errors or factual mistakes?
3. Clarity: Is the result well-structured and understandable?

If the result passes all checks, respond with: {"status": "approved"}
If it fails, respond with: {"status": "rejected", "issues": ["issue 1", "issue 2"], "suggestion": "how to fix"}

Be strict. False positives are better than missed errors."""

perceiver = Agent("Planner", "perceiver", PERCEIVER_PROMPT)
executor = Agent("Worker", "executor", EXECUTOR_PROMPT)
reviewer = Agent("Critic", "reviewer", REVIEWER_PROMPT)
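
The Perceiver prompt asks for bare JSON, but a model may still wrap its reply in a Markdown fence or omit a field. One way to guard the handoff is to validate the plan before executing it. This is a minimal sketch under that assumption; parse_plan, REQUIRED_KEYS, and FENCE are hypothetical names, and the checks cover only the four fields listed in the prompt.

```python
import json

REQUIRED_KEYS = {"id", "description", "type", "depends_on"}
FENCE = "`" * 3  # Markdown code-fence marker


def parse_plan(raw: str) -> list:
    """Parse a Perceiver reply into a list of task dicts, or raise ValueError."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line and, if present, the closing fence line
        lines = text.splitlines()[1:]
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]
        text = "\n".join(lines)
    try:
        tasks = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"plan is not valid JSON: {exc}") from exc
    if not isinstance(tasks, list):
        raise ValueError("plan must be a JSON array")

    known_ids = set()
    for task in tasks:
        if not isinstance(task, dict):
            raise ValueError("each task must be a JSON object")
        missing = REQUIRED_KEYS - set(task)
        if missing:
            raise ValueError(f"task is missing fields: {sorted(missing)}")
        known_ids.add(task["id"])

    # Every dependency must point at a task that exists in this plan
    for task in tasks:
        unknown = [d for d in task["depends_on"] if d not in known_ids]
        if unknown:
            raise ValueError(f"task {task['id']} depends on unknown IDs: {unknown}")
    return tasks


sample_reply = (
    '[{"id": 1, "description": "Load data", "type": "research", "depends_on": []},'
    ' {"id": 2, "description": "Summarize", "type": "synthesis", "depends_on": [1]}]'
)
plan = parse_plan(sample_reply)
```

Raising ValueError keeps the decision about what to do with a bad plan (retry the Perceiver, fall back to a single task, or abort) in the caller's hands.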

Step 3: The Orchestrator

import json
from concurrent.futures import ThreadPoolExecutor, as_completed

class MultiAgentOrchestrator:
    def __init__(self, perceiver: Agent, executor: Agent, reviewer: Agent):
        self.perceiver = perceiver
        self.executor = executor
        self.reviewer = reviewer
        self.state = AgentState()

    def run(self, problem: str) -> dict:
        # Phase 1: Perception -- decompose the problem
        plan_json = self.perceiver.act(problem)
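        try:
            tasks = json.loads(plan_json)
            if not isinstance(tasks, list):
                raise ValueError("plan is not a JSON array")
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            # Fallback: treat as single task
            tasks = [{"id": 1, "description": problem, "type": "synthesis", "depends_on": []}]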

        self.state.task_plan = tasks
        print(f"[Orchestrator] Plan: {len(tasks)} tasks identified")

        # Phase 2: Execution with dependency resolution
        completed = {}
        pending = list(tasks)
        while pending:
            # A task is ready once every one of its dependencies has a result
            ready = [t for t in pending
                     if all(d in completed for d in t.get("depends_on", []))]
            if not ready:
                # Cyclic or unknown dependency IDs: stop instead of looping forever
                print(f"[Orchestrator] {len(pending)} task(s) skipped: unresolvable dependencies")
                break

            for task in ready:
                pending.remove(task)

                # Build context from dependencies
                dep_context = ""
                for dep_id in task.get("depends_on", []):
                    dep_context += f"\n--- Dependency Task {dep_id} Result ---\n{completed[dep_id]}\n"

                context = f"Task: {task['description']}\nType: {task.get('type', 'synthesis')}{dep_context}"
                result = self.executor.act(context)

                # Phase 3: Review, retrying with feedback at most max_iterations times
                review_result = self._review_task(task, result)
                for _ in range(self.state.max_iterations):
                    if review_result.get("status") != "rejected":
                        break
                    feedback = review_result.get("suggestion", "Improve the result")
                    # Reuse the full context so dependency results survive the retry
                    retry_context = f"{context}\n\nPrevious attempt was rejected.\nReviewer feedback: {feedback}\n\nPlease redo the task."
                    result = self.executor.act(retry_context)
                    review_result = self._review_task(task, result)
                    self.state.iterations += 1

                completed[task["id"]] = result
                self.state.results.append({
                    "task_id": task["id"],
                    "result": result,
                    "review": review_result
                })
                print(f"[Orchestrator] Task {task['id']} {'approved' if review_result.get('status') == 'approved' else 'completed with flags'}")

        # Phase 4: Synthesis
        final = self._synthesize(completed)
        return {"plan": tasks, "results": self.state.results, "final_output": final}

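    def _review_task(self, task, result) -> dict:
        review_text = self.reviewer.act(
            f"Task: {task['description']}\n\nResult:\n{result}"
        )
        try:
            review = json.loads(review_text)
        except json.JSONDecodeError:
            review = None
        if isinstance(review, dict) and "status" in review:
            return review
        # Unparseable verdict: keep the pipeline moving, but record that the
        # result was never actually reviewed instead of reporting it as approved
        return {"status": "unreviewed", "issues": ["Reviewer response was not valid JSON"]}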

    def _synthesize(self, completed: dict) -> str:
        all_results = "\n\n".join(
            f"### Task {tid}\n{result}" for tid, result in completed.items()
        )
        return self.executor.act(
            f"Synthesize these task results into a cohesive final answer:\n\n{all_results}"
        )
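
The execute, review, retry cycle inside the orchestrator can also be studied in isolation. The sketch below is an illustration under stated assumptions, not the orchestrator's own code: run_with_review, fake_execute, and fake_review are hypothetical names, and the executor and reviewer are plain callables so the loop runs without any API access.

```python
from typing import Callable, Tuple


def run_with_review(
    description: str,
    execute: Callable[[str], str],
    review: Callable[[str], dict],
    max_attempts: int = 3,
) -> Tuple[str, dict, int]:
    """Execute a task, re-running it with reviewer feedback until approved
    or until max_attempts executions have been made."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    context = f"Task: {description}"
    for attempt in range(1, max_attempts + 1):
        result = execute(context)
        verdict = review(result)
        if verdict.get("status") == "approved":
            break
        # Carry both the listed issues and the suggestion into the next attempt
        issues = "; ".join(verdict.get("issues", [])) or "none listed"
        suggestion = verdict.get("suggestion", "Improve the result")
        context = (
            f"Task: {description}\nPrevious attempt was rejected.\n"
            f"Issues: {issues}\nReviewer feedback: {suggestion}\n\nPlease redo the task."
        )
    return result, verdict, attempt


def fake_execute(context: str) -> str:
    # Hypothetical executor: answers wrongly until it sees reviewer feedback
    return "4" if "Reviewer feedback" in context else "5"


def fake_review(result: str) -> dict:
    # Hypothetical reviewer: approves only the correct sum
    if result == "4":
        return {"status": "approved"}
    return {"status": "rejected", "issues": ["2 + 2 is not 5"],
            "suggestion": "Recheck the addition"}


answer, verdict, attempts = run_with_review("Add 2 and 2", fake_execute, fake_review)
```

Returning the attempt count alongside the verdict makes it easy to log how often the reviewer forces a redo, which is a useful signal when tuning prompts.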

Step 4: Using the System

orchestrator = MultiAgentOrchestrator(perceiver, executor, reviewer)

result = orchestrator.run(
    "Analyze the time series data quality of our IoT sensor dataset. "
    "Check for missing values, outliers, seasonal patterns, and sensor drift. "
    "Provide a quality score and remediation recommendations."
)

print(result["final_output"])
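
Beyond the final answer, the dictionary returned by run() also carries the per-task review verdicts. As a rough sketch, the helper below lists tasks that were not approved; flagged_tasks and sample_output are hypothetical, and the sample data is made up purely to show the shape of the return value.

```python
def flagged_tasks(run_output: dict) -> list:
    """Return (task_id, issues) pairs for results the reviewer did not approve."""
    flagged = []
    for entry in run_output.get("results", []):
        review = entry.get("review", {})
        if review.get("status") != "approved":
            flagged.append((entry["task_id"], review.get("issues", [])))
    return flagged


# Hypothetical output shaped like the return value of orchestrator.run()
sample_output = {
    "plan": [
        {"id": 1, "description": "Check missing values", "type": "validation", "depends_on": []},
        {"id": 2, "description": "Detect outliers", "type": "computation", "depends_on": []},
    ],
    "results": [
        {"task_id": 1, "result": "placeholder result", "review": {"status": "approved"}},
        {"task_id": 2, "result": "placeholder result",
         "review": {"status": "rejected", "issues": ["No threshold given"],
                    "suggestion": "State the outlier threshold"}},
    ],
    "final_output": "placeholder summary",
}

needs_attention = flagged_tasks(sample_output)
for task_id, issues in needs_attention:
    print(f"Task {task_id} needs attention: {issues}")
```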

Advanced: Parallel Execution with ThreadPoolExecutor

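Tasks that do not depend on one another can run concurrently. Add the method below to MultiAgentOrchestrator (it relies on the ThreadPoolExecutor and as_completed imports from Step 3). It ignores depends_on, so pass it only tasks whose dependencies have already completed: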

def execute_parallel_tasks(self, tasks: list, context: str = "") -> dict:
    results = {}
    with ThreadPoolExecutor(max_workers=3) as pool:
        futures = {}
        for task in tasks:
            futures[pool.submit(self.executor.act, f"{context}\nTask: {task['description']}")] = task
        for future in as_completed(futures):
            task = futures[future]
            results[task["id"]] = future.result()
    return results
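
To combine parallelism with dependencies, one option is to run the plan in waves: each wave contains every task whose dependencies are already complete, and the tasks within a wave run concurrently. The following is a sketch under that assumption; run_in_waves and fake_worker are hypothetical names, and the worker is a plain function standing in for a call to the executor agent.

```python
from concurrent.futures import ThreadPoolExecutor


def run_in_waves(tasks: list, worker, max_workers: int = 3):
    """Run tasks in dependency order, executing each ready batch in parallel.

    worker(task, completed) receives the task dict and the results so far.
    Returns (completed, waves), where waves lists the task IDs run in each batch.
    """
    pending = {t["id"]: t for t in tasks}
    completed = {}
    waves = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while pending:
            ready = [
                t for t in pending.values()
                if all(d in completed for d in t.get("depends_on", []))
            ]
            if not ready:
                raise ValueError(f"unresolvable dependencies for tasks: {list(pending)}")
            # pool.map preserves input order, so outputs line up with `ready`.
            # `completed` is only mutated after the whole wave has finished.
            outputs = list(pool.map(lambda t: worker(t, completed), ready))
            for task, output in zip(ready, outputs):
                completed[task["id"]] = output
                del pending[task["id"]]
            waves.append([t["id"] for t in ready])
    return completed, waves


def fake_worker(task: dict, completed: dict) -> str:
    # Hypothetical worker: reports how many dependency results it could see
    inputs = [completed[d] for d in task.get("depends_on", [])]
    return f"task {task['id']} saw {len(inputs)} inputs"


demo_tasks = [
    {"id": 1, "description": "Check missing values", "depends_on": []},
    {"id": 2, "description": "Detect outliers", "depends_on": []},
    {"id": 3, "description": "Compute quality score", "depends_on": [1, 2]},
]
demo_completed, demo_waves = run_in_waves(demo_tasks, fake_worker)
```

Here tasks 1 and 2 form the first wave and task 3 runs alone in the second. Waiting for a whole wave is simpler than fine-grained scheduling, at the cost of idling while the slowest task in each wave finishes.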

Key Design Principles

  1. Role isolation: Each agent has a single responsibility. The Planner never executes; the Critic never generates.
  2. Structured communication: Agents communicate through JSON schemas and explicit handoffs, not free-form chat.
  3. Iterative refinement: The review loop allows correction without restarting the entire pipeline.
  4. Dependency-aware scheduling: Tasks execute in topological order, respecting dependencies.
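
For the fourth principle, Python's standard library (3.9 and later) provides graphlib.TopologicalSorter, which can derive a valid execution order from the depends_on fields and detect cycles. This is a minimal sketch of that idea; execution_order and demo_plan are hypothetical names, and the task dicts follow the schema from the Perceiver prompt.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(tasks: list) -> list:
    """Return task IDs ordered so that every task comes after its dependencies.

    Raises graphlib.CycleError if the dependencies form a cycle.
    """
    # TopologicalSorter expects a mapping of node -> set of predecessors.
    # A dependency ID that is not itself a task is still added as a node.
    graph = {t["id"]: set(t.get("depends_on", [])) for t in tasks}
    return list(TopologicalSorter(graph).static_order())


# Tasks listed out of order on purpose
demo_plan = [
    {"id": 3, "description": "Write report", "depends_on": [1, 2]},
    {"id": 1, "description": "Load data", "depends_on": []},
    {"id": 2, "description": "Clean data", "depends_on": [1]},
]
order = execution_order(demo_plan)

try:
    execution_order([
        {"id": 1, "depends_on": [2]},
        {"id": 2, "depends_on": [1]},
    ])
    cycle_detected = False
except CycleError:
    cycle_detected = True
```

In this example the dependencies form a chain, so the only valid order is 1, 2, 3. When several orders are valid, static_order() returns one of them.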

When to Use Multi-Agent vs. Single Prompt

Use multi-agent when:

  • The task has 3+ distinct phases with different reasoning modes
  • You need verifiable outputs with review cycles
  • Parallelization provides meaningful speedup
  • Different parts of the task benefit from different temperature/settings

Stick with single-prompt when the task is straightforward, latency-sensitive, or the overhead of agent coordination outweighs the quality gains.

What's Next

From here, you can extend the system with tool use (giving agents access to code execution, web search, or database queries), persistent memory between sessions, or dynamic role assignment where the Planner decides how many agents are needed based on task complexity.

The trend in recent research points toward agentic designs for complex, multi-phase work. For tasks that fit the criteria above, the question is less whether to adopt them than how to design them well.