Benchmark Suites: SWE-Bench, GAIA, ToolBench
Public benchmarks for coding agents (SWE-Bench), general assistants (GAIA), and tool use (ToolBench).
Public Benchmarks
For comparing models and frameworks across teams, public benchmarks are essential. They cover common agent capabilities:
- SWE-Bench — software engineering tasks
- GAIA — general assistant tasks
- ToolBench / BFCL — tool use
- WebArena — browser navigation
SWE-Bench
SWE-Bench tests whether an agent can solve real GitHub issues:
- 2,294 real issues drawn from 12 popular Python repositories
- The agent must produce a patch that makes the issue's hidden failing tests pass without breaking tests that already passed
- Top frontier agents now solve 50-70% of tasks, usually reported on the human-validated SWE-Bench Verified subset (the best systems solved <5% in 2023)
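The pass criterion above can be sketched in a few lines of Python. This is a minimal illustration, not the official SWE-Bench harness: the `InstanceResult` class, the helper names, and the sample data are all hypothetical. The only things taken from the benchmark are the two test groups, which the dataset calls `FAIL_TO_PASS` (tests the fix should turn green) and `PASS_TO_PASS` (tests that must keep passing).

```python
from dataclasses import dataclass


@dataclass
class InstanceResult:
    """Test outcomes for one task after applying the agent's patch (hypothetical structure)."""
    instance_id: str
    fail_to_pass: dict[str, bool]  # tests that failed before the fix; True = passes now
    pass_to_pass: dict[str, bool]  # tests that passed before the fix; True = still passes


def is_resolved(result: InstanceResult) -> bool:
    """A task counts as solved only if every target test passes and nothing regresses."""
    return (
        bool(result.fail_to_pass)  # require at least one target test
        and all(result.fail_to_pass.values())
        and all(result.pass_to_pass.values())
    )


def resolve_rate(results: list[InstanceResult]) -> float:
    """Fraction of tasks solved; this is the headline percentage reported for an agent."""
    if not results:
        return 0.0
    return sum(is_resolved(r) for r in results) / len(results)


# Made-up outcomes for three tasks (instance IDs and test names are illustrative only).
results = [
    InstanceResult("demo__repo-1", {"test_bug": True}, {"test_old": True}),   # solved
    InstanceResult("demo__repo-2", {"test_bug": False}, {"test_old": True}),  # fix did not work
    InstanceResult("demo__repo-3", {"test_bug": True}, {"test_old": False}),  # fix broke old behavior
]

print(f"Resolved {resolve_rate(results):.0%} of tasks")
```

Note that the third task fails even though its target test passes: a patch that fixes the issue but breaks existing behavior does not count as a solution.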