Cost and Latency Tradeoffs
Thinking token budgets, inference costs, and hybrid routing strategies.
The Cost-Quality-Latency Triangle
In LLM system design, there is a fundamental triangle: cost, quality, and latency. You can optimize for at most two of the three at any given time.
- Low cost + High quality = Slow (e.g. a cheaper reasoning model such as o3-mini given a long thinking budget)
- Low cost + Low latency = Lower quality (small/fast models)
- High quality + Low latency = Expensive (reasoning model with streaming)
Every architecture decision is a trade-off within this triangle.
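One way to make the triangle concrete is to fix limits on two corners and see what is left of the third. The sketch below is only an illustration: the tier names and every cost, latency, and quality figure in it are hypothetical placeholders, not measurements of any real model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    name: str
    cost_per_call: float  # USD per call (hypothetical)
    latency_s: float      # seconds per call (hypothetical)
    quality: float        # score from 0 to 1 (hypothetical)


# Hypothetical tiers; all numbers are placeholders for illustration only.
TIERS = [
    Tier("small-fast", cost_per_call=0.001, latency_s=1.0, quality=0.60),
    Tier("mid", cost_per_call=0.010, latency_s=3.0, quality=0.75),
    Tier("reasoning", cost_per_call=0.600, latency_s=30.0, quality=0.95),
]


def pick_tier(max_cost: float, max_latency_s: float) -> Optional[Tier]:
    """Return the highest-quality tier that fits both limits, or None."""
    feasible = [
        t for t in TIERS
        if t.cost_per_call <= max_cost and t.latency_s <= max_latency_s
    ]
    return max(feasible, key=lambda t: t.quality, default=None)


# Tight cost and latency limits cap the quality that can be reached.
print(pick_tier(max_cost=0.01, max_latency_s=5.0))
# Relaxing both limits makes the highest-quality tier available.
print(pick_tier(max_cost=1.00, max_latency_s=60.0))
```

Tightening either limit can only shrink the feasible set, so the best reachable quality never goes up, which is the trade-off the triangle describes.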
Reasoning Model Pricing
Reasoning models bill their thinking tokens on top of the standard input and output tokens, typically at the output-token rate. The cost of a reasoning model call is therefore the sum of three parts: input tokens + thinking tokens + output tokens.
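Compared to fast/small models, the gap is large. At the rough prices listed below, o3 costs roughly 67x as much per token as GPT-4o-mini, and Claude Opus costs roughly 60x as much per token as Claude Haiku. Extended thinking widens the gap further, because every thinking token is billed on top of the visible output.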
# Rough cost estimates (2025 pricing, may change)
# Source: provider pricing pages
PRICING = {
    # model: (input $/1M tokens, output $/1M tokens)
    'gpt-4o-mini': (0.15, 0.60),
    'gpt-4o': (2.50, 10.00),
    'o3-mini': (1.10, 4.40),
    'o3': (10.0, 40.00),
    'claude-haiku-4-5': (0.25, 1.25),
    'claude-sonnet-4-5': (3.00, 15.00),
    'claude-opus-4-5': (15.0, 75.00),
}


def estimate_cost(model, input_tokens, output_tokens, thinking_tokens=0):
    """Return the estimated cost in USD of a single call."""
    inp_price, out_price = PRICING[model]
    # Thinking tokens are billed at the output-token rate
    total_out = output_tokens + thinking_tokens
    cost = (input_tokens / 1e6 * inp_price) + (total_out / 1e6 * out_price)
    return cost


# A single hard question with 8000 thinking tokens:
cost = estimate_cost('claude-opus-4-5', 500, 300, thinking_tokens=8000)
print(f'Cost per call: ${cost:.4f}')  # Cost per call: $0.6300
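To see where the money goes in a call like the one above, the estimate can be split into its three parts. This is a minimal sketch: `PRICES` and `cost_breakdown` are hypothetical names introduced here, the two price pairs are copied from the rough table above, and the code keeps the same assumption that thinking tokens are billed at the output rate.

```python
# Prices copied from the rough table above: (input $/1M tokens, output $/1M tokens).
PRICES = {
    "gpt-4o-mini": (0.15, 0.60),
    "claude-opus-4-5": (15.0, 75.00),
}


def cost_breakdown(model, input_tokens, output_tokens, thinking_tokens=0):
    """Split a call's estimated cost (USD) into input, thinking, and output parts."""
    inp_price, out_price = PRICES[model]
    return {
        "input": input_tokens / 1e6 * inp_price,
        # Assumption carried over from above: thinking is billed at the output rate.
        "thinking": thinking_tokens / 1e6 * out_price,
        "output": output_tokens / 1e6 * out_price,
    }


# Same request as above: 500 input, 300 output, 8000 thinking tokens.
parts = cost_breakdown("claude-opus-4-5", 500, 300, thinking_tokens=8000)
total = sum(parts.values())
for name, usd in parts.items():
    print(f"{name:>8}: ${usd:.4f} ({usd / total:.1%} of total)")

# The same prompt on a small model with no thinking budget, for contrast.
small_total = sum(cost_breakdown("gpt-4o-mini", 500, 300).values())
print(f"small model total: ${small_total:.6f}")
```

Under these assumed prices, the thinking tokens account for about 95% of the estimated cost of the call, so the thinking budget, not the prompt or the visible answer, is the main lever on spend.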