Cost and Latency Tradeoffs

Thinking token budgets, inference costs, and hybrid routing strategies.

The Cost-Quality-Latency Triangle

In LLM system design, cost, quality, and latency form a triangle. As a rule of thumb, you can optimize for at most two of the three at any given time.

Low cost + High quality = Slow (e.g. a cheaper reasoning model such as o3-mini given a large thinking budget)
Low cost + Low latency = Lower quality (small/fast models)
High quality + Low latency = Expensive (top-tier models; streaming improves perceived latency, not total generation time)

Every architecture decision is a trade-off within this triangle.
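
One way to make the triangle concrete is a small router that sends each request to a model tier depending on which constraint is tightest. The sketch below is hypothetical: the `Request` fields, the tier names, and the 5-second and $0.05 thresholds are illustrative assumptions, not values from any provider or from this lesson.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt: str
    needs_reasoning: bool   # hypothetical flag, e.g. set by a cheap classifier
    max_latency_s: float    # latency the caller can tolerate
    max_cost_usd: float     # per-call cost ceiling


def pick_tier(req: Request) -> str:
    """Return an illustrative tier name: 'small', 'standard', or 'reasoning'."""
    if not req.needs_reasoning:
        # Easy request: take low cost + low latency, accept lower quality.
        return "small"
    if req.max_latency_s < 5.0 or req.max_cost_usd < 0.05:
        # Hard request under a tight budget: compromise on a middle tier.
        # Both thresholds are made-up values for illustration.
        return "standard"
    # Hard request with room to spare: pay for quality in money and time.
    return "reasoning"


print(pick_tier(Request("Summarize this email", False, 1.0, 0.01)))
print(pick_tier(Request("Prove this lemma", True, 60.0, 1.00)))
```

In a real system the thresholds would come from measured latency and per-call cost for each tier rather than fixed constants.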

Reasoning Model Pricing

Thinking tokens are billed on top of the standard input and output tokens, typically at the output-token rate. The cost of a reasoning model call is therefore the sum of three parts: input tokens + thinking tokens + output tokens.

Compared to fast/small models, the gap is large. Going by the rough figures in the table below, o3 is about 67x the per-token price of GPT-4o-mini, and claude-opus-4-5 is about 60x the per-output-token price of claude-haiku-4-5.

# Rough cost estimates (2025 pricing, may change)
# Source: provider pricing pages

PRICING = {
    # (input $/1M tokens, output $/1M tokens)
    'gpt-4o-mini':       (0.15,   0.60),
    'gpt-4o':            (2.50,  10.00),
    'o3-mini':           (1.10,   4.40),
    'o3':                (10.0,  40.00),
    'claude-haiku-4-5':  (0.25,   1.25),
    'claude-sonnet-4-5': (3.00,  15.00),
    'claude-opus-4-5':   (15.0,  75.00),
}

def estimate_cost(model, input_tokens, output_tokens, thinking_tokens=0):
    """Return the estimated USD cost of one call to `model`."""
    inp_price, out_price = PRICING[model]
    # Thinking tokens are billed at the output-token rate
    total_out = output_tokens + thinking_tokens
    return (input_tokens / 1e6 * inp_price) + (total_out / 1e6 * out_price)

# A single hard question with 8000 thinking tokens:
cost = estimate_cost('claude-opus-4-5', 500, 300, thinking_tokens=8000)
print(f'Cost per call: ${cost:.4f}')  # Cost per call: $0.6300
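
To see how the thinking budget drives the bill, the sketch below sweeps several budgets for the same 500-token prompt and 300-token answer. It is a minimal illustration, assuming thinking tokens are billed at the output rate and that the model uses its whole budget. The two prices are copied from the claude-opus-4-5 row of the rough table above, and the helper names `call_cost` and `thinking_share` are made up for this example.

```python
# Rough prices in $ per 1M tokens, copied from the claude-opus-4-5 row above.
INPUT_PRICE, OUTPUT_PRICE = 15.0, 75.0


def call_cost(input_tokens, output_tokens, thinking_tokens):
    """USD cost of one call, with thinking tokens billed at the output rate."""
    billed_out = output_tokens + thinking_tokens
    return input_tokens / 1e6 * INPUT_PRICE + billed_out / 1e6 * OUTPUT_PRICE


def thinking_share(input_tokens, output_tokens, thinking_tokens):
    """Fraction of the call's cost that comes from thinking tokens."""
    total = call_cost(input_tokens, output_tokens, thinking_tokens)
    thinking = thinking_tokens / 1e6 * OUTPUT_PRICE
    return thinking / total if total else 0.0


# Assumes the model consumes the full budget, which is an upper bound in practice.
for budget in (0, 1000, 4000, 8000, 16000):
    cost = call_cost(500, 300, budget)
    share = thinking_share(500, 300, budget)
    print(f"{budget:>6} thinking tokens: ${cost:.4f} per call ({share:.0%} from thinking)")
```

At these prices the prompt and the visible answer are a small part of the cost once the budget reaches a few thousand tokens, so capping the thinking budget is the main lever on per-call cost.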
