AI Engineering Academy · Lesson

Batching, Model Routing, and Cost Dashboards

Route simple requests to cheaper models like GPT-4o-mini and complex ones to GPT-4o, batch non-urgent requests, and build a cost dashboard tracking spending by feature.

Three More Levers for Cost Optimization

After caching, three more strategies can reduce LLM operating costs: batching (deferring non-urgent requests and submitting them in bulk at a discounted API rate), model routing (sending simple queries to cheaper models and complex ones to more capable models), and cost dashboards (tracking spend per feature to find where optimization has the highest ROI). Combined, these can cut costs by a further 40-60 percent beyond what caching saves, depending on the workload mix.
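
To make the routing and dashboard ideas concrete, here is a minimal sketch. It is not a production router: the keyword list, the length threshold, the `choose_model` and `CostTracker` names, and the prices in `PRICES` are all illustrative assumptions, and the prices in particular should be replaced with current values from the provider's pricing page.

```python
from collections import defaultdict
from dataclasses import dataclass, field

# ASSUMPTION: illustrative prices in USD per 1M tokens as (input, output).
# Replace with the current values from your provider's pricing page.
PRICES = {
    'gpt-4o-mini': (0.15, 0.60),
    'gpt-4o': (2.50, 10.00),
}

# ASSUMPTION: a toy keyword list standing in for a real complexity classifier.
COMPLEX_HINTS = ('analyze', 'compare', 'prove', 'step by step', 'refactor')


def choose_model(prompt: str, max_simple_chars: int = 500) -> str:
    """Route long prompts or prompts with reasoning keywords to the larger model."""
    text = prompt.lower()
    if len(prompt) > max_simple_chars or any(hint in text for hint in COMPLEX_HINTS):
        return 'gpt-4o'
    return 'gpt-4o-mini'


@dataclass
class CostTracker:
    """Accumulates estimated spend per feature, ready for a dashboard to read."""
    spend_by_feature: dict = field(default_factory=lambda: defaultdict(float))

    def record(self, feature: str, model: str, input_tokens: int, output_tokens: int) -> float:
        """Add the estimated cost of one call to the feature's running total."""
        in_price, out_price = PRICES[model]
        cost = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
        self.spend_by_feature[feature] += cost
        return cost

    def top_features(self, n: int = 3) -> list:
        """Return the n most expensive features as (feature, cost) pairs."""
        ranked = sorted(self.spend_by_feature.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:n]
```

In practice the token counts passed to `record` would come from the `usage` field of each API response, and `top_features` shows which features to optimize first.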

OpenAI Batch API: 50% Off for Async Workloads

OpenAI's Batch API accepts a JSONL file of up to 50,000 requests, processes them asynchronously within a 24-hour completion window, and charges 50 percent of the standard price. This is ideal for non-interactive workloads: embedding large document corpora, generating product descriptions, running nightly evaluations, or preprocessing training data. The trade-off is latency: results can arrive hours after submission rather than immediately.
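
Because of the 50,000-request limit, a larger workload has to be split across several batch files. The helper below is one possible way to do that; `write_batch_files` and its file-naming scheme are hypothetical and not part of the OpenAI SDK.

```python
import json
from typing import Iterable, List

MAX_REQUESTS_PER_BATCH = 50_000  # per-file request limit described above


def write_batch_files(
    requests: Iterable[dict],
    prefix: str,
    max_requests: int = MAX_REQUESTS_PER_BATCH,
) -> List[str]:
    """Write requests to numbered JSONL files with at most max_requests lines each.

    Returns the list of file paths that were created, in order.
    """
    paths: List[str] = []
    handle = None
    count = 0
    try:
        for req in requests:
            # Start a new file at the beginning and whenever the current one is full.
            if handle is None or count >= max_requests:
                if handle is not None:
                    handle.close()
                path = f'{prefix}_{len(paths):04d}.jsonl'
                handle = open(path, 'w', encoding='utf-8')
                paths.append(path)
                count = 0
            handle.write(json.dumps(req) + '\n')
            count += 1
    finally:
        if handle is not None:
            handle.close()
    return paths
```

Each returned path can then be uploaded and submitted as its own batch. This sketch only enforces the request count; any file-size limit the API applies would need a separate check.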

import json
from openai import OpenAI

client = OpenAI()

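# Hypothetical input: replace with the documents you want summarized
documents_to_process = ['First document text...', 'Second document text...']

# Prepare batch requests; custom_id lets you match each result to its input later
requests = [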
    {
        'custom_id': f'req_{i}',
        'method': 'POST',
        'url': '/v1/chat/completions',
        'body': {
            'model': 'gpt-4o-mini',
            'messages': [
                {'role': 'user', 'content': f'Summarize: {document}'}
            ],
            'max_tokens': 150,
        }
    }
    for i, document in enumerate(documents_to_process)
]

# Write JSONL batch file
with open('/tmp/batch_requests.jsonl', 'w') as f:
    for req in requests:
        f.write(json.dumps(req) + '\n')

# Upload and submit batch
with open('/tmp/batch_requests.jsonl', 'rb') as f:
    batch_file = client.files.create(file=f, purpose='batch')

batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint='/v1/chat/completions',
    completion_window='24h',
)
print(f'Batch {batch.id} submitted, status: {batch.status}')
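
Submitting the batch is only half of the workflow: the results still have to be collected once processing finishes. The sketch below polls for a terminal status and then maps each `custom_id` to its summary. The helper names `wait_for_batch` and `parse_batch_output` are hypothetical, the polling interval is arbitrary, and the code assumes the output file holds one JSON object per line with `custom_id`, `response` (containing `status_code` and `body`), and `error` fields.

```python
import json
import time

# Statuses after which the batch will not change any further.
TERMINAL_STATUSES = {'completed', 'failed', 'expired', 'cancelled'}


def wait_for_batch(client, batch_id, poll_seconds=60.0, sleep=time.sleep):
    """Poll client.batches.retrieve until the batch reaches a terminal status."""
    while True:
        batch = client.batches.retrieve(batch_id)
        if batch.status in TERMINAL_STATUSES:
            return batch
        sleep(poll_seconds)


def parse_batch_output(jsonl_text):
    """Map custom_id -> assistant message text for each successful result line."""
    results = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        response = record.get('response') or {}
        # Skip requests that errored or returned a non-200 status.
        if record.get('error') or response.get('status_code') != 200:
            continue
        message = response['body']['choices'][0]['message']
        results[record['custom_id']] = message['content']
    return results
```

With a real client, the intended flow is to call `wait_for_batch(client, batch.id)`, check that the status is `completed`, read the output with `client.files.content(batch.output_file_id).text`, and pass that text to `parse_batch_output`. Output lines are not guaranteed to come back in input order, which is why the results are keyed by `custom_id`.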
