AI Engineering Academy · Lesson

Measuring LLM Latency: TTFT and TPOT

Define time to first token and time per output token as the two key latency metrics, instrument your application to measure both, and establish per-endpoint SLA targets.

Why LLM Latency Has Two Components

Measuring LLM latency as a single number is misleading. A streamed response has two distinct phases: the time until the first token arrives (time to first token, TTFT), which determines perceived responsiveness, and the time it takes to generate each subsequent token (time per output token, TPOT), which determines output speed. A model can have a low TTFT but a high TPOT, so long responses feel sluggish even though the first token appeared almost instantly.
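To make the split concrete, end-to-end latency for a streamed response can be approximated as TTFT plus one TPOT for every token after the first. The sketch below is illustrative only: `estimate_total_latency` is a hypothetical helper, the numbers are made up, and it assumes a constant per-token time, which real systems only approximate.

```python
def estimate_total_latency(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    """Approximate end-to-end latency of a streamed response, in seconds."""
    if output_tokens < 1:
        raise ValueError("output_tokens must be at least 1")
    # The first token costs TTFT; each later token costs roughly one TPOT.
    return ttft_s + tpot_s * (output_tokens - 1)


# Hypothetical numbers for two models producing a 501-token answer.
fast_start_slow_stream = estimate_total_latency(ttft_s=0.25, tpot_s=0.08, output_tokens=501)
slow_start_fast_stream = estimate_total_latency(ttft_s=1.0, tpot_s=0.02, output_tokens=501)
```

With these assumed values, the model that starts faster takes roughly 40 seconds to finish, while the model with the slower start finishes in roughly 11 seconds. This is why a single latency number hides which phase needs attention.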

TTFT: Time to First Token

Time to First Token (TTFT) is the duration from sending the API request to receiving the very first token of the response. Measured from the client, it includes network latency, queuing time at the inference server, and prefill time (processing the input tokens). TTFT dominates perceived responsiveness: users notice when nothing appears for more than 1-2 seconds, regardless of how fast tokens stream afterward.

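import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def measure_ttft(prompt: str) -> float:
    """Return seconds from sending the request to the first content token."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model='gpt-4o',
        messages=[{'role': 'user', 'content': prompt}],
        stream=True
    )
    for chunk in stream:
        # Some chunks carry no choices or only role metadata; skip them.
        if chunk.choices and chunk.choices[0].delta.content:
            return time.perf_counter() - start  # stop at the first token
    # Avoid returning a bogus value if the stream produced no content.
    raise RuntimeError('stream ended before any content token arrived')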
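
The same streaming loop can also yield TPOT if it keeps reading past the first token. The sketch below is one possible way to do that, under two assumptions: TPOT is taken as the time between the first and last content piece divided by the number of later pieces, and each streamed piece is counted as one token, which is only a proxy because a chunk may carry more than one token. `StreamTiming` and `measure_stream` are hypothetical names, and the function accepts any iterable of text pieces so it can be exercised without a network call.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class StreamTiming:
    ttft_s: float            # request start -> first non-empty piece
    tpot_s: Optional[float]  # mean gap between later pieces; None if only one piece
    pieces: int              # non-empty pieces seen (a proxy for tokens)


def measure_stream(
    pieces: Iterable[str],
    start: float,
    clock: Callable[[], float] = time.perf_counter,
) -> StreamTiming:
    """Time a stream of text pieces whose request was sent at `start`."""
    first = last = None
    count = 0
    for piece in pieces:
        if not piece:
            continue  # skip empty deltas such as role-only chunks
        now = clock()
        if first is None:
            first = now
        last = now
        count += 1
    if first is None:
        raise RuntimeError("stream ended without producing any content")
    # Average the time after the first piece over the pieces that followed it.
    tpot = (last - first) / (count - 1) if count > 1 else None
    return StreamTiming(ttft_s=first - start, tpot_s=tpot, pieces=count)
```

To use it with the client above, record `start = time.perf_counter()` immediately before calling `client.chat.completions.create(..., stream=True)`, then pass a generator that yields `chunk.choices[0].delta.content or ""` for each chunk that has choices. The injectable `clock` parameter is a design choice that lets the timing logic be tested deterministically.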
