Latency Optimization for Voice Agents

Streaming TTS, response chunking, and first-word latency minimization.

Why Latency Matters for Voice

In text chat, a 3-second delay is acceptable. In voice conversation, a gap of more than about 1.5 seconds between the user finishing and the agent starting to speak feels unnatural and breaks the conversational flow.

Voice latency has three main components: transcription time (STT), LLM processing time (time to first token, or TTFT, plus generation), and TTS synthesis time. In a sequential loop these delays add up, so shaving time off each one noticeably shortens the wait the user hears.
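To make the budget idea concrete, here is a minimal sketch that sums the components named above and compares the total against the 1.5-second threshold. The class name `TurnLatency`, its field names, and the sample numbers are all hypothetical and chosen only for illustration; real values should come from measurement.

```python
from dataclasses import asdict, dataclass

# Threshold from the lesson text: past roughly 1.5 s a reply feels unnatural.
BUDGET_SECONDS = 1.5


@dataclass
class TurnLatency:
    """Per-stage latency for one conversational turn, in seconds (hypothetical helper)."""
    stt: float             # speech-to-text transcription
    llm_ttft: float        # LLM time to first token
    llm_generation: float  # remaining LLM generation after the first token
    tts: float             # text-to-speech synthesis

    def total(self) -> float:
        # In a non-streaming loop the stages run back to back, so they simply add.
        return self.stt + self.llm_ttft + self.llm_generation + self.tts

    def over_budget(self, budget: float = BUDGET_SECONDS) -> bool:
        return self.total() > budget

    def slowest_stage(self) -> str:
        # Name of the stage with the largest share of the total.
        stages = asdict(self)
        return max(stages, key=stages.get)


# Made-up numbers, for illustration only.
turn = TurnLatency(stt=0.4, llm_ttft=0.5, llm_generation=0.9, tts=0.6)
print(f"total={turn.total():.1f}s "
      f"over_budget={turn.over_budget()} "
      f"slowest={turn.slowest_stage()}")
```

With numbers like these the slowest stage is the obvious first target; the point of the sketch is only that the total is a sum, so no single stage has to carry the whole improvement.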

Measuring the Latency Budget

Before optimizing, measure each component. Instrument the loop with timing calls to know where time is actually spent.

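import time

# record_until_silence, audio_to_file, transcribe_file, agent, and say are
# assumed to be defined as in the earlier lessons of this course.

def voice_loop_timed():
    timing = {}

    # 1. Record (this is mostly the user speaking, not system latency)
    t0 = time.perf_counter()
    audio, sr = record_until_silence()
    timing['record'] = time.perf_counter() - t0

    # 2. Transcribe (includes writing the audio to a file)
    t0 = time.perf_counter()
    audio_path = audio_to_file(audio, sr)
    user_text = transcribe_file(audio_path)
    timing['transcription'] = time.perf_counter() - t0

    # 3. LLM (time to first token and generation, measured together)
    t0 = time.perf_counter()
    agent_text = agent.respond(user_text)
    timing['llm'] = time.perf_counter() - t0

    # 4. TTS (if say() blocks until playback ends, this includes playback)
    t0 = time.perf_counter()
    say(agent_text)
    timing['tts'] = time.perf_counter() - t0

    # Keep recording time out of the percentages: it reflects how long the
    # user talked, not how long the agent took to answer.
    response = {step: d for step, d in timing.items() if step != 'record'}
    total = sum(response.values())

    print('Latency breakdown:')
    print(f"  {'record':15} {timing['record']*1000:.0f}ms (user speech, not counted)")
    for step, duration in response.items():
        print(f'  {step:15} {duration*1000:.0f}ms ({duration/total*100:.0f}%)')
    print(f"  {'total':15} {total*1000:.0f}ms")
    return timing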
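The timed loop above measures the LLM step as one number, which hides the split between time to first token and the rest of generation. The sketch below shows one way to separate the two, assuming the agent can expose its reply as an iterator of text chunks. That streaming interface is an assumption (the lesson only shows `agent.respond`), and `time_stream` and `fake_llm_stream` are hypothetical names, with the latter standing in for a real streaming call.

```python
import time
from typing import Iterable, Iterator, Tuple


def time_stream(chunks: Iterable[str]) -> Tuple[str, float, float]:
    """Consume a text stream and return (full_text, seconds_to_first_chunk, total_seconds)."""
    start = time.perf_counter()
    first_chunk_at = None
    parts = []
    for chunk in chunks:
        if first_chunk_at is None:
            # The first chunk arriving marks the time to first token.
            first_chunk_at = time.perf_counter() - start
        parts.append(chunk)
    total = time.perf_counter() - start
    if first_chunk_at is None:
        # Empty stream: nothing arrived, so report the total for both.
        first_chunk_at = total
    return "".join(parts), first_chunk_at, total


def fake_llm_stream() -> Iterator[str]:
    """Stand-in for a streaming LLM reply (hypothetical; swap in a real stream)."""
    for piece in ["Sure, ", "I can ", "help."]:
        time.sleep(0.02)  # simulated per-chunk delay
        yield piece


text, ttft, total = time_stream(fake_llm_stream())
print(f"text={text!r}")
print(f"ttft={ttft * 1000:.0f}ms generation={(total - ttft) * 1000:.0f}ms")
```

If the first-chunk time dominates, the model or prompt is the likely bottleneck; if the remainder dominates, starting speech before generation finishes is probably where the gain is.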
