Latency Optimization for Voice Agents
Streaming TTS, response chunking, and first-word latency minimization.
Why Latency Matters for Voice
In text chat, a 3-second delay is acceptable. In voice conversation, anything over 1.5 seconds feels unnatural and breaks the conversational flow.
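Voice latency has three main components: speech-to-text transcription time (STT), LLM processing time (time to first token, or TTFT, plus generation), and text-to-speech synthesis time (TTS). In a basic loop these stages run one after another, so their delays add up, and a saving in any stage shortens the total response time.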
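To make the "TTFT + generation" split concrete, here is a minimal sketch that times a streaming response: it records when the first chunk arrives and when the stream ends. The names `measure_stream` and `fake_llm_stream` are hypothetical, `fake_llm_stream` stands in for a real streaming LLM client, and the sleep durations are invented for illustration.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_stream(chunks: Iterable[str]) -> Tuple[str, float, float]:
    """Consume a stream of text chunks; return (text, ttft_seconds, total_seconds)."""
    start = time.perf_counter()
    ttft = None
    parts = []
    for chunk in chunks:
        if ttft is None:
            # Time to first token: delay before the first chunk arrives.
            ttft = time.perf_counter() - start
        parts.append(chunk)
    total = time.perf_counter() - start
    # An empty stream never produced a first token; report the total instead.
    if ttft is None:
        ttft = total
    return "".join(parts), ttft, total


def fake_llm_stream() -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM call (not a real API)."""
    time.sleep(0.05)  # simulated wait before the first token
    yield "Hello"
    for word in (" there", ", how", " can", " I", " help?"):
        time.sleep(0.01)  # simulated per-token generation time
        yield word


if __name__ == "__main__":
    text, ttft, total = measure_stream(fake_llm_stream())
    print(f"TTFT {ttft * 1000:.0f}ms, full response {total * 1000:.0f}ms: {text!r}")
```

If the TTS stage can start speaking before the full response is generated, TTFT is likely the number that matters most for how fast the agent feels, which is why it is worth tracking separately from total generation time.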
Measuring the Latency Budget
Before optimizing, measure each component. Instrument the loop with timing calls to know where time is actually spent.
import time

# record_until_silence, audio_to_file, transcribe_file, agent, and say are
# assumed to be defined elsewhere (for example, in the voice-loop code).

def voice_loop_timed():
    timing = {}

    # 1. Record (this includes the time the user spends speaking)
    t0 = time.perf_counter()
    audio, sr = record_until_silence()
    timing['record'] = time.perf_counter() - t0

    # 2. Transcribe
    t0 = time.perf_counter()
    audio_path = audio_to_file(audio, sr)
    user_text = transcribe_file(audio_path)
    timing['transcription'] = time.perf_counter() - t0

    # 3. LLM
    t0 = time.perf_counter()
    agent_text = agent.respond(user_text)
    timing['llm'] = time.perf_counter() - t0

    # 4. TTS
    t0 = time.perf_counter()
    say(agent_text)
    timing['tts'] = time.perf_counter() - t0

    # Report each step in milliseconds and as a share of the total
    print('Latency breakdown:')
    total = sum(timing.values())
    for step, duration in timing.items():
        print(f'  {step:15} {duration*1000:.0f}ms ({duration/total*100:.0f}%)')
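The repeated `t0 = ...` / `timing[...] = ...` pairs above can be folded into a small helper. The following is one possible sketch, not part of the lesson's code: `timed` and `format_breakdown` are hypothetical names, and the `time.sleep` calls are placeholders for the real STT and LLM calls.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def timed(timing: Dict[str, float], step: str) -> Iterator[None]:
    """Record the wall-clock duration of the enclosed block under timing[step]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Record the duration even if the step raises an exception.
        timing[step] = time.perf_counter() - start


def format_breakdown(timing: Dict[str, float]) -> str:
    """Render each step in milliseconds and as a percentage of the total."""
    total = sum(timing.values())
    lines = ["Latency breakdown:"]
    for step, duration in timing.items():
        # Guard against an empty or all-zero timing dict.
        share = duration / total * 100 if total > 0 else 0.0
        lines.append(f"  {step:15} {duration * 1000:.0f}ms ({share:.0f}%)")
    return "\n".join(lines)


if __name__ == "__main__":
    timing: Dict[str, float] = {}
    with timed(timing, "transcription"):
        time.sleep(0.02)  # placeholder for the real STT call
    with timed(timing, "llm"):
        time.sleep(0.03)  # placeholder for the real LLM call
    print(format_breakdown(timing))
```

Because the duration is written in a `finally` clause, a step that fails still shows up in the breakdown, which can help when a slow stage is also the one that times out.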