Text-to-Speech in Agent Responses
OpenAI TTS, ElevenLabs, and Google TTS APIs in agent pipelines.
Why Agents Need TTS
An agent that accepts speech but replies only in text is not a true voice agent. Text-to-Speech (TTS) converts the agent's text responses into spoken audio, completing the voice loop.
Two widely used APIs are OpenAI TTS (fast, affordable, six built-in voices) and ElevenLabs (highly realistic voices, streaming, voice cloning).
OpenAI TTS Basics
OpenAI's TTS API converts text to speech in a single request, typically within seconds. It offers six built-in voices (alloy, echo, fable, onyx, nova, shimmer) and supports MP3, Opus, AAC, and FLAC output formats.
import os

import openai

# Reads the API key from the OPENAI_API_KEY environment variable.
client = openai.OpenAI(api_key=os.getenv('OPENAI_API_KEY'))


def text_to_speech(text, voice='alloy', output_path='response.mp3'):
    """Convert text to speech and save the audio to output_path."""
    response = client.audio.speech.create(
        model='tts-1',          # tts-1 (fast) or tts-1-hd (higher quality)
        voice=voice,            # alloy, echo, fable, onyx, nova, shimmer
        input=text,
        response_format='mp3',  # mp3, opus, aac, flac
    )
    # response.content holds the raw audio bytes.
    with open(output_path, 'wb') as f:
        f.write(response.content)
    print(f'Audio saved to: {output_path}')
    return output_path

# Available voices:
# alloy - neutral, balanced
# echo - warm, conversational
# fable - expressive
# onyx - deep, authoritative
# nova - friendly, upbeat
# shimmer - soft, clear
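Agent responses can be long, and the speech endpoint limits how much text one request may contain. The sketch below is one possible way to split a response into sentence-aligned chunks before synthesis. The helper name split_for_tts and the constant MAX_TTS_CHARS are hypothetical, and the 4096-character limit is an assumption to verify against the current API reference.

```python
import re

# Assumed per-request input limit; verify against the current API reference.
MAX_TTS_CHARS = 4096


def split_for_tts(text, max_chars=MAX_TTS_CHARS):
    """Split text into chunks of at most max_chars, preferring sentence boundaries."""
    # Break after ., ! or ? when followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], ''
    for sentence in sentences:
        if not sentence:
            continue
        # A single sentence longer than the limit is cut at fixed offsets.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ''
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f'{current} {sentence}' if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # The chunk is full: store it and start a new one.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


if __name__ == '__main__':
    # A tiny limit is used here only to make the splitting visible.
    for i, chunk in enumerate(split_for_tts('One. Two! Three?', max_chars=9)):
        print(i, repr(chunk))
```

Each chunk could then be passed to a function such as text_to_speech with its own output path (for example response_0.mp3, response_1.mp3) and played back in order. Splitting at sentence boundaries keeps the joins between audio files in natural pauses.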