Speech-to-Text with Whisper and Deepgram
Real-time and batch transcription, language detection, and punctuation.
Speech-to-Text in Voice Agents
A voice agent must convert spoken audio to text before the LLM can process it. This step is called Speech-to-Text (STT) or Automatic Speech Recognition (ASR).
Two leading options are OpenAI Whisper, which transcribes uploaded files in batch mode, and Deepgram, which supports real-time streaming along with speaker diarization and word-level timestamps.
OpenAI Whisper: Basic Transcription
The OpenAI API exposes Whisper as the whisper-1 model, which transcribes uploaded audio files. Supported formats include mp3, mp4, wav, webm, m4a, and flac. The maximum file size is 25 MB.
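The format and size limits above can be checked locally before any upload is attempted. The following is a minimal sketch, not part of the OpenAI SDK: check_audio_file, SUPPORTED_EXTENSIONS, and MAX_FILE_BYTES are hypothetical names, the extension list is limited to the formats named in this lesson, and 25 MB is assumed to mean 25 * 1024 * 1024 bytes.

```python
import os

# Assumption: only the formats named in this lesson are listed here.
SUPPORTED_EXTENSIONS = {'.mp3', '.mp4', '.wav', '.webm', '.m4a', '.flac'}
# Assumption: the 25 MB limit is treated as binary megabytes.
MAX_FILE_BYTES = 25 * 1024 * 1024


def check_audio_file(audio_path):
    """Return a list of problems; an empty list means the file passes."""
    problems = []

    # Compare the extension case-insensitively ('.MP3' is accepted too).
    ext = os.path.splitext(audio_path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {ext or '(none)'}")

    if not os.path.isfile(audio_path):
        problems.append('file not found')
    elif os.path.getsize(audio_path) > MAX_FILE_BYTES:
        size_mb = os.path.getsize(audio_path) / (1024 * 1024)
        problems.append(f'file too large: {size_mb:.1f} MB')

    return problems
```

Calling check_audio_file before transcribe_file lets an agent reject or split an oversized recording without spending a network round trip on a request that would fail.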
import os

import openai

# Read the API key from the environment rather than hard-coding it.
client = openai.OpenAI(api_key=os.getenv('OPENAI_API_KEY'))


def transcribe_file(audio_path, language=None):
    # Open in binary mode; the API expects a file object, not a path.
    with open(audio_path, 'rb') as audio_file:
        params = {
            'model': 'whisper-1',
            'file': audio_file,
            'response_format': 'json',  # or 'text', 'srt', 'vtt', 'verbose_json'
        }
        if language:
            # Optional language hint; omit it to let Whisper detect the language.
            params['language'] = language  # e.g., 'en', 'fr', 'de'
        transcript = client.audio.transcriptions.create(**params)
    return transcript.text


# Basic usage
text = transcribe_file('meeting_recording.mp3')
print('Transcribed:', text[:200])
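The 'verbose_json' response format listed in the code comment above returns per-segment timing alongside the text. Below is a minimal sketch of turning such segments into timestamped lines. It assumes each segment is a dict with 'start' and 'end' (in seconds) and 'text' keys; depending on the SDK version, segments may instead be objects with attributes of the same names, in which case the lookups would need adapting. format_timestamp and format_segments are hypothetical helper names, and the sample data is invented for illustration.

```python
def format_timestamp(seconds):
    """Convert a time in seconds to HH:MM:SS.mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f'{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}'


def format_segments(segments):
    """Return one '[start -> end] text' line per segment."""
    lines = []
    for seg in segments:
        start = format_timestamp(seg['start'])
        end = format_timestamp(seg['end'])
        lines.append(f"[{start} -> {end}] {seg['text'].strip()}")
    return lines


# Illustrative data only; not real API output.
sample_segments = [
    {'start': 0.0, 'end': 3.5, 'text': ' Welcome to the meeting.'},
    {'start': 3.5, 'end': 7.25, 'text': ' Let us review the agenda.'},
]
for line in format_segments(sample_segments):
    print(line)
```

Keeping the formatting separate from the API call makes it easy to reuse for logging or for aligning a voice agent's replies with what the user said and when.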