Speech-to-Text with Whisper and Deepgram

Real-time and batch transcription, language detection, and punctuation.

Speech-to-Text in Voice Agents

A voice agent must convert spoken audio to text before the LLM can process it. This step is called Speech-to-Text (STT) or Automatic Speech Recognition (ASR).

Two widely used options are OpenAI Whisper, used in this lesson through its file-based, batch API, and Deepgram, which offers streaming, real-time transcription with speaker diarization and word timestamps.

OpenAI Whisper: Basic Transcription

The OpenAI API exposes Whisper as the whisper-1 model, which transcribes uploaded audio files. Supported formats include mp3, mp4, wav, webm, m4a, and flac. The maximum file size is 25 MB.
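Checking a file against the format and size limits stated above before uploading avoids a wasted request. The following is a minimal sketch: the helper name check_audio_file and both constants are hypothetical, the extension list simply mirrors this lesson, and the size limit is assumed to mean 25 MiB.

```python
import os

# Limits as stated in this lesson; confirm them against the current OpenAI docs.
SUPPORTED_EXTENSIONS = {'.mp3', '.mp4', '.wav', '.webm', '.m4a', '.flac'}
MAX_FILE_BYTES = 25 * 1024 * 1024  # assumption: "25 MB" is treated as 25 MiB


def check_audio_file(audio_path):
    """Return a list of problems; an empty list means the file looks uploadable."""
    problems = []

    # Compare the extension case-insensitively
    ext = os.path.splitext(audio_path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f'unsupported extension: {ext or "(none)"}')

    # Only check the size if the file actually exists
    if not os.path.isfile(audio_path):
        problems.append('file not found')
    elif os.path.getsize(audio_path) > MAX_FILE_BYTES:
        problems.append('file exceeds 25MB limit')

    return problems
```

This only inspects the file name and size, not the actual audio encoding, so a mislabeled file can still be rejected by the API. Files over the limit would need to be compressed or split before upload.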

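import os

import openai

# Read the API key from the environment rather than hard-coding it
client = openai.OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

def transcribe_file(audio_path, language=None):
    """Transcribe an audio file with whisper-1 and return the text."""
    with open(audio_path, 'rb') as audio_file:
        params = {
            'model': 'whisper-1',
            'file': audio_file,
            # 'json' and 'verbose_json' return an object with a .text attribute;
            # 'text', 'srt', and 'vtt' return a plain string instead
            'response_format': 'json'
        }
        if language:
            # Optional language hint, e.g., 'en', 'fr', 'de';
            # omit it to let Whisper detect the language
            params['language'] = language

        transcript = client.audio.transcriptions.create(**params)

    return transcript.text

# Basic usage
text = transcribe_file('meeting_recording.mp3')
print('Transcribed:', text[:200])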
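When no language hint is passed, Whisper detects the language itself, and the 'verbose_json' response format reports what it found. The sketch below assumes the response object exposes a language field alongside text. The function name transcribe_with_language is hypothetical, and the client is passed in as a parameter so the function can be exercised with a stub.

```python
def transcribe_with_language(client, audio_path):
    """Transcribe a file and return (text, detected_language).

    `client` is expected to be an openai.OpenAI instance, or any object
    with the same audio.transcriptions.create interface.
    """
    with open(audio_path, 'rb') as audio_file:
        result = client.audio.transcriptions.create(
            model='whisper-1',
            file=audio_file,
            # 'verbose_json' adds metadata such as the detected language
            response_format='verbose_json',
        )
    return result.text, result.language


# Example usage (requires the openai package and a valid API key):
# import openai
# client = openai.OpenAI()
# text, lang = transcribe_with_language(client, 'meeting_recording.mp3')
# print(lang, text[:200])
```

A voice agent could use the detected language to pick a matching voice or system prompt for the reply, though detection on very short clips may be unreliable.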
