Audio + Text Agent Workflows
Transcription → reasoning → audio response pipelines end-to-end.
Audio-Text Agent Pipeline Overview
An audio-text agent pipeline converts between the spoken and written worlds. The canonical flow: audio in → Whisper transcription → text agent → TTS audio out. This enables voice assistants, call analysis, meeting summarisation, and hands-free interfaces.
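The canonical flow above can be sketched as three pluggable stages. This is a minimal illustration, not a definitive implementation: `VoicePipeline` and its `transcribe`, `respond`, and `synthesize` fields are hypothetical names, and the stages here are stubs standing in for a speech-to-text model, a text agent, and a TTS model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoicePipeline:
    """Chains the three stages: speech-to-text, text agent, text-to-speech."""
    transcribe: Callable[[bytes], str]   # audio bytes -> transcript
    respond: Callable[[str], str]        # transcript -> agent reply text
    synthesize: Callable[[str], bytes]   # reply text -> audio bytes

    def run(self, audio_in: bytes) -> dict:
        transcript = self.transcribe(audio_in)
        reply = self.respond(transcript)
        audio_out = self.synthesize(reply)
        # Keep the intermediate text so it can be logged or summarised.
        return {'transcript': transcript, 'reply': reply, 'audio_out': audio_out}


# Stub stages for illustration only (hypothetical); real stages would call
# a transcription model, a language model, and a TTS model.
pipeline = VoicePipeline(
    transcribe=lambda audio: audio.decode('utf-8'),
    respond=lambda text: f'You said: {text}',
    synthesize=lambda text: text.encode('utf-8'),
)
result = pipeline.run(b'hello')
print(result['reply'])
```

Passing the stages in as callables keeps each one replaceable and lets the flow be tested without any network calls.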
Supported Audio Formats
OpenAI Whisper accepts: mp3, mp4, mpeg, mpga, m4a, wav, webm. Maximum file size is 25 MB. For larger files you must split or compress the audio before sending.
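For uncompressed WAV input, splitting can be sketched with Python's standard `wave` module. The helper name `split_wav`, the chunk file naming, and the 1 KB header allowance are assumptions made for this example; other formats would need an external tool.

```python
import os
import wave

MAX_FILE_SIZE_BYTES = 25 * 1024 * 1024  # 25 MB limit stated above


def split_wav(path: str, out_dir: str, max_bytes: int = MAX_FILE_SIZE_BYTES) -> list:
    """Split a PCM WAV file into consecutive chunks no larger than max_bytes."""
    os.makedirs(out_dir, exist_ok=True)
    chunk_paths = []
    with wave.open(path, 'rb') as src:
        params = src.getparams()
        bytes_per_frame = params.nchannels * params.sampwidth
        # Reserve 1 KB of headroom for the WAV header (an assumption;
        # a plain PCM header written by the wave module is 44 bytes).
        frames_per_chunk = (max_bytes - 1024) // bytes_per_frame
        if frames_per_chunk <= 0:
            raise ValueError('max_bytes is too small for this audio format')
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            chunk_path = os.path.join(out_dir, f'chunk_{index:03d}.wav')
            with wave.open(chunk_path, 'wb') as dst:
                dst.setparams(params)    # same channels, width, sample rate
                dst.writeframes(frames)  # frame count in header is patched on close
            chunk_paths.append(chunk_path)
            index += 1
    return chunk_paths
```

A call such as `split_wav('meeting.wav', 'chunks')` returns the chunk paths in order. This sketch cuts at fixed frame counts, so a boundary may fall mid-word; splitting on silence would likely give cleaner transcripts but is not shown here.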
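Record in or convert to 16 kHz mono where possible. Whisper models operate on 16 kHz mono audio, so higher sample rates and extra channels add file size without improving transcription quality.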
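One way to convert to 16 kHz mono is to call ffmpeg from Python. This sketch assumes ffmpeg is installed and on the PATH; the function names `build_ffmpeg_command` and `convert_to_16k_mono` and the example filenames are hypothetical.

```python
import subprocess


def build_ffmpeg_command(src: str, dst: str) -> list:
    """Build an ffmpeg command that resamples audio to 16 kHz mono."""
    return [
        'ffmpeg',
        '-y',            # overwrite dst if it already exists
        '-i', src,       # input file
        '-ar', '16000',  # output sample rate: 16 kHz
        '-ac', '1',      # output channels: mono
        dst,
    ]


def convert_to_16k_mono(src: str, dst: str) -> None:
    # Assumes ffmpeg is available; raises CalledProcessError on failure.
    subprocess.run(build_ffmpeg_command(src, dst), check=True)


print(build_ffmpeg_command('meeting.m4a', 'meeting_16k.wav'))
```

At 16 kHz, mono, 16-bit PCM, audio takes 16,000 × 2 = 32,000 bytes per second, so a 25 MB WAV file holds roughly 13.6 minutes. A compressed output format such as mp3 fits considerably more.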
import os

# Extensions accepted by Whisper, per the list above.
SUPPORTED_FORMATS = {'.mp3', '.mp4', '.mpeg', '.mpga', '.m4a', '.wav', '.webm'}
MAX_FILE_SIZE_BYTES = 25 * 1024 * 1024  # 25 MB

def validate_audio_file(path: str) -> dict:
    """Check a file's extension and size before sending it for transcription."""
    ext = os.path.splitext(path)[1].lower()
    exists = os.path.isfile(path)
    size = os.path.getsize(path) if exists else 0
    return {
        'path': path,
        'exists': exists,
        'format_ok': ext in SUPPORTED_FORMATS,
        # A missing file must not pass the size check.
        'size_ok': exists and size <= MAX_FILE_SIZE_BYTES,
        'size_mb': round(size / 1024 / 1024, 2),
        'extension': ext,
    }

# Usage:
info = validate_audio_file('meeting.wav')
print(info)