
Audio + Text Agent Workflows

Transcription → reasoning → audio response pipelines end-to-end.

Audio-Text Agent Pipeline Overview

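An audio-text agent pipeline converts speech to text, reasons over that text, and converts the reply back to speech. The canonical flow is: audio in → Whisper transcription → text agent → TTS (text-to-speech) audio out. This pattern underpins voice assistants and hands-free interfaces; call analysis and meeting summarisation use the same first two stages and can stop at the text output.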
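The flow above can be sketched as three pluggable stages. This is a minimal illustration, not a reference implementation: the names `run_audio_pipeline`, `PipelineResult`, and the `fake_*` stand-ins are hypothetical, and in a real system each stage would wrap a transcription service, an agent, and a TTS service respectively.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures; each would wrap a real service in practice.
Transcriber = Callable[[bytes], str]   # audio bytes -> transcript
Agent = Callable[[str], str]           # transcript -> reply text
Synthesizer = Callable[[str], bytes]   # reply text -> audio bytes


@dataclass
class PipelineResult:
    transcript: str
    reply_text: str
    reply_audio: bytes


def run_audio_pipeline(audio: bytes, transcribe: Transcriber,
                       agent: Agent, synthesize: Synthesizer) -> PipelineResult:
    """Run audio in -> transcription -> text agent -> TTS audio out."""
    transcript = transcribe(audio)
    reply_text = agent(transcript)
    reply_audio = synthesize(reply_text)
    return PipelineResult(transcript, reply_text, reply_audio)


# Stand-in stages so the sketch runs without any external service.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_agent(text: str) -> str:
    return f"You said: {text}"


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


result = run_audio_pipeline(b"hello", fake_transcribe, fake_agent, fake_synthesize)
print(result.reply_text)
```

Keeping the intermediate transcript and reply text in the result makes it easy to log or inspect each stage, and lets text-only use cases ignore the audio output.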

Supported Audio Formats

The OpenAI transcription API (Whisper) accepts mp3, mp4, mpeg, mpga, m4a, wav, and webm files, up to a maximum of 25 MB per upload. Larger files must be split into chunks or compressed to a lower bitrate before sending.

Whisper models work on 16 kHz mono audio internally, so recording in or converting to 16 kHz mono before upload usually shrinks the file without hurting transcription quality.
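One common way to do this conversion is with the ffmpeg command-line tool, using its `-ar` (sample rate) and `-ac` (channel count) options. The sketch below assumes ffmpeg is installed and on the PATH; the helper names `build_ffmpeg_command` and `convert_to_16k_mono` and the file names are illustrative.

```python
import subprocess
from typing import List


def build_ffmpeg_command(src: str, dst: str,
                         sample_rate: int = 16000, channels: int = 1) -> List[str]:
    """Build an ffmpeg command that resamples src and writes it to dst."""
    return [
        "ffmpeg",
        "-y",                     # overwrite dst if it already exists
        "-i", src,                # input file
        "-ar", str(sample_rate),  # output sample rate in Hz
        "-ac", str(channels),     # output channel count (1 = mono)
        dst,
    ]


def convert_to_16k_mono(src: str, dst: str) -> None:
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_ffmpeg_command(src, dst), check=True)


# Inspect the command without running it (file names are placeholders).
print(build_ffmpeg_command("meeting.m4a", "meeting_16k.wav"))
```

Passing the command as a list rather than a shell string avoids quoting problems with file names that contain spaces.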

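import os

# Pre-upload check: file extension and size against the limits above.
SUPPORTED_FORMATS = {'.mp3', '.mp4', '.mpeg', '.mpga', '.m4a', '.wav', '.webm'}
MAX_FILE_SIZE_BYTES = 25 * 1024 * 1024  # 25 MB

def validate_audio_file(path: str) -> dict:
    """Report whether a file exists, has a supported extension, and fits the size limit."""
    ext = os.path.splitext(path)[1].lower()
    exists = os.path.isfile(path)
    size = os.path.getsize(path) if exists else 0
    return {
        'path': path,
        'exists': exists,
        'format_ok': ext in SUPPORTED_FORMATS,
        # A missing file must not pass the size check just because its size reads as 0.
        'size_ok': exists and size <= MAX_FILE_SIZE_BYTES,
        'size_mb': round(size / 1024 / 1024, 2),
        'extension': ext,
    }

# Usage:
info = validate_audio_file('meeting.wav')
print(info)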
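When a file fails the size check, it has to be split before upload. For scale, 16 kHz mono 16-bit PCM uses 32,000 bytes per second, so 25 MB holds roughly 13.6 minutes of uncompressed audio. The sketch below splits an uncompressed PCM WAV file with the standard-library `wave` module. The function name `split_wav`, the output naming scheme, and the 1 KB header margin are assumptions made for illustration. Chunks are cut at fixed frame counts, so a boundary may fall mid-word; cutting at silences would need extra logic not shown here.

```python
import wave
from typing import List

MAX_FILE_SIZE_BYTES = 25 * 1024 * 1024  # 25 MB upload limit
HEADER_MARGIN_BYTES = 1024              # assumed allowance for the WAV header


def split_wav(path: str, out_prefix: str,
              max_bytes: int = MAX_FILE_SIZE_BYTES) -> List[str]:
    """Split a PCM WAV file into consecutive chunks that each stay under max_bytes."""
    out_paths: List[str] = []
    with wave.open(path, "rb") as src:
        params = src.getparams()
        bytes_per_frame = params.nchannels * params.sampwidth
        frames_per_chunk = (max_bytes - HEADER_MARGIN_BYTES) // bytes_per_frame
        if frames_per_chunk <= 0:
            raise ValueError("max_bytes is too small to hold any audio")
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break  # reached the end of the source file
            out_path = f"{out_prefix}_{len(out_paths):03d}.wav"
            with wave.open(out_path, "wb") as dst:
                dst.setparams(params)    # same channels, sample width, and rate
                dst.writeframes(frames)  # also writes the correct frame count
            out_paths.append(out_path)
    return out_paths


# Usage (file names are placeholders):
# parts = split_wav("long_meeting.wav", "long_meeting_part")
```

Each chunk can then be transcribed separately and the transcripts concatenated in order.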
