AI Prompt Engineering · Lesson

Multimodal Voice and Text Agents

Coordinating spoken responses with on-screen text in voice agent systems.

Voice-Only vs Multimodal Contexts

Voice AI agents operate in two fundamentally different contexts:

Voice-only: Smart speakers, IVR, phone calls — users hear audio only, no screen
Multimodal: Mobile apps, web apps, car dashboards — users can see a screen AND hear audio simultaneously

These contexts require different response strategies. In a voice-only context, everything the user needs must be spoken. In a multimodal context, you can coordinate what is spoken with what is displayed on screen.
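
As a rough illustration of this split, the sketch below picks a response strategy from the device type. The device labels, the `Context` enum, and the `plan_response` helper are all hypothetical names chosen for this example, not part of any particular voice platform.

```python
from enum import Enum
from typing import Optional, TypedDict


class Context(Enum):
    VOICE_ONLY = "voice_only"
    MULTIMODAL = "multimodal"


# Hypothetical device labels, based on the examples listed above.
MULTIMODAL_DEVICES = {"mobile_app", "web_app", "car_dashboard"}


class ResponsePlan(TypedDict):
    speech: str
    display: Optional[str]


def detect_context(device: str) -> Context:
    """Treat unknown devices as voice-only: a fully spoken reply works everywhere."""
    return Context.MULTIMODAL if device in MULTIMODAL_DEVICES else Context.VOICE_ONLY


def plan_response(context: Context, summary: str, details: str) -> ResponsePlan:
    if context is Context.MULTIMODAL:
        # Speak the short summary and put the detail on screen.
        return {"speech": summary, "display": details}
    # No screen: everything the user needs has to be spoken.
    return {"speech": f"{summary} {details}", "display": None}


print(plan_response(detect_context("smart_speaker"),
                    "Your order shipped.", "It arrives Friday."))
```

Defaulting unknown devices to voice-only is a design assumption here: a fully spoken response still works if a screen turns out to be present, while the reverse is not true.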

Prompting for Voice-Only Responses

In voice-only contexts, the LLM must produce responses that work entirely without visuals. This means no references to screen elements, no lists that require visual scanning, and no content that only makes sense with formatting.
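
Because a model may still slip visual habits into its output, it can help to check responses before they reach the speech synthesizer. The following is a minimal heuristic sketch, not a complete filter: the function name `voice_only_issues` and the phrase list are assumptions made for this example, and the patterns will produce some false positives (for instance, a sentence that ends in a small number).

```python
import re

# Phrases that assume the user can see something (illustrative, not exhaustive).
VISUAL_PHRASES = re.compile(
    r"\b(?:tap|click|see the|on (?:the|your) screen|button)\b", re.IGNORECASE
)
# Bullets at the start of a line, e.g. "- item" or "* item".
LINE_BULLETS = re.compile(r"^\s*[-*•]\s+", re.MULTILINE)
# Numbered markers such as "1. " or "2) ", at line start or inline.
NUMBERED_MARKERS = re.compile(r"(?:^|\s)\d{1,2}[.)]\s+", re.MULTILINE)
# Characters that usually indicate markdown or table formatting.
FORMATTING_CHARS = re.compile(r"[*_`#|]")


def voice_only_issues(text: str) -> list[str]:
    """Return labels for the voice-unfriendly patterns found in text."""
    issues = []
    if VISUAL_PHRASES.search(text):
        issues.append("visual reference")
    if LINE_BULLETS.search(text) or NUMBERED_MARKERS.search(text):
        issues.append("list markup")
    if FORMATTING_CHARS.search(text):
        issues.append("text formatting")
    return issues


print(voice_only_issues("1. First do X 2. Then do Y"))
print(voice_only_issues("Start by doing X. When that is done, do Y."))
```

A non-empty result could be used to trigger a retry or a rewrite step; how to react is left open here.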

VOICE_ONLY_SYSTEM_PROMPT = (
    'You are a voice-only assistant. The user cannot see any screen.\n\n'
    'Requirements:\n'
    '- Never reference visual elements ("tap here", "see the chart", "the blue button")\n'
    '- Never use numbered or bulleted lists — use spoken sequences instead:\n'
    '  BAD: "1. First do X 2. Then do Y"\n'
    '  GOOD: "Start by doing X. When that is done, do Y."\n'
    '- Limit responses to what can be comfortably spoken in 30 seconds\n'
    '- Offer to give more detail rather than overwhelming the user\n'
    '- Use verbal signposts: "First", "Next", "Finally"\n'
    '- Read out all important data (codes, dates, amounts) as full words'
)
print(VOICE_ONLY_SYSTEM_PROMPT)
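
Two of the requirements in this prompt can also be approximated in code after generation. The sketch below estimates speaking time against the 30-second limit and spells out the digits of a code. It assumes a speaking rate of about 150 words per minute, which is only a rough default that varies by voice and language, and both helper names are hypothetical.

```python
DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}


def estimated_speech_seconds(text: str, words_per_minute: float = 150.0) -> float:
    """Estimate speaking time from word count (assumed rate, not measured)."""
    return len(text.split()) / words_per_minute * 60


def fits_spoken_budget(text: str, max_seconds: float = 30.0) -> bool:
    """Check a response against the 30-second guideline from the prompt."""
    return estimated_speech_seconds(text) <= max_seconds


def spell_out_code(code: str) -> str:
    """Turn a code like 'A4-821' into words, one character at a time."""
    spoken = []
    for char in code:
        if char.isdigit():
            spoken.append(DIGIT_WORDS[char])
        elif char.isalpha():
            spoken.append(char.upper())
        # Separators such as '-' or spaces are skipped.
    return " ".join(spoken)


print(fits_spoken_budget("Your package arrives on Friday."))
print(spell_out_code("A4-821"))
```

At the assumed rate, the 30-second limit corresponds to roughly 75 words, which is a useful rule of thumb when tuning the prompt.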
