Multimodal Agents (Vision + Voice + Action)
Agents that see screens, hear speech, and act in the physical or digital world — the next frontier.
Beyond Text
Modern agents are increasingly multimodal:
- Vision — see screens, images, video
- Voice — talk to and listen to users
- Action — control browsers, robots, GUIs
Vision Agents
We covered this in Course 24. Recap:
- GPT-4o, Claude Sonnet 4.5, Gemini for screen understanding
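- Set-of-Mark (SoM) annotations, which overlay numbered marks on UI elements so the model can refer to a target by its ID, for reliable interaction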
- Computer Use for end-to-end desktop control
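To make the SoM idea concrete, here is a minimal sketch of the bookkeeping around it: number the detected UI elements, show the model a legend of marks, and turn the mark it picks back into click coordinates. The names `Box`, `assign_marks`, `describe_marks`, and `resolve_click` are hypothetical and chosen for illustration. The sketch assumes element bounding boxes are already available (for example from an accessibility tree or a detector) and leaves out drawing the marks on the screenshot and calling the model.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of a UI element, in screen pixels (hypothetical type)."""
    left: int
    top: int
    right: int
    bottom: int
    label: str

    def center(self) -> tuple[int, int]:
        # Integer midpoint of the box, used as the click target.
        return ((self.left + self.right) // 2, (self.top + self.bottom) // 2)


def assign_marks(boxes: list[Box]) -> dict[int, Box]:
    """Number elements in reading order: top to bottom, then left to right."""
    ordered = sorted(boxes, key=lambda b: (b.top, b.left))
    return {mark: box for mark, box in enumerate(ordered, start=1)}


def describe_marks(marks: dict[int, Box]) -> str:
    """Text legend that could accompany the annotated screenshot in a prompt."""
    return "\n".join(f"[{mark}] {box.label}" for mark, box in marks.items())


def resolve_click(marks: dict[int, Box], chosen_mark: int) -> tuple[int, int]:
    """Map the mark ID chosen by the model back to pixel coordinates."""
    if chosen_mark not in marks:
        raise ValueError(f"Unknown mark: {chosen_mark}")
    return marks[chosen_mark].center()


if __name__ == "__main__":
    boxes = [
        Box(200, 10, 300, 50, "Submit"),
        Box(10, 10, 110, 50, "Cancel"),
        Box(10, 100, 210, 140, "Search field"),
    ]
    marks = assign_marks(boxes)
    print(describe_marks(marks))
    # Suppose the model answers "click mark 2".
    print(resolve_click(marks, 2))
```

Asking the model for a mark ID instead of raw coordinates keeps its output in a small, checkable set: an ID that is not in the mapping can be rejected before any click happens.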