Multimodal Agents (Vision + Voice + Action)

Agents that see screens, hear speech, and act in the physical or digital world — the next frontier.

Beyond Text

Modern agents are increasingly multimodal:

Vision — see screens, images, video
Voice — talk to and listen to users
Action — control browsers, robots, GUIs
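As a rough illustration of how the three modalities fit together, here is a minimal sketch of a single agent step. The names Observation, Action, and decide are hypothetical and exist only for this example; the rule-based policy is a stand-in for a call to a multimodal model, which the lesson does not specify.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Observation:
    """What the agent perceives in one step (hypothetical structure)."""
    screenshot: Optional[bytes] = None   # vision input, e.g. PNG bytes
    transcript: Optional[str] = None     # voice input, already transcribed


@dataclass
class Action:
    """What the agent does in response (hypothetical structure)."""
    kind: str                            # e.g. "click" or "speak"
    payload: dict = field(default_factory=dict)


def decide(obs: Observation) -> Action:
    """Placeholder policy: a real agent would query a multimodal model here."""
    if obs.transcript and "open settings" in obs.transcript.lower():
        return Action("click", {"target": "settings"})
    return Action("speak", {"text": "Sorry, I did not catch that."})
```

Calling decide(Observation(transcript="Please open Settings")) returns a click action, while an empty observation falls back to a spoken reply.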

Vision Agents

We covered this in Course 24. Recap:

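Multimodal models such as GPT-4o, Claude Sonnet 4.5, and Gemini for screen understanding
Set-of-Mark (SoM) annotations, which overlay numbered marks on UI elements so the model can refer to them by number, for reliable interaction
Computer Use for end-to-end desktop control (the model views screenshots and issues mouse and keyboard actions)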
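To make the SoM (Set-of-Mark) idea from the recap concrete, here is a minimal sketch of the bookkeeping around it: number the detected UI elements, let the model answer with a mark number, then map that number back to click coordinates. Box, assign_marks, and resolve_click are hypothetical names for this example, the bounding boxes are assumed to come from some upstream detector, and drawing the marks onto the screenshot is left out.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Box:
    """Bounding box of a UI element in screen pixels (hypothetical type)."""
    x: int
    y: int
    width: int
    height: int

    def center(self) -> Tuple[int, int]:
        return (self.x + self.width // 2, self.y + self.height // 2)


def assign_marks(boxes: List[Box]) -> Dict[int, Box]:
    """Number elements in reading order: top to bottom, then left to right."""
    ordered = sorted(boxes, key=lambda b: (b.y, b.x))
    return {mark: box for mark, box in enumerate(ordered, start=1)}


def resolve_click(marks: Dict[int, Box], chosen: int) -> Tuple[int, int]:
    """Turn the mark number chosen by the model into a click position."""
    if chosen not in marks:
        raise ValueError(f"unknown mark {chosen}")
    return marks[chosen].center()
```

Because the model only has to name a mark number rather than guess pixel coordinates, an out-of-range answer can be rejected before any click is issued, which is one reason this style of annotation tends to make interaction more reliable.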
