Multimodal Agents (Vision + Voice + Action)

Agents that see screens, hear speech, and act in the physical or digital world — the next frontier.

Beyond Text

Modern agents are increasingly multimodal:

  • Vision — see screens, images, video
  • Voice — talk to and listen to users
  • Action — control browsers, robots, GUIs

Vision Agents

We covered this in Course 24. Recap:

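  • Vision-capable models such as GPT-4o, Claude Sonnet 4.5, and Gemini for screen understanding
  • Set-of-Mark (SoM) annotations for reliable interaction: numbered marks are overlaid on UI elements so the model picks an element by ID instead of guessing pixel coordinates
  • Computer Use for end-to-end desktop control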
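
To make the SoM idea in the recap concrete, here is a minimal sketch of the bookkeeping around it. It assumes UI elements have already been detected as bounding boxes; the names Element, assign_marks, describe_marks, and click_point are hypothetical and do not come from any particular library. Drawing the marks on the screenshot and calling a vision model are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """A detected UI element (hypothetical structure for illustration)."""
    label: str
    x: int       # left edge, in pixels
    y: int       # top edge, in pixels
    width: int
    height: int


def assign_marks(elements):
    """Number elements in reading order (top to bottom, then left to right)."""
    ordered = sorted(elements, key=lambda e: (e.y, e.x))
    return {mark_id: e for mark_id, e in enumerate(ordered, start=1)}


def describe_marks(marks):
    """Text legend that could accompany the annotated screenshot in a prompt."""
    return "\n".join(f"[{mark_id}] {e.label}" for mark_id, e in marks.items())


def click_point(marks, mark_id):
    """Resolve the mark ID chosen by the model to the element's center pixel."""
    if mark_id not in marks:
        raise KeyError(f"unknown mark id: {mark_id}")
    e = marks[mark_id]
    return (e.x + e.width // 2, e.y + e.height // 2)


# Example data: three elements, given in arbitrary order.
elements = [
    Element("Submit", 200, 300, 100, 40),
    Element("Search box", 50, 20, 300, 30),
    Element("Cancel", 80, 300, 100, 40),
]
marks = assign_marks(elements)
print(describe_marks(marks))
# Suppose the model answers "click mark 3" (the Submit button).
print(click_point(marks, 3))
```

The point of the design is that the model only has to output a small integer. The agent, not the model, converts that ID into coordinates, which is why marks tend to be more reliable than asking for raw pixel positions.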
