
Vision Models for Screen Understanding

Send screenshots to a multimodal model so the agent can 'see' the page the way a human would.

Why Vision?

Some web actions are hard to express with selectors:

CAPTCHAs (a vision model can attempt them, but success is not guaranteed)
Visually positioned elements (chat bubbles, dynamic UIs)
Apps with no stable DOM (Canvas / WebGL)

A multimodal LLM works from the rendered pixels rather than the DOM, so the agent can reason about what is actually displayed.

Take a Screenshot, Send to LLM

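import base64

from openai import OpenAI
from playwright.sync_api import sync_playwright

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto('https://example.com')  # placeholder URL; use the page your agent is driving
    # screenshot() saves the file and also returns the PNG bytes
    png_bytes = page.screenshot(path='screen.png')
    browser.close()

# The API accepts images as a base64 data URL
b64 = base64.b64encode(png_bytes).decode()

response = client.chat.completions.create(
    model='gpt-4o',
    messages=[{
        'role': 'user',
        'content': [
            {'type': 'text', 'text': 'What buttons are visible? Return JSON.'},
            {'type': 'image_url', 'image_url': {'url': f'data:image/png;base64,{b64}'}},
        ],
    }],
)
print(response.choices[0].message.content)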
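
The prompt above asks the model to return JSON, but the reply arrives as plain text and is sometimes wrapped in a Markdown fence. Here is a minimal sketch of one way to read it defensively. The helper name parse_json_reply and the sample reply are hypothetical and only illustrate the idea. The actual shape of the JSON depends on the model and the prompt.

```python
import json
import re


def parse_json_reply(reply: str):
    """Parse JSON from a model reply, tolerating a surrounding Markdown fence.

    Returns the decoded object, or None if the text is not valid JSON.
    """
    text = reply.strip()
    # Models often wrap JSON in a fence such as ```json ... ```; unwrap it if present.
    fenced = re.search(r"```(?:json)?\s*(.*?)\s*```", text, re.DOTALL)
    if fenced:
        text = fenced.group(1)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None  # the caller decides whether to retry or fall back


# Hypothetical reply, shaped like what the prompt above might produce.
sample = '```json\n{"buttons": ["Sign in", "Search"]}\n```'
print(parse_json_reply(sample))
```

With the request above, this could be called as parse_json_reply(response.choices[0].message.content or ""). Returning None instead of raising keeps the agent loop in control: it can re-prompt, take a fresh screenshot, or fall back to selectors.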
