Vision Models for Screen Understanding
Send screenshots to a multimodal model so the agent 'sees' what a human would see.
Why Vision?
Some web tasks are hard to handle with DOM selectors alone:
- Captchas (well, attempts)
- Visually-positioned elements (chat bubbles, dynamic UIs)
- Apps with no stable DOM (Canvas / WebGL)
Multimodal LLMs work from the rendered pixels, so the agent can SEE the screen even when the DOM is no help.
Take a Screenshot, Send to LLM
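import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# `page` is an already-open Playwright page
page.screenshot(path='screen.png')
with open('screen.png', 'rb') as f:
    b64 = base64.b64encode(f.read()).decode()

# Send the prompt and the screenshot (as a base64 data URL) in one user message
response = client.chat.completions.create(
    model='gpt-4o',
    messages=[{
        'role': 'user',
        'content': [
            {'type': 'text', 'text': 'What buttons are visible? Return JSON.'},
            {'type': 'image_url', 'image_url': {'url': f'data:image/png;base64,{b64}'}}
        ]
    }]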
)
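The prompt asks for JSON, but the reply comes back as plain text and may be wrapped in a Markdown code fence or not be valid JSON at all. Below is a minimal sketch of two helpers around the request: one builds the data URL from raw PNG bytes, the other parses the reply defensively. The names `png_to_data_url` and `parse_json_reply` are hypothetical, introduced here for illustration only, and returning `None` on failure is an assumption about how the caller wants to handle errors.

```python
import base64
import json
import re

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the source readable


def png_to_data_url(png_bytes: bytes) -> str:
    """Encode raw PNG bytes as a data URL for an image_url content part."""
    b64 = base64.b64encode(png_bytes).decode("ascii")
    return f"data:image/png;base64,{b64}"


def parse_json_reply(reply_text: str):
    """Parse JSON from a model reply, tolerating an optional Markdown code fence.

    Returns the parsed value, or None if the reply is not valid JSON
    (hypothetical convention: the caller decides whether to retry).
    """
    text = reply_text.strip()
    # Unwrap a surrounding code fence (optionally tagged "json") if present.
    match = re.match(rf"^{FENCE}(?:json)?\s*(.*?)\s*{FENCE}$", text, re.DOTALL)
    if match:
        text = match.group(1)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None
```

With the request above, the reply text is available as `response.choices[0].message.content`, so a call might look like `buttons = parse_json_reply(response.choices[0].message.content)`. If the result is `None`, the agent can re-prompt or take a fresh screenshot instead of acting on a malformed answer.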