Image + Text Agents with Claude Vision and GPT-4V

Sending images in API calls, visual grounding, and image-aware tool use.

Multimodal Agents: Images + Text

A multimodal agent can see as well as read. By sending images alongside text in the same API call, the agent can answer questions about photos, analyse charts, read screenshots, and describe diagrams, all within the same conversation loop.

Sending Images with the OpenAI API

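In the Chat Completions API, OpenAI's vision-capable models (GPT-4o, GPT-4V) accept a content array instead of a plain string for a message. Each element of the array is either a text object or an image_url object, distinguished by its type field. The url inside an image_url object can be a public URL or a base64-encoded data URL.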

from openai import OpenAI

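# With no api_key argument, the client reads the OPENAI_API_KEY environment
# variable, which keeps the secret out of the source file.
client = OpenAI()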

# Using a public URL
response = client.chat.completions.create(
    model='gpt-4o',
    messages=[
        {
            'role': 'user',
            'content': [
                {
                    'type': 'image_url',
                    'image_url': {
                        'url': 'https://example.com/chart.png'
                    }
                },
                {
                    'type': 'text',
                    'text': 'What trend does this chart show?'
                }
            ]
        }
    ],
    max_tokens=512
)
print(response.choices[0].message.content)
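
The request above uses a public URL. For a local file, the same image_url field can carry a base64 data URL instead. The following is a minimal sketch of one way to build such a message; the helper names to_data_url and build_image_message are hypothetical and not part of the OpenAI SDK, and the MIME type is guessed from the file extension rather than from the file contents.

```python
import base64
import mimetypes
from pathlib import Path


def to_data_url(path: str) -> str:
    """Encode a local image file as a base64 data URL (hypothetical helper)."""
    # Guess the MIME type from the file extension, e.g. '.png' -> 'image/png'.
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None or not mime_type.startswith('image/'):
        raise ValueError(f'Not a recognised image type: {path}')
    encoded = base64.b64encode(Path(path).read_bytes()).decode('ascii')
    return f'data:{mime_type};base64,{encoded}'


def build_image_message(path: str, question: str) -> dict:
    """Build a user message with one image part and one text part."""
    return {
        'role': 'user',
        'content': [
            {'type': 'image_url', 'image_url': {'url': to_data_url(path)}},
            {'type': 'text', 'text': question},
        ],
    }
```

The dictionary returned by build_image_message can be placed in the messages list passed to client.chat.completions.create, in place of the URL-based message. Base64 encoding makes the payload roughly a third larger than the file, so large images may be worth resizing before they are sent.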
