AI Prompt Engineering · Lesson

Combining Text and Images

Unified multimodal prompts.

The Unified Multimodal Prompt

A multimodal prompt is not text plus an attached image. It is a single interleaved sequence in which image tokens and text tokens share one context and are related to each other through attention. The model does not 'look at the picture, then read the text'; it processes one fused token stream.

  • Image patches are projected into the same embedding space as text tokens.
  • Ordering matters: in decoder-style models each token attends only to what precedes it, so a question placed before an image is processed differently from one placed after.
  • You are authoring one prompt with two surface forms, not two prompts.
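
To make the shared embedding space concrete, here is a toy sketch in plain Python. It is not how any particular model is implemented: the dimensions, the random weights, and the names `patch_projection`, `embed_text`, and `embed_image` are all assumptions chosen for illustration. It only shows that projected image patches and text embeddings end up as vectors of the same width in one ordered sequence.

```python
import random

D_MODEL = 8     # shared embedding width (toy value, an assumption)
PATCH_DIM = 12  # flattened patch size, e.g. 2x2 pixels x 3 channels (toy value)

rng = random.Random(0)

# Toy text embedding table: one D_MODEL-wide vector per known token.
text_embedding_table = {
    tok: [rng.uniform(-1, 1) for _ in range(D_MODEL)]
    for tok in ["what", "is", "this", "?"]
}

# Toy linear projection (PATCH_DIM rows x D_MODEL columns) for image patches.
patch_projection = [
    [rng.uniform(-1, 1) for _ in range(D_MODEL)] for _ in range(PATCH_DIM)
]


def project(vec, matrix):
    """Multiply a vector by a (len(vec) x n_cols) matrix."""
    n_cols = len(matrix[0])
    return [sum(v * row[j] for v, row in zip(vec, matrix)) for j in range(n_cols)]


def embed_text(tokens):
    """Look up each token and tag the result with its modality."""
    return [("text", text_embedding_table[t]) for t in tokens]


def embed_image(patches):
    """Project each flattened patch into the shared D_MODEL-wide space."""
    return [("image", project(p, patch_projection)) for p in patches]


# Three fake flattened patches stand in for a real image.
patches = [[rng.random() for _ in range(PATCH_DIM)] for _ in range(3)]

# One fused sequence: image first, then the question.
fused = embed_image(patches) + embed_text(["what", "is", "this", "?"])

for kind, vector in fused:
    print(kind, len(vector))
```

Every entry in `fused` has the same width regardless of modality, and swapping the two halves of the concatenation changes only the order of the sequence, which is why ordering is a prompt-design decision.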

Interleaving Order and Anchoring

Where you place the image relative to the instruction changes behavior. Placing the instruction after the image lets the model condition the question on what it has already encoded; placing it before turns the image into evidence for a pre-stated task.
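
The two placements described above can be sketched as a small helper. This is a minimal illustration, assuming the same content-block shape used elsewhere in this lesson; `build_content` and the `photo` placeholder are hypothetical names, and the image source must be replaced with whatever format your API expects.

```python
def build_content(image_source, instruction, instruction_first):
    """Return a content list with the instruction before or after the image."""
    image_block = {'type': 'image', 'source': image_source}
    text_block = {'type': 'text', 'text': instruction}
    if instruction_first:
        # The task is stated up front; the image then serves as evidence for it.
        return [text_block, image_block]
    # The image is encoded first; the question is conditioned on it.
    return [image_block, text_block]


# Hypothetical placeholder, not a real image source format.
photo = {'placeholder': 'receipt.jpg'}
instruction = 'Extract the total amount from this receipt.'

task_first = build_content(photo, instruction, instruction_first=True)
image_first = build_content(photo, instruction, instruction_first=False)

print([block['type'] for block in task_first])
print([block['type'] for block in image_first])
```

Keeping the placement behind one flag makes it easy to run both variants against the same test images and compare the answers, rather than assuming one order is better for your task.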

For multi-image prompts, label each image inline so later text can reference it unambiguously ([Image 1], [Image 2]).

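# img_a and img_b stand in for image source objects. The exact shape depends on
# your provider's API; the base64 form below is an illustrative assumption.
img_a = {'type': 'base64', 'media_type': 'image/jpeg', 'data': '<base64 of photo A>'}
img_b = {'type': 'base64', 'media_type': 'image/jpeg', 'data': '<base64 of photo B>'}

messages = [
  {'role': 'user', 'content': [
    # State the task, then place each label immediately before its image.
    {'type': 'text', 'text': 'You will see two product photos.'},
    {'type': 'text', 'text': 'Image 1:'},
    {'type': 'image', 'source': img_a},
    {'type': 'text', 'text': 'Image 2:'},
    {'type': 'image', 'source': img_b},
    # Ask last, and require the label so the answer is unambiguous.
    {'type': 'text', 'text': 'Which has better lighting? Cite the image label.'}
  ]}
]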
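
The same labeling pattern extends to any number of images. The helper below is a sketch under the assumptions already made in this lesson: the content-block shape is taken from the example, `label_images` is a hypothetical name, and the image sources are placeholders for whatever your API expects.

```python
def label_images(image_sources, intro, question):
    """Build one user message with an 'Image N:' label before each image."""
    content = [{'type': 'text', 'text': intro}]
    for index, source in enumerate(image_sources, start=1):
        # The label comes immediately before the image it names.
        content.append({'type': 'text', 'text': f'Image {index}:'})
        content.append({'type': 'image', 'source': source})
    # The question goes last so it can refer to every label above it.
    content.append({'type': 'text', 'text': question})
    return [{'role': 'user', 'content': content}]


# Hypothetical placeholders, not a real image source format.
sources = [{'placeholder': 'a.jpg'}, {'placeholder': 'b.jpg'}, {'placeholder': 'c.jpg'}]

messages = label_images(
    sources,
    intro='You will see three product photos.',
    question='Which has the best lighting? Cite the image label.',
)

for block in messages[0]['content']:
    print(block['type'], block.get('text', ''))
```

Generating the labels from the loop index keeps them in sync with the image order, which avoids a mislabeled reference if images are added or reordered later.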
