
Cross-Modal Reasoning Patterns

Grounding text claims in images and synthesizing multi-source multimodal context.

Cross-Modal Reasoning

Cross-modal reasoning occurs when an agent must reconcile information from two or more modalities (text, images, charts, tables) that may agree with, contradict, or complement one another. For example, a report's text says revenue grew 20%, but the attached chart shows a flat line.

When the sources conflict, the agent must either decide which source to trust or flag the discrepancy for human review.
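The revenue example can be sketched as a small consistency check. This is a minimal illustration, not a prescribed method: it assumes the chart's data points have already been extracted into a list of numbers (for instance by a vision model), and the function name `reconcile_growth`, the status labels, and the 2-point tolerance are all hypothetical choices.

```python
def reconcile_growth(
    claimed_pct: float,
    chart_values: list[float],
    tolerance_pct: float = 2.0,  # hypothetical tolerance, in percentage points
) -> dict:
    """Compare growth claimed in text with growth implied by chart values."""
    if len(chart_values) < 2:
        raise ValueError("need at least two chart values")
    first, last = chart_values[0], chart_values[-1]
    if first == 0:
        raise ValueError("cannot compute growth from a zero baseline")

    # Growth from the first to the last plotted point, as a percentage.
    observed_pct = (last - first) / abs(first) * 100.0
    gap = abs(claimed_pct - observed_pct)

    # The sketch never picks a winner on its own: a large gap is escalated.
    status = "agree" if gap <= tolerance_pct else "flag_for_review"
    return {
        "status": status,
        "claimed_pct": claimed_pct,
        "observed_pct": observed_pct,
    }


if __name__ == "__main__":
    # Text claims 20% growth; the chart is essentially flat.
    print(reconcile_growth(20.0, [100.0, 100.0, 101.0]))
```

With a claimed 20% and a near-flat series, the gap exceeds the tolerance, so the result is flagged for review rather than silently resolved in favor of either source.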

Text-Image Grounding

Text-image grounding means verifying that claims made in text can be visually confirmed in an accompanying image. For example: does the product description match the product photo? Does the document mention a table that actually appears in the image?

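import base64
import json
import mimetypes

import anthropic


def ground_text_in_image(text_claim: str, image_path: str) -> dict:
    """Ask the model whether an image supports a text claim."""
    # With no api_key argument, the client reads ANTHROPIC_API_KEY
    # from the environment, which avoids hard-coding a secret.
    client = anthropic.Anthropic()

    # Guess the media type from the file extension instead of assuming JPEG.
    media_type, _ = mimetypes.guess_type(image_path)
    if media_type not in {'image/jpeg', 'image/png', 'image/gif', 'image/webp'}:
        raise ValueError(f'Unsupported image type: {media_type}')

    with open(image_path, 'rb') as f:
        b64 = base64.standard_b64encode(f.read()).decode('utf-8')

    prompt = (
        f'Text claim: "{text_claim}"\n\n'
        'Does the image above support, contradict, or partially support this claim?\n'
        'Return only JSON: {"verdict": "support|contradict|partial|insufficient_evidence", '
        '"confidence": <number from 0.0 to 1.0>, "evidence": "..."}'
    )
    response = client.messages.create(
        model='claude-opus-4-5',
        max_tokens=256,
        messages=[{'role': 'user', 'content': [
            # The image block comes first so the prompt's "image above" is accurate.
            {'type': 'image', 'source': {'type': 'base64',
                                         'media_type': media_type, 'data': b64}},
            {'type': 'text', 'text': prompt},
        ]}],
    )
    # Raises json.JSONDecodeError if the reply is not bare JSON.
    return json.loads(response.content[0].text)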
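Calling `json.loads` directly on the reply fails if the model wraps the JSON in a code fence or adds a sentence around it. A more defensive parse might look like the sketch below. The helper name `parse_verdict` is hypothetical, and the 0.0 to 1.0 confidence range is an assumption about how the prompt's `"confidence"` field is meant to be read.

```python
import json
import re

# The verdict labels requested in the prompt of ground_text_in_image.
ALLOWED_VERDICTS = {"support", "contradict", "partial", "insufficient_evidence"}


def parse_verdict(raw: str) -> dict:
    """Extract and validate the verdict JSON from a model reply (hypothetical helper)."""
    # Take everything from the first "{" to the last "}" so that code
    # fences or surrounding prose do not break parsing.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in reply")
    data = json.loads(match.group(0))

    verdict = data.get("verdict")
    if verdict not in ALLOWED_VERDICTS:
        raise ValueError(f"unexpected verdict: {verdict!r}")

    # Assumption: confidence is a probability-like score in [0.0, 1.0].
    confidence = float(data.get("confidence", 0.0))
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")

    return {
        "verdict": verdict,
        "confidence": confidence,
        "evidence": str(data.get("evidence", "")),
    }


if __name__ == "__main__":
    reply = '```json\n{"verdict": "contradict", "confidence": 0.9, "evidence": "flat line"}\n```'
    print(parse_verdict(reply))
```

Rejecting unknown verdicts and out-of-range scores keeps a malformed reply from being treated as evidence; a caller could catch the `ValueError` and route the item to human review.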
