Cross-Modal Reasoning Patterns
Grounding text claims in images and synthesizing multi-source multimodal context.
Cross-Modal Reasoning
Cross-modal reasoning occurs when an agent must reconcile information from two or more modalities (text, images, charts, tables) that may agree, contradict, or complement each other. For example, a report's text says revenue grew 20%, but the attached chart shows a flat line.
The agent must decide which source is correct, or flag the discrepancy for human review.
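The decide-or-flag step can be sketched in plain Python. This is a minimal illustration, not a prescribed method: the `Reading` structure, the `reconcile` function, and the `tolerance` and `min_margin` thresholds are all hypothetical names and values chosen for the example, and it assumes each modality has already been reduced to a number with a self-reported confidence.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    """One modality's value for the same quantity (hypothetical structure)."""
    source: str        # e.g. "report_text" or "chart"
    value: float       # e.g. revenue growth in percent
    confidence: float  # extractor's self-reported confidence, 0.0 to 1.0


def reconcile(a: Reading, b: Reading, tolerance: float = 2.0,
              min_margin: float = 0.3) -> dict:
    """Return agreement, a preferred source, or a human-review flag."""
    gap = abs(a.value - b.value)
    # Values within the tolerance are treated as agreeing.
    if gap <= tolerance:
        return {"status": "agree", "value": (a.value + b.value) / 2}
    # Otherwise prefer one source only if it is clearly more confident.
    stronger, weaker = (a, b) if a.confidence >= b.confidence else (b, a)
    if stronger.confidence - weaker.confidence >= min_margin:
        return {"status": "prefer", "source": stronger.source,
                "value": stronger.value, "gap": gap}
    # No clear winner: surface the discrepancy instead of guessing.
    return {"status": "needs_review", "gap": gap}


# The report text claims 20% growth; the chart reads as roughly flat.
text = Reading("report_text", 20.0, 0.9)
chart = Reading("chart", 0.5, 0.8)
result = reconcile(text, chart)
print(result)
```

Here both sources are about equally confident and far apart, so the sketch flags the pair for review rather than picking a side. The thresholds are placeholders and would need tuning for any real workload.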
Text-Image Grounding
Text-image grounding means verifying that claims made in text can be visually confirmed in an accompanying image. For example: does the product description match the product photo? Does the document mention a table that actually appears in the image?
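import base64
import json
import mimetypes

import anthropic


def ground_text_in_image(text_claim: str, image_path: str) -> dict:
    """Ask the model whether an image supports a text claim; return the parsed verdict."""
    # Placeholder key; omit the argument to read ANTHROPIC_API_KEY from the environment.
    client = anthropic.Anthropic(api_key='YOUR_API_KEY')

    # The API expects the image as a base64-encoded string.
    with open(image_path, 'rb') as f:
        b64 = base64.standard_b64encode(f.read()).decode('utf-8')
    # Guess the media type from the file extension instead of assuming JPEG.
    media_type = mimetypes.guess_type(image_path)[0] or 'image/jpeg'

    prompt = (
        f'Text claim: "{text_claim}"\n\n'
        'Does the image above support, contradict, or partially support this claim?\n'
        'Return JSON: {"verdict": "support|contradict|partial|insufficient_evidence", '
        '"confidence": 0.0, "evidence": "..."}'
    )
    response = client.messages.create(
        model='claude-opus-4-5',
        max_tokens=256,
        # The image block comes first so the prompt can refer to "the image above".
        messages=[{'role': 'user', 'content': [
            {'type': 'image', 'source': {'type': 'base64',
                                         'media_type': media_type, 'data': b64}},
            {'type': 'text', 'text': prompt},
        ]}],
    )
    # json.loads raises ValueError if the reply is not bare JSON.
    return json.loads(response.content[0].text)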
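Because the function passes the model's reply straight to `json.loads`, a reply that wraps the JSON in extra prose, or uses a verdict outside the four requested values, can break downstream logic. One possible guard is sketched below; `parse_verdict` and `ALLOWED_VERDICTS` are hypothetical names introduced for this example, and the sample reply is made up for illustration rather than taken from a real API response.

```python
import json

# The four verdict values requested in the grounding prompt.
ALLOWED_VERDICTS = {"support", "contradict", "partial", "insufficient_evidence"}


def parse_verdict(raw: str) -> dict:
    """Extract and validate the verdict object from a model reply."""
    # Take the span from the first "{" to the last "}" to skip surrounding prose.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in reply")
    data = json.loads(raw[start:end + 1])
    if data.get("verdict") not in ALLOWED_VERDICTS:
        raise ValueError(f"unexpected verdict: {data.get('verdict')!r}")
    # Clamp confidence into [0.0, 1.0] in case the model overshoots.
    confidence = float(data.get("confidence", 0.0))
    data["confidence"] = min(1.0, max(0.0, confidence))
    data.setdefault("evidence", "")
    return data


# Illustrative reply with leading prose and an out-of-range confidence.
reply = ('Here is my assessment:\n'
         '{"verdict": "contradict", "confidence": 1.2, '
         '"evidence": "The line is flat."}')
parsed = parse_verdict(reply)
print(parsed)
```

Raising on an unknown verdict, rather than silently mapping it to a default, keeps a malformed reply from being mistaken for a real judgment; a caller could catch the `ValueError` and route the item to human review.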