Visual Question Answering
Asking specific questions about image content, quantities, and attributes.
Visual Question Answering (VQA) is the task of answering natural-language questions about an image. Unlike image description, which tries to cover everything in the scene, VQA directs the model to answer one specific question.
VQA prompts are precise and direct, and they often require counting, identifying, comparing, or reasoning about visual content. The wording of the prompt largely determines whether you get an exact, usable answer or a vague, general response.
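As a minimal sketch of what "precise and direct" can look like for the four question types above, the snippet below pairs each type with a question template and an answer-format instruction. The names VQA_TEMPLATES and build_vqa_prompt, and the template wording itself, are illustrative assumptions and not part of any library.

```python
# Hypothetical templates: one (question, answer format) pair per question type.
VQA_TEMPLATES = {
    'count': ('How many {target} are visible in this image?',
              'Answer with a single integer only.'),
    'identify': ('What is the {target} in this image?',
                 'Answer in five words or fewer.'),
    'compare': ('Which is larger in this image: {target}?',
                'Answer with one option only.'),
    'reason': ('Based on what is visible, {target}?',
               'Answer in one sentence and cite the visual evidence.'),
}


def build_vqa_prompt(kind, target):
    """Return the question and its answer-format instruction as one prompt string."""
    question, answer_format = VQA_TEMPLATES[kind]
    # A blank line separates the question from the format instruction.
    return f'{question.format(target=target)}\n\n{answer_format}'


print(build_vqa_prompt('count', 'red cars'))
```

Each template names exactly what to look for and constrains the shape of the reply, which is what separates a VQA prompt from an open-ended "describe this image" request.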
Basic VQA Prompt Structure
A VQA prompt pairs an image with a specific question. The key is making the question precise enough to produce a direct, usable answer:
import base64
import mimetypes

import anthropic

client = anthropic.Anthropic(api_key='YOUR_API_KEY')

def ask_about_image(image_path, question, answer_format='Direct answer. No extra explanation.'):
    # Read the image and encode it as base64 for the API request.
    with open(image_path, 'rb') as f:
        img_b64 = base64.standard_b64encode(f.read()).decode('utf-8')
    # Guess the media type from the file extension; fall back to JPEG.
    media_type = mimetypes.guess_type(image_path)[0] or 'image/jpeg'
    # Put the answer-format instruction after the question.
    prompt = f'{question}\n\n{answer_format}'
    r = client.messages.create(
        model='claude-opus-4-5', max_tokens=150,
        messages=[{'role': 'user', 'content': [
            {'type': 'image', 'source': {'type': 'base64', 'media_type': media_type, 'data': img_b64}},
            {'type': 'text', 'text': prompt}
        ]}]
    )
    return r.content[0].text

# Example VQA calls (replace image.jpg with actual image)
print('VQA function defined. Ready for image questions.')
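A direct answer is easiest to use when the answer format matches what the calling code expects. The sketch below shows one way a counting question might be asked and its reply turned into an integer. The helpers parse_count and count_objects are hypothetical names introduced for illustration; the ask argument is assumed to be any callable with the same signature as ask_about_image, so the example runs here with a stub in place of a real API call.

```python
import re


def parse_count(answer):
    """Return the first integer found in a model reply, or None if there is none."""
    match = re.search(r'-?\d+', answer)
    return int(match.group()) if match else None


def count_objects(ask, image_path, thing):
    """Ask a counting question via `ask` and parse the reply as an integer.

    `ask` is assumed to take (image_path, question, answer_format) and return text.
    """
    question = f'How many {thing} are visible in this image?'
    reply = ask(image_path, question, 'Answer with a single integer only.')
    return parse_count(reply)


# Stub standing in for a real model call, so the sketch runs offline.
def fake_ask(image_path, question, answer_format):
    return '3'


print(count_objects(fake_ask, 'image.jpg', 'people'))
```

With a real image on disk, the same helper could be called as count_objects(ask_about_image, 'image.jpg', 'people'). Returning None when no integer is found lets the caller decide whether to retry or rephrase the question.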