Video Understanding in Agents
Frame extraction, video summarization, and temporal reasoning over clips.
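Many LLM APIs do not accept video files directly, but an agent can still analyse video by sampling representative frames and sending them to a vision-capable model as images. The agent then reasons over the ordered frames to detect events, changes, and patterns over time. Keeping a timestamp with each frame lets the model say when something happened, not only what happened.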
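One simple way to choose representative frames is to spread a fixed frame budget evenly across the clip. The sketch below is illustrative only: the helper names sample_frame_indices and indices_to_timestamps are hypothetical, and even spacing is just one possible strategy (scene-change detection is another).

```python
def sample_frame_indices(frame_count: int, num_frames: int) -> list[int]:
    """Pick up to num_frames frame indices spread evenly across the clip."""
    if frame_count <= 0 or num_frames <= 0:
        return []
    # Never ask for more frames than the clip contains.
    n = min(num_frames, frame_count)
    # Split the clip into n equal segments and take the midpoint of each,
    # which avoids always sampling the very first and very last frame.
    return [int((i + 0.5) * frame_count / n) for i in range(n)]


def indices_to_timestamps(indices: list[int], fps: float) -> list[float]:
    """Convert frame indices to timestamps in seconds (assumes a constant frame rate)."""
    if fps <= 0:
        raise ValueError('fps must be positive')
    return [round(i / fps, 2) for i in indices]


indices = sample_frame_indices(frame_count=300, num_frames=5)
print(indices)                               # [30, 90, 150, 210, 270]
print(indices_to_timestamps(indices, 30.0))  # [1.0, 3.0, 5.0, 7.0, 9.0]
```

A fixed budget keeps the number of images per request predictable, whatever the clip length. The trade-off is that short events falling between samples can be missed.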
Installing OpenCV for Frame Extraction
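OpenCV (imported as cv2) is a widely used Python library for video processing. It lets you open video files, read their metadata (frame rate, resolution, frame count), and extract individual frames as NumPy arrays. The opencv-python-headless package omits the GUI components, which suits agents running on servers.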
# pip install opencv-python-headless
import cv2

def get_video_metadata(video_path: str) -> dict:
    """Return frame rate, frame count, resolution, and duration for a video file."""
    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        raise IOError(f'Cannot open video: {video_path}')
    try:
        fps = cap.get(cv2.CAP_PROP_FPS)
        # The frame count comes from container metadata and may be approximate.
        frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
        height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    finally:
        # Release the file handle even if reading a property fails.
        cap.release()
    # Guard against files that report a frame rate of zero.
    duration_s = frame_count / fps if fps > 0 else 0
    return {
        'fps': round(fps, 2),
        'frame_count': frame_count,
        'width': width,
        'height': height,
        'duration_seconds': round(duration_s, 2)
    }

meta = get_video_metadata('recording.mp4')
print(meta)