Video Understanding in Agents

Frame extraction, video summarization, and temporal reasoning over clips.

Video Understanding for Agents

Most LLMs cannot ingest video files directly, but an agent can still analyse video by extracting representative frames and sending them to a vision-capable model as images. The agent then reasons over the ordered frame sequence to detect events, changes, and patterns over time.
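One simple way to choose "representative" frames is uniform sampling: spread a fixed number of frame indices evenly across the clip. The sketch below is a minimal illustration under that assumption, not a prescribed method. The function names `sample_frame_indices` and `index_to_seconds` and the default of 8 frames are hypothetical choices for this example. It needs only a frame count and a frame rate, so it runs without any video library.

```python
def sample_frame_indices(frame_count: int, num_frames: int = 8) -> list[int]:
    """Return up to num_frames indices spread evenly over [0, frame_count)."""
    if frame_count <= 0 or num_frames <= 0:
        return []
    n = min(num_frames, frame_count)
    # Take the midpoint of each of n equal segments so the samples
    # are not biased toward the very first or very last frame.
    return [int((i + 0.5) * frame_count / n) for i in range(n)]


def index_to_seconds(index: int, fps: float) -> float:
    """Convert a frame index to a timestamp; returns 0.0 if fps is unknown."""
    return round(index / fps, 2) if fps > 0 else 0.0


# A 10-second clip at 30 fps, sampled at 4 points.
indices = sample_frame_indices(frame_count=300, num_frames=4)
timestamps = [index_to_seconds(i, fps=30.0) for i in indices]
print(indices)     # [37, 112, 187, 262]
print(timestamps)  # [1.23, 3.73, 6.23, 8.73]
```

Keeping a timestamp next to each sampled index lets the agent describe when something happens in the clip, not just in which image it appears. Uniform sampling can miss short events, so a denser sample may be needed for fast-changing footage.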

Installing OpenCV for Frame Extraction

OpenCV (imported as cv2) is a widely used third-party library for video processing in Python. It lets you open video files, read their metadata (frame rate, resolution, frame count), and extract individual frames as NumPy arrays.

# pip install opencv-python-headless
import cv2

def get_video_metadata(video_path: str) -> dict:
    """Return frame rate, frame count, resolution, and duration of a video file."""
    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        raise IOError(f'Cannot open video: {video_path}')

    # cap.get() returns floats, so integer properties are cast explicitly.
    fps = cap.get(cv2.CAP_PROP_FPS)
    frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    # Guard against a missing frame rate (reported as 0) to avoid dividing by zero.
    duration_s = frame_count / fps if fps > 0 else 0
    cap.release()  # free the underlying file handle

    return {
        'fps': round(fps, 2),
        'frame_count': frame_count,
        'width': width,
        'height': height,
        'duration_seconds': round(duration_s, 2)
    }

meta = get_video_metadata('recording.mp4')
print(meta)
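
Metadata alone gives the agent nothing to look at, so the next step is pulling out the frames themselves. As a hedged sketch, the function below reads a video sequentially and keeps every `step`-th frame as JPEG bytes, which can then be base64-encoded for an image-capable model. The names `extract_frames`, `to_base64`, `step`, and `max_frames`, and the JPEG quality of 80, are assumptions made for illustration. How images are attached to a request depends on the model API and is not shown here.

```python
import base64


def extract_frames(video_path: str, step: int = 30, max_frames: int = 16) -> list[dict]:
    """Keep every `step`-th frame as JPEG bytes, up to `max_frames` frames."""
    import cv2  # imported here so the helpers below load even without OpenCV

    if step < 1:
        raise ValueError('step must be >= 1')
    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        raise IOError(f'Cannot open video: {video_path}')

    fps = cap.get(cv2.CAP_PROP_FPS)
    frames = []
    index = 0
    try:
        while len(frames) < max_frames:
            ok, frame = cap.read()  # frame is a BGR NumPy array
            if not ok:
                break  # end of stream or decode failure
            if index % step == 0:
                # Compress to JPEG to keep the payload small.
                encoded_ok, buf = cv2.imencode(
                    '.jpg', frame, [cv2.IMWRITE_JPEG_QUALITY, 80]
                )
                if encoded_ok:
                    frames.append({
                        'index': index,
                        'seconds': round(index / fps, 2) if fps > 0 else None,
                        'jpeg': buf.tobytes(),
                    })
            index += 1
    finally:
        cap.release()  # always free the file handle
    return frames


def to_base64(jpeg_bytes: bytes) -> str:
    """Encode raw JPEG bytes as an ASCII base64 string."""
    return base64.b64encode(jpeg_bytes).decode('ascii')
```

A call such as `extract_frames('recording.mp4', step=30)` would keep roughly one frame per second for a 30 fps clip. Reading sequentially is slower than seeking, but it avoids relying on frame-accurate seeking, which is not equally reliable across codecs.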
