AI Prompt Engineering · Lesson

Evaluating DSPy Pipelines

Metrics, dev sets, and the dspy.Evaluate class for automated assessment.

Why Evaluation Matters in DSPy

DSPy optimization is only as good as your evaluation. The optimizer tunes the program toward whatever the metric rewards, so a weak metric can yield a compiled program that scores well on that metric yet fails in production. A proper evaluation harness lets you compare unoptimized and optimized programs on the same data and catch regressions when you update your pipeline.
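
To make the "weak metric" risk concrete, here is a minimal sketch in plain Python. The names contains_metric, exact_metric, and _norm are hypothetical, and SimpleNamespace objects stand in for a dspy.Example and a program's prediction so the snippet runs without a language model. It assumes the (example, pred, trace=None) metric signature that DSPy metrics conventionally use.

```python
from types import SimpleNamespace

def _norm(text):
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())

def contains_metric(example, pred, trace=None):
    """Lenient: passes if the gold answer appears anywhere in the prediction."""
    return _norm(example.answer) in _norm(pred.answer)

def exact_metric(example, pred, trace=None):
    """Strict: passes only if the normalized prediction equals the gold answer."""
    return _norm(example.answer) == _norm(pred.answer)

# Stand-ins for a labeled example and a prediction (hypothetical values).
example = SimpleNamespace(question="What is 7 * 8?", answer="56")
hedged_pred = SimpleNamespace(answer="It could be 54, 56, or 58.")

print(contains_metric(example, hedged_pred))  # True
print(exact_metric(example, hedged_pred))     # False
```

The lenient metric rewards an answer that merely mentions the right number, so a program optimized against it could learn to list several guesses. Which metric is appropriate depends on the task; the point is only that the two can disagree on the same output.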

The dspy.Evaluate Class

dspy.Evaluate runs your program over a dataset, applies a metric, and reports aggregate scores. It supports parallelism via num_threads for fast evaluation over large datasets.
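
Conceptually, the run-score-aggregate loop described above can be sketched in a few lines of plain Python. This is an illustration of the idea only, not DSPy's actual implementation: run_eval, toy_program, and exact_match are hypothetical names, and the "program" is a lookup table rather than a language model call.

```python
from concurrent.futures import ThreadPoolExecutor
from types import SimpleNamespace

def run_eval(program, devset, metric, num_threads=4):
    """Score `program` on every example and return the mean as a percentage."""
    def score_one(example):
        pred = program(question=example.question)
        return float(metric(example, pred))

    # Examples are independent, so they can be scored in parallel threads.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        scores = list(pool.map(score_one, devset))
    return 100.0 * sum(scores) / len(scores) if scores else 0.0

# Toy stand-ins: a lookup-table "program" that gets one answer wrong.
ANSWERS = {"What is 7 * 8?": "56", "Name the largest planet.": "Saturn"}

def toy_program(question):
    return SimpleNamespace(answer=ANSWERS[question])

def exact_match(example, pred, trace=None):
    return example.answer == pred.answer

toy_devset = [
    SimpleNamespace(question="What is 7 * 8?", answer="56"),
    SimpleNamespace(question="Name the largest planet.", answer="Jupiter"),
]

print(run_eval(toy_program, toy_devset, exact_match))  # 50.0
```

Threads suit this workload because a real program spends most of its time waiting on network calls to the model, so several examples can be in flight at once.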

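import dspy

# Assumption: an LM is configured before the program runs; the model name is a placeholder.
dspy.configure(lm=dspy.LM('openai/gpt-4o-mini'))

# Build a devset of labeled examples
devset = [
    dspy.Example(question='What is 7 * 8?', answer='56').with_inputs('question'),
    dspy.Example(question='Name the largest planet.', answer='Jupiter').with_inputs('question'),
    # ... more examples
]

# Metric: DSPy calls it with the gold example, the prediction, and an optional trace
def exact_match_metric(example, pred, trace=None):
    return example.answer.strip().lower() == pred.answer.strip().lower()

# Program to evaluate (a minimal single-step predictor, for illustration)
my_program = dspy.Predict('question -> answer')

# Create evaluator
evaluate = dspy.Evaluate(
    devset=devset,
    metric=exact_match_metric,  # Your metric function
    num_threads=4,              # Parallel evaluation
    display_progress=True,      # Show progress bar
    display_table=True,         # Show per-example results
)

# Run. The aggregate is a percentage (0-100). Newer DSPy releases wrap it in a
# result object with a .score attribute; older releases return the number itself.
result = evaluate(my_program)
score = getattr(result, 'score', result)
print(f'Overall score: {score:.1f}%')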
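
Once the same evaluator has produced a score for both a baseline program and an updated one, a small guard can flag regressions. The helper below is a hypothetical sketch in plain Python, not part of DSPy. It assumes both scores are percentages measured on the same devset, and the tolerance of one point is an arbitrary illustrative choice.

```python
def check_regression(baseline_score, candidate_score, tolerance=1.0):
    """Return (ok, delta); ok is False when the candidate drops by more than `tolerance` points."""
    delta = candidate_score - baseline_score
    return delta >= -tolerance, delta

# Hypothetical scores on the same devset, in percentage points.
ok, delta = check_regression(baseline_score=62.0, candidate_score=71.5)
print(ok, delta)  # True 9.5
```

A check like this could run in CI after each pipeline change. On a small devset, a difference of a point or two may be noise, so the tolerance should reflect how many examples the score is averaged over.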
