Evaluating DSPy Pipelines
Metrics, dev sets, and the dspy.Evaluate class for automated assessment.
Why Evaluation Matters in DSPy
DSPy optimization is only as good as your evaluation. A weak metric produces a compiled program that scores well on that metric but fails in production. A proper evaluation harness lets you compare unoptimized vs optimized programs and catch regressions when you update your pipeline.
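To make the weak-versus-strict distinction concrete, here is a minimal sketch of two metrics. It assumes the usual DSPy metric shape of `(example, pred, trace=None)`; the `normalize` helper, the `loose_metric` name, and the `SimpleNamespace` stand-ins for examples and predictions are illustrative assumptions, not part of DSPy.

```python
import string
from types import SimpleNamespace


def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    table = str.maketrans('', '', string.punctuation)
    return ' '.join(text.lower().translate(table).split())


def loose_metric(example, pred, trace=None):
    """Weak (hypothetical) metric: passes if the gold answer appears anywhere."""
    return normalize(example.answer) in normalize(pred.answer)


def exact_match_metric(example, pred, trace=None):
    """Stricter metric: normalized prediction must equal the normalized gold answer."""
    return normalize(example.answer) == normalize(pred.answer)


# Stand-ins for a labeled example and two program outputs (assumed for illustration).
gold = SimpleNamespace(question='What is 7 * 8?', answer='56')
wrong = SimpleNamespace(answer='560')
right = SimpleNamespace(answer=' 56. ')

print(loose_metric(gold, wrong), exact_match_metric(gold, wrong))
print(loose_metric(gold, right), exact_match_metric(gold, right))
```

The loose metric accepts the wrong answer '560' because it contains '56', which is the kind of gap an optimizer can exploit. The stricter metric rejects it while still tolerating formatting noise.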
The dspy.Evaluate Class
dspy.Evaluate runs your program over a dataset, applies a metric, and reports aggregate scores. It supports parallelism via num_threads for fast evaluation over large datasets.
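Conceptually, the evaluator is a loop over the dataset with a metric call and an average at the end. The sketch below illustrates that idea in plain Python. It is not DSPy's implementation: `run_evaluation`, `toy_program`, and `exact_match` are hypothetical names, and the thread pool only mimics the role that num_threads plays.

```python
from concurrent.futures import ThreadPoolExecutor
from types import SimpleNamespace


def run_evaluation(program, devset, metric, num_threads=4):
    """Score every example in parallel and return the mean score as a fraction."""
    def score_one(example):
        pred = program(example.question)
        return float(metric(example, pred))

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        scores = list(pool.map(score_one, devset))
    return sum(scores) / len(scores) if scores else 0.0


def exact_match(example, pred, trace=None):
    """Case-insensitive exact match on the answer field."""
    return example.answer.strip().lower() == pred.answer.strip().lower()


def toy_program(question):
    """Stand-in for a DSPy program: only knows one canned answer."""
    canned = {'What is 7 * 8?': '56'}
    return SimpleNamespace(answer=canned.get(question, 'unknown'))


devset = [
    SimpleNamespace(question='What is 7 * 8?', answer='56'),
    SimpleNamespace(question='Name the largest planet.', answer='Jupiter'),
]

score = run_evaluation(toy_program, devset, exact_match, num_threads=2)
print(f'Overall score: {score:.1%}')
```

Because each example is scored independently, the work parallelizes cleanly across threads, which matters when every program call is a slow language model request.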
import dspy

# Build a devset of labeled examples
devset = [
    dspy.Example(question='What is 7 * 8?', answer='56').with_inputs('question'),
    dspy.Example(question='Name the largest planet.', answer='Jupiter').with_inputs('question'),
    # ... more examples
]

# Create evaluator
evaluate = dspy.Evaluate(
    devset=devset,
    metric=exact_match_metric,  # Your metric function
    num_threads=4,              # Parallel evaluation
    display_progress=True,      # Show progress bar
    display_table=True,         # Show per-example results
)

# Run
score = evaluate(my_program)
# Note: depending on the DSPy version, evaluate() may return a percentage
# (0-100) or a result object rather than a fraction; adjust the format if so.
print(f'Overall score: {score:.1%}')