Evaluating the Decision
Measuring quality and cost.
You Cannot Decide What You Cannot Measure
The prompt-vs-tune-vs-hybrid choice is only as good as the evaluation behind it. Without a frozen eval set and a cost model, every comparison is anecdote.
- Quality and cost are two separate axes; never collapse them into one number prematurely
- The eval set must be held out and stable across every approach you compare
- The winner is the approach at the point on the quality-cost frontier that best fits your constraints
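One way to make the frontier idea concrete is the minimal sketch below. It assumes each approach has already been reduced to one quality score (higher is better) and one cost figure (lower is better). The names `Candidate`, `pareto_frontier`, and `pick` are hypothetical, and the numbers are placeholders for illustration, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    quality: float  # e.g. accuracy on the frozen eval set (higher is better)
    cost: float     # e.g. dollars per 1,000 requests (lower is better)


def pareto_frontier(candidates):
    """Keep candidates that no other candidate beats on both axes."""
    frontier = []
    for c in candidates:
        dominated = any(
            o.quality >= c.quality and o.cost <= c.cost
            and (o.quality > c.quality or o.cost < c.cost)
            for o in candidates
        )
        if not dominated:
            frontier.append(c)
    return sorted(frontier, key=lambda c: c.cost)


def pick(candidates, min_quality, max_cost):
    """Cheapest frontier point that meets both constraints, or None."""
    feasible = [
        c for c in pareto_frontier(candidates)
        if c.quality >= min_quality and c.cost <= max_cost
    ]
    return feasible[0] if feasible else None


# Placeholder numbers for illustration only, not real results.
candidates = [
    Candidate("prompt_only", quality=0.82, cost=4.0),
    Candidate("fine_tuned", quality=0.90, cost=1.5),
    Candidate("hybrid", quality=0.93, cost=2.5),
    Candidate("long_prompt", quality=0.80, cost=6.0),
]
frontier_names = [c.name for c in pareto_frontier(candidates)]
choice = pick(candidates, min_quality=0.92, max_cost=3.0)
```

Keeping the two axes separate until `pick` is called means the constraints, not an arbitrary weighting, decide the winner. Changing `min_quality` or `max_cost` can change the choice without re-running any evaluation.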
Build the Frozen Eval Set First
Before comparing anything, construct a held-out eval set that no approach trains on. It must cover the real distribution: common cases, known edge cases, and adversarial inputs in roughly production proportions.
Freeze it. Every approach - prompt-only, tuned, hybrid - is scored on the identical set. If the eval shifts between comparisons, the numbers are not comparable and the decision is invalid.
import random

def split_eval(labeled, holdout_ratio=0.2, seed=42):
    """Split labeled examples into a training pool and a frozen eval set."""
    rng = random.Random(seed)  # fixed seed = reproducible split
    data = labeled[:]          # copy so the caller's list is not shuffled
    rng.shuffle(data)
    cut = int(len(data) * (1 - holdout_ratio))
    train, frozen_eval = data[:cut], data[cut:]
    return train, frozen_eval  # eval never enters any training run
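A possible way to enforce the "freeze it" rule is to fingerprint the eval set once and refuse to score anything if the fingerprint later changes. The sketch below rests on several assumptions: examples are JSON-serializable dicts with hypothetical `"input"` and `"label"` keys, each approach is a plain callable from input to prediction, and exact-match accuracy stands in for whatever quality metric you actually use.

```python
import hashlib
import json


def fingerprint(eval_set):
    """SHA-256 over the serialized examples; any edit or reorder changes it."""
    payload = json.dumps(eval_set, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def score_all(approaches, eval_set, expected_fingerprint):
    """Score every approach on the identical frozen set."""
    if not eval_set:
        raise ValueError("eval set is empty")
    if fingerprint(eval_set) != expected_fingerprint:
        raise ValueError("eval set changed since it was frozen; scores are not comparable")
    results = {}
    for name, predict in approaches.items():
        correct = sum(predict(ex["input"]) == ex["label"] for ex in eval_set)
        results[name] = correct / len(eval_set)  # exact-match accuracy (assumed metric)
    return results


# Toy examples and stand-in "approaches", purely for illustration.
frozen_eval = [
    {"input": "2+2", "label": "4"},
    {"input": "3+3", "label": "6"},
]
frozen_id = fingerprint(frozen_eval)  # record this once, when the set is frozen

approaches = {
    "always_four": lambda text: "4",
    "lookup": lambda text: {"2+2": "4", "3+3": "6"}[text],
}
scores = score_all(approaches, frozen_eval, frozen_id)
```

Storing `frozen_id` alongside the reported scores lets anyone confirm later that two results were computed on the same examples.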