AI Prompt Engineering · Lesson

Evaluating the Decision

Measuring quality and cost.

You Cannot Decide What You Cannot Measure

The prompt-vs-tune-vs-hybrid choice is only as good as the evaluation behind it. Without a frozen eval set and a cost model, every comparison is anecdote.

Quality and cost are two axes - never collapse them into one number prematurely
The eval set must be held out and stable across every approach you compare
The winner is the approach at the best point on the quality-cost frontier that still meets your constraints
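One way to keep the two axes separate is to compute the quality-cost frontier explicitly and apply constraints only at the end. The sketch below is illustrative, not a prescribed method: the `Candidate` fields, the `pick` policy (cheapest feasible point) and every number are assumptions made up for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    name: str
    quality: float       # assumed: score on the frozen eval set, higher is better
    cost_per_1k: float   # assumed: cost per 1,000 requests, lower is better

def pareto_frontier(candidates):
    """Keep the candidates that no other candidate beats on both axes."""
    frontier = []
    for c in candidates:
        dominated = any(
            o.quality >= c.quality
            and o.cost_per_1k <= c.cost_per_1k
            and (o.quality > c.quality or o.cost_per_1k < c.cost_per_1k)
            for o in candidates
        )
        if not dominated:
            frontier.append(c)
    return sorted(frontier, key=lambda c: c.cost_per_1k)

def pick(candidates, min_quality, max_cost_per_1k):
    """Return the cheapest frontier point meeting both constraints, or None."""
    feasible = [
        c for c in pareto_frontier(candidates)
        if c.quality >= min_quality and c.cost_per_1k <= max_cost_per_1k
    ]
    return feasible[0] if feasible else None

# Made-up numbers, for illustration only.
candidates = [
    Candidate("prompt-only", quality=0.82, cost_per_1k=4.0),
    Candidate("tuned", quality=0.90, cost_per_1k=1.5),
    Candidate("hybrid", quality=0.93, cost_per_1k=3.0),
]
frontier_names = [c.name for c in pareto_frontier(candidates)]
choice = pick(candidates, min_quality=0.92, max_cost_per_1k=5.0)
```

With these invented numbers, prompt-only falls off the frontier because the tuned candidate is both better and cheaper, and the choice between tuned and hybrid depends entirely on the quality floor you set. Picking the cheapest feasible point is only one possible policy; a team that values headroom might pick the highest-quality feasible point instead.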

Build the Frozen Eval Set First

Before comparing anything, construct a held-out eval set that no approach trains on. It must cover the real distribution: common cases, known edge cases, and adversarial inputs in roughly production proportions.

Freeze it. Every approach - prompt-only, tuned, hybrid - is scored on the identical set. If the eval shifts between comparisons, the numbers are not comparable and the decision is invalid.

import random

def split_eval(labeled, holdout_ratio=0.2, seed=42):
    """Shuffle a copy of `labeled` and split it into (train, frozen_eval)."""
    rng = random.Random(seed)        # fixed seed = reproducible split
    data = list(labeled)             # copy so the caller's list is not reordered
    rng.shuffle(data)
    cut = int(len(data) * (1 - holdout_ratio))
    train, frozen_eval = data[:cut], data[cut:]
    return train, frozen_eval        # eval never enters any training run
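Once the split exists, the freeze can be enforced instead of trusted. A minimal sketch, assuming the eval examples are JSON-serializable dicts with `"input"` and `"label"` keys and that exact match is the metric: record a fingerprint when the set is frozen, then verify it before scoring any approach. The helper names `fingerprint`, `assert_frozen` and `score_all` are hypothetical, not part of any library.

```python
import hashlib
import json

def fingerprint(eval_set):
    """Return a SHA-256 hex digest of the eval set's canonical JSON form."""
    # Order-sensitive on purpose: reordering examples also counts as a change.
    canonical = json.dumps(eval_set, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def assert_frozen(eval_set, expected):
    """Raise ValueError if the eval set differs from the recorded fingerprint."""
    actual = fingerprint(eval_set)
    if actual != expected:
        raise ValueError(f"eval set changed: {actual[:12]} != {expected[:12]}")

def score_all(approaches, eval_set, expected):
    """Score every approach on the same verified eval set (exact-match accuracy)."""
    if not eval_set:
        raise ValueError("eval set is empty")
    assert_frozen(eval_set, expected)
    return {
        name: sum(predict(ex["input"]) == ex["label"] for ex in eval_set) / len(eval_set)
        for name, predict in approaches.items()
    }
```

Here `approaches` maps a name such as "prompt-only" or "tuned" to a callable that takes an input and returns a prediction. Storing the fingerprint next to the results makes it possible to confirm later that two reported scores came from the identical set.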
