AI Prompt Engineering · Lesson

When to Fine-Tune

Signals that prompting has hit its limits.

Fine-Tuning Is an Evidence Decision

Fine-tuning is justified only when you can point to data showing that prompting has hit a wall. The trigger is never a hunch: it is a held-out eval on which the best honest prompt plateaus below your quality bar even after you have climbed the optimization ladder.

Tuning trades flexibility for consistency, lower per-call cost, and learned behavior.
The cost is a data pipeline, eval infrastructure, and re-tuning whenever the base model changes.
You must be able to name the specific failure that prompting could not fix.
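
The trade-off above can be made concrete with a rough break-even estimate. The sketch below is only an illustration: the function name `break_even_calls`, its parameters, and every number are hypothetical, and it assumes the up-front cost of tuning (data pipeline, eval infrastructure) can be lumped into a single figure.

```python
import math

def break_even_calls(fixed_cost, prompt_cost_per_call, tuned_cost_per_call):
    """Number of calls after which the tuned model is cheaper overall.

    Returns None when tuning never pays back (no per-call saving).
    """
    saving = prompt_cost_per_call - tuned_cost_per_call
    if saving <= 0:
        return None
    return math.ceil(fixed_cost / saving)

# Hypothetical figures in arbitrary cost units, for illustration only.
calls = break_even_calls(
    fixed_cost=4000.0,          # assumed one-off cost of pipeline + eval infra
    prompt_cost_per_call=0.75,  # assumed cost of the long prompted call
    tuned_cost_per_call=0.25,   # assumed cost of the shorter tuned call
)
print(calls)  # 8000
```

Because re-tuning recurs whenever the base model changes, the fixed cost is probably better read as a per-model-generation cost: the break-even volume has to be reached before the next re-tune, not just eventually.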

Signal 1: The Prompt Plateau

The clearest signal is a plateau on a frozen eval set. You add exemplars, decompose the task, add verifiers, and the score still stops improving while the remaining errors are systematic, not random.

Systematic residual errors (the model consistently mishandles the same construct) mean the behavior is hard to elicit through instruction. That is a tuning-shaped problem. Random, scattered errors usually mean the prompt or the data is still noisy, so keep iterating instead.

# Track eval score vs. prompt iteration; a flat tail means a plateau.
scores = [0.62, 0.71, 0.78, 0.79, 0.795, 0.796]  # diminishing returns

def plateaued(scores, window=3, eps=0.01):
    """True when the last `window` scores vary by less than `eps`."""
    if len(scores) < window:
        return False  # too few iterations to judge
    tail = scores[-window:]
    return (max(tail) - min(tail)) < eps

print(plateaued(scores))  # True -> prompting has stalled
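
A flat score is only half of the signal; the other half is whether the remaining errors are systematic. One rough way to check, sketched here under assumptions, is to tag each failed eval example with the construct it tripped on and see how concentrated the tags are. The function `top_error_share`, the tag names, and the 0.5 threshold are all hypothetical, not part of any standard method.

```python
from collections import Counter

def top_error_share(failure_tags):
    """Share of failures carrying the most common tag (0.0 if no failures)."""
    if not failure_tags:
        return 0.0
    counts = Counter(failure_tags)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(failure_tags)

# Hypothetical tags assigned by hand to each failed eval example.
failures = ["nested_negation"] * 7 + ["date_format", "typo", "units"]

share = top_error_share(failures)
print(share)  # 0.7

# Assumed rule of thumb: one construct causing most failures looks systematic.
looks_systematic = share >= 0.5
```

A high share points at one construct the model keeps mishandling, which matches the tuning-shaped case described above. A low share, with failures spread thinly across many tags, looks more like noise in the prompt or data.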
