When to Fine-Tune
Signals that prompting has hit its limits.
Fine-Tuning Is an Evidence Decision
Fine-tuning is justified only when you can point at data showing that prompting hit a wall. The trigger is never a hunch; it is a held-out eval on which the best honest prompt plateaus below your quality bar even after you have climbed the optimization ladder.
- Tuning trades flexibility for consistency, lower per-call cost, and learned behavior
- The cost is a data pipeline, eval infra, and re-tuning on base-model churn
- You must be able to name the specific failure prompting could not fix
Signal 1: The Prompt Plateau
The clearest signal is a plateau on a frozen eval set. You add exemplars, decompose the task, add verifiers, and the score still stops improving while the remaining errors are systematic, not random.
Systematic residual errors (the model consistently mishandles the same construct) mean the behavior is hard to elicit through instruction. That is a tuning-shaped problem. Random scattered errors usually mean the prompt or data is still noisy - keep iterating instead.
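# Track eval score per prompt iteration; a flat tail signals a plateau
scores = [0.62, 0.71, 0.78, 0.79, 0.795, 0.796]  # diminishing returns

def plateaued(scores, window=3, eps=0.01):
    """Return True when the last `window` scores vary by less than `eps`."""
    if len(scores) < window:
        return False  # too few iterations to judge
    tail = scores[-window:]
    return (max(tail) - min(tail)) < eps

print(plateaued(scores))  # True -> prompting has stalled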
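A flat score is only half of the signal; the residual errors also need to be systematic rather than scattered. The sketch below is one rough way to check that, assuming each failing eval example has been hand-labeled with a failure tag. The tag names, the helper names error_concentration and looks_systematic, and the 0.5 threshold are all illustrative assumptions, not part of the original material.

```python
from collections import Counter

def error_concentration(error_tags, top_k=1):
    """Share of residual errors covered by the top_k most common failure tags."""
    if not error_tags:
        return 0.0
    counts = Counter(error_tags)
    top = sum(n for _, n in counts.most_common(top_k))
    return top / len(error_tags)

def looks_systematic(error_tags, threshold=0.5):
    # Assumption: if one tag explains at least `threshold` of the failures,
    # treat the residual errors as systematic. Tune this for your eval set.
    return error_concentration(error_tags) >= threshold

# Hypothetical failure tags from two different eval runs
systematic_run = ["nested_negation"] * 7 + ["date_format", "unit_conversion", "other"]
scattered_run = ["nested_negation", "date_format", "unit_conversion",
                 "typo", "truncation", "other"]

print(looks_systematic(systematic_run))  # one construct dominates the failures
print(looks_systematic(scattered_run))   # failures are spread across many tags
```

When the score has plateaued and a single construct dominates the failures, the problem looks tuning-shaped. When failures are spread thinly across many tags, the prompt or the data is the more likely culprit, so keep iterating.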