Evaluating and Deploying Your Fine-Tuned Model
Run quantitative evals comparing the base and fine-tuned model on held-out test cases, convert to GGUF for local inference, and serve via llama.cpp or vLLM.
Why Evaluation Must Come Before Deployment
A fine-tuned model that performs well on training data may perform worse than the base model on your actual production use case. The only way to know is rigorous evaluation. Fine-tuning can cause catastrophic forgetting (losing capabilities the base model had), over-specialization (performing well on your task but worse on adjacent tasks), or subtle regressions in safety behaviors. Never deploy a fine-tuned model without evaluating it against the base model on a representative test set.
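Over-specialization and forgetting tend to show up only when results are broken down by task type, so a single aggregate score can hide them. The following is a minimal sketch of a per-category comparison, assuming you already have a per-example score between 0.0 and 1.0 for each model. The function names, the `tolerance` parameter, and the sample scores are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def mean_by_category(scores, categories):
    """Average per-example scores (0.0-1.0) within each category."""
    buckets = defaultdict(list)
    for score, category in zip(scores, categories):
        buckets[category].append(score)
    return {cat: sum(vals) / len(vals) for cat, vals in buckets.items()}


def find_regressions(base_scores, tuned_scores, categories, tolerance=0.02):
    """Return categories where the fine-tuned model trails the base model.

    A category is reported when its fine-tuned mean is lower than the base
    mean by more than `tolerance` (an assumed, adjustable threshold).
    """
    base = mean_by_category(base_scores, categories)
    tuned = mean_by_category(tuned_scores, categories)
    return {
        cat: (base[cat], tuned[cat])
        for cat in base
        if tuned[cat] < base[cat] - tolerance
    }


# Hypothetical scores: the fine-tuned model improves on the target task
# but loses ground on an adjacent one.
categories = ['target_task', 'target_task', 'adjacent_task', 'adjacent_task']
base_scores = [0.0, 1.0, 1.0, 1.0]
tuned_scores = [1.0, 1.0, 1.0, 0.0]

print(find_regressions(base_scores, tuned_scores, categories))
# -> {'adjacent_task': (1.0, 0.5)}
```

In this made-up example both models average 0.75 overall, so only the per-category view reveals that the fine-tuned model regressed on the adjacent task.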
Building a Held-Out Test Set
Your test set must be completely separate from the training and validation data: it should contain only examples the model has never seen during any phase of training. It should represent the full distribution of production inputs, including common cases, edge cases, and adversarial inputs. For instruction-following tasks, include examples that require following every aspect of the instruction, not just the most common ones. A test set of 100-500 examples is typically sufficient to detect large differences between models. As a rough guide, a measured accuracy carries a 95% margin of error of up to about ±10 percentage points at 100 examples and about ±4 at 500, so smaller differences need more examples.
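One way to check the separation requirement is to look for test inputs that also appear in the training data. The sketch below is one possible approach, not a complete deduplication method: it catches only inputs that are identical after lowercasing and whitespace normalization, and it will miss paraphrases. The function names and sample inputs are hypothetical.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Hash a case- and whitespace-normalized version of the text."""
    normalized = ' '.join(text.lower().split())
    return hashlib.sha256(normalized.encode('utf-8')).hexdigest()


def find_leaked_inputs(train_inputs, test_inputs):
    """Return test inputs whose normalized form also occurs in training."""
    train_prints = {fingerprint(t) for t in train_inputs}
    return [t for t in test_inputs if fingerprint(t) in train_prints]


# Hypothetical inputs: the first test input duplicates a training input
# apart from casing and spacing.
train_inputs = ['Summarize this ticket.', 'Extract the invoice total.']
test_inputs = ['summarize   this ticket.', 'Classify the sentiment.']

print(find_leaked_inputs(train_inputs, test_inputs))
# -> ['summarize   this ticket.']
```

Any input this check reports should be removed from the test set before it is used for evaluation.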
import json
from typing import TypedDict


class TestCase(TypedDict):
    input: str  # the user message
    expected_output: str  # the ideal response
    category: str  # e.g., 'format', 'accuracy', 'edge_case'
    evaluation_method: str  # 'exact_match', 'json_schema', 'llm_judge'


# Load test set (never used during training)
def load_test_set(path: str) -> list[TestCase]:
    cases: list[TestCase] = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines instead of failing in json.loads
            data = json.loads(line)
            metadata = data.get('metadata', {})
            # Assumes each record ends with a user message followed by
            # the assistant response.
            cases.append({
                'input': data['messages'][-2]['content'],  # user message
                'expected_output': data['messages'][-1]['content'],  # assistant response
                'category': metadata.get('category', 'general'),
                'evaluation_method': metadata.get('eval_method', 'llm_judge'),
            })
    return cases


test_set = load_test_set('test.jsonl')
print(f'Test set loaded: {len(test_set)} examples')
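Each test case carries an `evaluation_method` field, which suggests routing every model output to a matching scorer. The following is a minimal sketch of that dispatch, under stated assumptions: `score_json_keys` is a simplified stand-in for real schema validation (it only compares top-level keys), `'llm_judge'` cases are returned as `None` because they need a separate model call, and all function names here are hypothetical.

```python
import json


def score_exact_match(output: str, expected: str) -> float:
    """1.0 if the outputs match after trimming whitespace, else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def score_json_keys(output: str, expected: str) -> float:
    """Simplified stand-in for schema validation.

    Scores 1.0 when both strings parse as JSON objects with the same
    top-level keys. Values and nested structure are not checked.
    """
    try:
        got = json.loads(output)
        want = json.loads(expected)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(got, dict) or not isinstance(want, dict):
        return 0.0
    return 1.0 if set(got) == set(want) else 0.0


SCORERS = {
    'exact_match': score_exact_match,
    'json_schema': score_json_keys,
}


def score_case(case: dict, model_output: str):
    """Score one output, or return None if no automatic scorer applies."""
    scorer = SCORERS.get(case['evaluation_method'])
    if scorer is None:
        return None  # e.g. 'llm_judge': needs a separate judging step
    return scorer(model_output, case['expected_output'])


# Hypothetical test case in the same shape as TestCase.
case = {
    'input': 'Return the invoice total as JSON.',
    'expected_output': '{"total": 42}',
    'category': 'format',
    'evaluation_method': 'json_schema',
}

print(score_case(case, '{"total": 40}'))       # -> 1.0 (keys match, values ignored)
print(score_case(case, 'The total is 40.'))    # -> 0.0 (not valid JSON)
```

Running the same scorers over the base model's outputs and the fine-tuned model's outputs yields per-example scores that can be compared directly.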