AI Engineering Academy · Lesson

Evaluating and Deploying Your Fine-Tuned Model

Run quantitative evals comparing the base and fine-tuned model on held-out test cases, convert to GGUF for local inference, and serve via llama.cpp or vLLM.

Why Evaluation Must Come Before Deployment

A fine-tuned model that performs well on training data may perform worse than the base model on your actual production use case. The only way to know is rigorous evaluation. Fine-tuning can cause catastrophic forgetting (losing capabilities the base model had), over-specialization (performing well on your task but worse on adjacent tasks), or subtle regressions in safety behaviors. Never deploy a fine-tuned model without evaluating it against the base model on a representative test set.
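One way to surface the regressions described above is to compare pass rates per category rather than a single overall score, since an aggregate number can hide a drop in one area. The following is a minimal sketch, assuming each model's outputs have already been scored as pass or fail; the function names, the `(category, passed)` result format, and the sample data are all hypothetical.

```python
from collections import defaultdict

def pass_rates_by_category(results):
    """Map each category to its pass rate, given (category, passed) pairs."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        passes[category] += int(passed)
    return {c: passes[c] / totals[c] for c in totals}

def find_regressions(base_results, tuned_results, tolerance=0.0):
    """Return categories where the fine-tuned pass rate trails the base rate
    by more than `tolerance`, as {category: (base_rate, tuned_rate)}."""
    base = pass_rates_by_category(base_results)
    tuned = pass_rates_by_category(tuned_results)
    regressions = {}
    for category, base_rate in base.items():
        # A category with no fine-tuned results is treated as a 0.0 pass rate.
        tuned_rate = tuned.get(category, 0.0)
        if base_rate - tuned_rate > tolerance:
            regressions[category] = (base_rate, tuned_rate)
    return regressions

# Hypothetical scored results for the same four test cases.
base_results = [('format', False), ('format', True),
                ('general', True), ('general', True)]
tuned_results = [('format', True), ('format', True),
                 ('general', True), ('general', False)]

regressions = find_regressions(base_results, tuned_results)
print(regressions)  # {'general': (1.0, 0.5)}
```

In this made-up data both models pass three of four cases overall, yet the fine-tuned model has regressed on the 'general' category, which is the kind of over-specialization an aggregate score would miss.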

Building a Held-Out Test Set

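Your test set must be completely separate from training and validation data: it should contain only examples the model has never seen during any phase of training. The test set should represent the full distribution of production inputs, including common cases, edge cases, and adversarial inputs. For instruction-following tasks, include examples that require following every aspect of the instruction, not just the most common ones. A test set of 100-500 examples is typically enough to detect large differences between models, but the uncertainty is not negligible: at an observed accuracy of 80%, the 95% confidence interval is roughly ±8 percentage points with 100 examples and ±3.5 points with 500, so small gaps between the base and fine-tuned model need the larger end of that range or more.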
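The separation requirement can be checked mechanically before any evaluation is run. The sketch below flags test inputs that also appear in the training data after light normalization (lowercasing and whitespace collapsing). The helper names and sample strings are hypothetical, and this only catches exact or near-exact duplicates; paraphrased leakage would need a fuzzier comparison.

```python
import hashlib
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variations still match."""
    return re.sub(r'\s+', ' ', text.strip().lower())

def fingerprint(text: str) -> str:
    """Stable hash of the normalized text, cheap to store in a set."""
    return hashlib.sha256(normalize(text).encode('utf-8')).hexdigest()

def find_leaked_inputs(train_inputs: list[str], test_inputs: list[str]) -> list[str]:
    """Return the test inputs whose normalized form also occurs in training."""
    train_hashes = {fingerprint(t) for t in train_inputs}
    return [t for t in test_inputs if fingerprint(t) in train_hashes]

# Hypothetical inputs: the first test item differs from a training item
# only in letter case and spacing.
train_inputs = ['Summarize this ticket:  printer offline',
                'Extract the date from: due 2024-05-01']
test_inputs = ['summarize this ticket: printer offline',
               'Classify the sentiment: great service']

leaked = find_leaked_inputs(train_inputs, test_inputs)
print(leaked)  # ['summarize this ticket: printer offline']
```

Any input reported here should be removed from the test set (or from training) before the results are trusted.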

import json
from typing import TypedDict

class TestCase(TypedDict):
    input: str                 # the user message
    expected_output: str       # the ideal response
    category: str              # e.g., 'format', 'accuracy', 'edge_case'
    evaluation_method: str     # 'exact_match', 'json_schema', 'llm_judge'

# Load the test set (never used during training).
# Assumes each JSONL record ends with a user message followed by
# the assistant response.
def load_test_set(path: str) -> list[TestCase]:
    cases: list[TestCase] = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:  # skip blank lines; json.loads('') would raise
                continue
            data = json.loads(line)
            metadata = data.get('metadata', {})
            cases.append({
                'input': data['messages'][-2]['content'],  # user message
                'expected_output': data['messages'][-1]['content'],  # assistant response
                'category': metadata.get('category', 'general'),
                'evaluation_method': metadata.get('eval_method', 'llm_judge'),
            })
    return cases

test_set = load_test_set('test.jsonl')
print(f'Test set loaded: {len(test_set)} examples')
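Once the test cases are loaded, each model output has to be scored according to its `evaluation_method`. The following is a minimal sketch of one way to dispatch on that field. The scorer functions are hypothetical, and the 'json_schema' scorer is a simplified stand-in that only compares top-level keys rather than validating against a real JSON Schema. The 'llm_judge' method needs a separate judge model and is left unscored here.

```python
import json

def score_exact_match(expected: str, actual: str) -> bool:
    """Pass if the output equals the expected text, ignoring outer whitespace."""
    return expected.strip() == actual.strip()

def score_json_keys(expected: str, actual: str) -> bool:
    """Simplified structural check: both strings must parse as JSON objects
    with the same set of top-level keys. Values are not compared."""
    try:
        expected_obj = json.loads(expected)
        actual_obj = json.loads(actual)
    except json.JSONDecodeError:
        return False
    if not isinstance(expected_obj, dict) or not isinstance(actual_obj, dict):
        return False
    return set(actual_obj) == set(expected_obj)

SCORERS = {
    'exact_match': score_exact_match,
    'json_schema': score_json_keys,
}

def score(case: dict, model_output: str):
    """Return True/False for automatic methods, or None when the method
    (for example 'llm_judge') has no automatic scorer in this sketch."""
    scorer = SCORERS.get(case['evaluation_method'])
    if scorer is None:
        return None
    return scorer(case['expected_output'], model_output)

# Hypothetical test case in the TestCase shape used above.
case = {
    'input': 'Return the customer as JSON.',
    'expected_output': '{"name": "Ada", "age": 36}',
    'category': 'format',
    'evaluation_method': 'json_schema',
}

print(score(case, '{"age": 40, "name": "Bob"}'))  # True: same keys
print(score(case, 'not json'))                    # False: does not parse
```

Returning `None` for unscored methods keeps those cases out of the pass-rate arithmetic instead of silently counting them as failures.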
