AI Engineering Academy · Lesson

Preparing a High-Quality Training Dataset

Collect, clean, and format instruction-following data in the Alpaca and ShareGPT formats, apply data deduplication, and split into train and validation sets.

Data Is the Most Important Fine-Tuning Factor

In fine-tuning, the quality of your training data usually matters more than any hyperparameter, architecture choice, or training technique. As a rule of thumb, 100 high-quality examples can outperform 10,000 mediocre ones. Garbage in, garbage out: the fine-tuned model will faithfully reproduce whatever patterns exist in your data, including mistakes, biases, and format inconsistencies. Investing in data quality is the highest-leverage action in any fine-tuning project.

Instruction-Following Formats

Most fine-tuning for instruction-following tasks uses a conversational message format with system, user, and assistant roles. OpenAI's fine-tuning API expects JSONL files in which each line is one complete conversation example. The Alpaca format (instruction/input/output fields) and the ShareGPT format (a conversations list of turns) are also widely used. Choose the format that matches the fine-tuning framework you plan to use, and keep every example in that one format.
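The ShareGPT layout and the conversion between formats can be sketched as follows. This is a minimal illustration, assuming the common ShareGPT convention of a conversations list whose turns carry from and value keys with human/gpt speaker labels; exact key names vary between tools, so check what your framework expects. The helper names sharegpt_to_messages and alpaca_to_messages are hypothetical, introduced only for this example.

```python
# Assumed ShareGPT-style record: a 'conversations' list of turns, each with
# 'from' (system/human/gpt) and 'value'. Key names differ between tools.
sharegpt_example = {
    'conversations': [
        {'from': 'system', 'value': 'You are a JSON extraction agent.'},
        {'from': 'human', 'value': 'Extract: "John Smith, age 32, from Seattle"'},
        {'from': 'gpt', 'value': '{"name": "John Smith", "age": 32, "city": "Seattle"}'},
    ]
}

# Map ShareGPT speaker labels onto system/user/assistant roles.
ROLE_MAP = {'system': 'system', 'human': 'user', 'gpt': 'assistant'}


def sharegpt_to_messages(example):
    """Convert a ShareGPT-style record to the role/content message format."""
    return {
        'messages': [
            {'role': ROLE_MAP[turn['from']], 'content': turn['value']}
            for turn in example['conversations']
        ]
    }


def alpaca_to_messages(example, system_prompt=None):
    """Convert an Alpaca-style record to the role/content message format."""
    user_content = example['instruction']
    if example.get('input'):
        # Alpaca keeps the task and its input separate; join them for the user turn.
        user_content += '\n\n' + example['input']
    messages = []
    if system_prompt:
        messages.append({'role': 'system', 'content': system_prompt})
    messages.append({'role': 'user', 'content': user_content})
    messages.append({'role': 'assistant', 'content': example['output']})
    return {'messages': messages}
```

Converting everything into a single target format early makes later steps such as validation and deduplication simpler, because they only need to handle one record shape.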

import json

# OpenAI fine-tuning format (JSONL)
# Each line is one training example
openai_example = {
    'messages': [
        {'role': 'system', 'content': 'You are a JSON extraction agent.'},
        {'role': 'user', 'content': 'Extract: "John Smith, age 32, from Seattle, joined 2023-01-15"'},
        {'role': 'assistant', 'content': '{"name": "John Smith", "age": 32, "city": "Seattle", "join_date": "2023-01-15"}'}
    ]
}

# Alpaca format
alpaca_example = {
    'instruction': 'Extract structured data from the following text.',
    'input': 'John Smith, age 32, from Seattle, joined 2023-01-15',
    'output': '{"name": "John Smith", "age": 32, "city": "Seattle", "join_date": "2023-01-15"}'
}

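# Write as JSONL: one JSON object per line
examples = [openai_example]  # Add more examples here...
with open('train.jsonl', 'w', encoding='utf-8') as f:
    for example in examples:
        f.write(json.dumps(example, ensure_ascii=False) + '\n')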
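The lesson summary also calls for deduplication and a train/validation split. The sketch below shows one simple way to do both for examples in the messages format. It is an illustration under stated assumptions, not a prescribed pipeline: it removes only exact duplicates after lowercasing and whitespace normalization (near-duplicate detection would need a fuzzier method), and the helper names, the 10% validation fraction, and the seed are all hypothetical choices.

```python
import hashlib
import json
import random


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return ' '.join(text.lower().split())


def dedupe(examples):
    """Drop exact duplicates (after normalization), keeping first-seen order."""
    seen = set()
    unique = []
    for ex in examples:
        # Fingerprint the whole conversation, including who said what.
        key = '\n'.join(
            m['role'] + ':' + normalize(m['content']) for m in ex['messages']
        )
        digest = hashlib.sha256(key.encode('utf-8')).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(ex)
    return unique


def train_val_split(examples, val_fraction=0.1, seed=42):
    """Shuffle a copy deterministically, then split off a validation set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    # Hold out at least one example whenever there are two or more.
    n_val = max(1, round(len(shuffled) * val_fraction)) if len(shuffled) > 1 else 0
    return shuffled[n_val:], shuffled[:n_val]


def write_jsonl(path, examples):
    """Write one JSON object per line."""
    with open(path, 'w', encoding='utf-8') as f:
        for ex in examples:
            f.write(json.dumps(ex, ensure_ascii=False) + '\n')
```

A typical call sequence would be unique = dedupe(all_examples), then train, val = train_val_split(unique), then write_jsonl('train.jsonl', train) and write_jsonl('val.jsonl', val). Deduplicating before splitting matters: otherwise the same example can land in both sets and make validation scores look better than they are.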
