Preparing a High-Quality Training Dataset
Collect, clean, and format instruction-following data in the Alpaca and ShareGPT formats, apply data deduplication, and split into train and validation sets.
Data Is the Most Important Fine-Tuning Factor
In fine-tuning, the quality of your training data matters more than any hyperparameter, architecture choice, or training technique. 100 high-quality examples outperform 10,000 mediocre ones. Garbage in, garbage out — the fine-tuned model will faithfully reproduce whatever patterns exist in your data, including mistakes, biases, and format inconsistencies. Investing in data quality is the highest-leverage action in any fine-tuning project.
Instruction-Following Formats
Most fine-tuning for instruction-following tasks uses a conversational message format with system, user, and assistant roles. OpenAI's fine-tuning API uses JSONL files where each line is a complete conversation example. The Alpaca format (instruction/input/output) and ShareGPT format (conversations list) are also widely used. Choose the format that matches the fine-tuning framework you plan to use.
import json
# OpenAI fine-tuning format (JSONL)
# Each line is one training example
openai_example = {
'messages': [
{'role': 'system', 'content': 'You are a JSON extraction agent.'},
{'role': 'user', 'content': 'Extract: "John Smith, age 32, from Seattle, joined 2023-01-15"'},
{'role': 'assistant', 'content': '{"name": "John Smith", "age": 32, "city": "Seattle", "join_date": "2023-01-15"}'}
]
}
# Alpaca format
alpaca_example = {
'instruction': 'Extract structured data from the following text.',
'input': 'John Smith, age 32, from Seattle, joined 2023-01-15',
'output': '{"name": "John Smith", "age": 32, "city": "Seattle", "join_date": "2023-01-15"}'
}
# Write as JSONL
with open('train.jsonl', 'w') as f:
f.write(json.dumps(openai_example) + '\n')
# Add more examples here...All lessons in this course
- When Fine-Tuning Beats Prompting
- Preparing a High-Quality Training Dataset
- LoRA Fine-Tuning with Hugging Face PEFT
- Evaluating and Deploying Your Fine-Tuned Model