LoRA and QLoRA for Cost-Efficient Tuning
Train only low-rank adapters on a quantized base — fits a 70B fine-tune on a single GPU.
Why PEFT?
Full fine-tuning of Llama 70B needs 8+ H100s and updates all 70 billion parameters. PEFT (Parameter-Efficient Fine-Tuning) updates fewer than 1% of the parameters while often reaching similar quality, which cuts training cost substantially.
LoRA: Low-Rank Adaptation
LoRA (Hu et al. 2021) trains tiny "adapter" matrices alongside frozen base weights:
- Add two matrices, A (r x d) and B (d x r), whose product BA has the same d x d shape as the frozen weight W; the rank r is small (8, 16, 64)
- Forward pass: y = Wx + (BA)x
- Only A and B are trained: 2dr parameters per adapted matrix instead of d^2, which is orders of magnitude fewer when r is much smaller than d
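
The forward pass and the parameter savings above can be sketched in a few lines of NumPy. This is a minimal illustration, not a training implementation: the sizes d = 4096 and r = 8 are assumed for the example, `lora_forward` is a hypothetical helper name, and the zero initialization of B follows the original LoRA paper so that the adapter starts out as a no-op.

```python
import numpy as np

def lora_forward(x, W, A, B):
    """Compute y = Wx + (BA)x without materializing the d x d product BA."""
    return W @ x + B @ (A @ x)

# Illustrative sizes (assumed, not taken from any specific model).
d, r = 4096, 8
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d), dtype=np.float32)         # frozen base weight
A = rng.standard_normal((r, d), dtype=np.float32) * 0.01  # trainable, r x d
B = np.zeros((d, r), dtype=np.float32)                    # trainable, d x r, starts at zero
x = rng.standard_normal(d, dtype=np.float32)

y = lora_forward(x, W, A, B)

# With B = 0 the adapter contributes nothing, so the output matches the base layer.
assert np.allclose(y, W @ x)

full_params = d * d      # parameters in W
lora_params = 2 * d * r  # parameters in A and B combined
print(f"trainable fraction: {lora_params / full_params:.4%}")
```

Computing B @ (A @ x) rather than (B @ A) @ x keeps the extra work at roughly 2dr multiply-adds per input and avoids ever building a full d x d update matrix.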