LLM Fine-Tuning: When to Do It and How to Do It Right

The practical guide to fine-tuning large language models — when it outperforms prompting, what it costs, and how to execute.

Fine-tuning is one of the most misunderstood techniques in applied AI. Many teams jump to fine-tuning as a solution for poor model performance without first exhausting prompt engineering — which is almost always faster and cheaper. But when fine-tuning is the right tool, it can achieve capabilities that prompting alone cannot. Here's the decision framework.

When Fine-Tuning Is the Right Choice

Fine-tuning is the right choice in four scenarios. First, when you need a consistent style or format that prompt engineering can't reliably achieve: fine-tuning a model on examples of the desired style is more reliable than prompt instructions alone. Second, when latency or cost requirements make using large models impractical: fine-tuned smaller models can match large-model quality on specific tasks at a fraction of the cost. Third, when you need domain-specific knowledge not present in the base model's training data. Fourth, when you have hundreds to thousands of high-quality examples of the task being performed well, enough to teach the model the skill.

Fine-tuning is the wrong choice when: you haven't yet fully optimised your prompts, you have fewer than 50–100 high-quality examples, the task changes frequently (you'd need to retrain constantly), or the cost of the fine-tuning process isn't justified by the performance improvement.
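The checklist above can be written down as a small function, which is sometimes useful for making the decision explicit in a project review. This is only a sketch: the `Situation` fields, the function name, and the default threshold of 100 examples (the upper end of the 50–100 range above) are illustrative assumptions, not a standard.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Situation:
    """Hypothetical summary of a project's readiness for fine-tuning."""
    prompts_fully_optimised: bool
    high_quality_examples: int
    task_changes_frequently: bool
    gain_justifies_cost: bool


def reasons_not_to_fine_tune(s: Situation, min_examples: int = 100) -> list[str]:
    """Return the checklist items that argue against fine-tuning.

    An empty list means none of the four warning signs apply.
    """
    reasons = []
    if not s.prompts_fully_optimised:
        reasons.append("prompts not yet fully optimised")
    if s.high_quality_examples < min_examples:
        reasons.append(f"fewer than {min_examples} high-quality examples")
    if s.task_changes_frequently:
        reasons.append("task changes frequently, so retraining would be constant")
    if not s.gain_justifies_cost:
        reasons.append("expected improvement does not justify the cost")
    return reasons


# Example: plenty of data, but prompting has not been exhausted yet.
blockers = reasons_not_to_fine_tune(
    Situation(
        prompts_fully_optimised=False,
        high_quality_examples=800,
        task_changes_frequently=False,
        gain_justifies_cost=True,
    )
)
print(blockers)
```

An empty result does not prove fine-tuning is the right call; it only means none of the listed warning signs were raised.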

Data Preparation: The Most Important Step

Fine-tuning quality is almost entirely determined by training data quality. Garbage in, garbage out — a fine-tuned model learns to replicate whatever patterns are in your training data, including errors and inconsistencies.

Gold-standard training data has three characteristics: correctness (every example represents the desired behaviour), diversity (the examples cover the range of inputs the model will see in production), and volume (enough examples to reliably teach the target skill). For most fine-tuning tasks, 500–2000 high-quality, diverse examples are sufficient. Prioritise quality over quantity: 200 excellent examples will often outperform 2000 mediocre ones.
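Some data problems can be caught mechanically before any training run. The sketch below drops records with empty fields and exact duplicates after normalising whitespace and case. The `prompt` and `completion` field names and the `clean_examples` helper are assumptions made for illustration; adapt them to whatever schema your training format uses.

```python
def clean_examples(records):
    """Drop records with empty fields and exact duplicates.

    Duplicates are detected after collapsing whitespace and lowercasing,
    so trivially different copies of the same example are removed.
    """
    seen = set()
    kept = []
    for rec in records:
        # Collapse runs of whitespace so formatting noise does not hide duplicates.
        prompt = " ".join(str(rec.get("prompt", "")).split())
        completion = " ".join(str(rec.get("completion", "")).split())
        if not prompt or not completion:
            continue  # incomplete example: nothing useful to learn from
        key = (prompt.lower(), completion.lower())
        if key in seen:
            continue  # duplicate of an example already kept
        seen.add(key)
        kept.append({"prompt": prompt, "completion": completion})
    return kept


raw = [
    {"prompt": "Summarise: the cat sat.", "completion": "A cat sat."},
    {"prompt": "summarise:  the cat sat.", "completion": "A cat sat."},  # duplicate
    {"prompt": "Translate: bonjour", "completion": ""},                  # empty field
]
cleaned = clean_examples(raw)
print(len(cleaned))  # 1
```

A filter like this only removes malformed and repeated records. Whether each remaining example actually shows the desired behaviour still needs human review.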

LoRA and Parameter-Efficient Fine-Tuning

Full fine-tuning, which updates all model weights, is computationally expensive and risks catastrophic forgetting of base model capabilities. LoRA (Low-Rank Adaptation) and its quantised variant QLoRA mitigate both problems: they train only a small set of low-rank adapter weights and leave the base model weights frozen (QLoRA additionally stores the frozen weights in 4-bit precision). The result is efficient fine-tuning that largely preserves general capabilities while adding task-specific performance.
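To see why the adapter is small, consider a single weight matrix. LoRA leaves the original matrix frozen and trains two low-rank factors whose product has the same shape, so the trainable parameter count for that matrix is rank × (input dimension + output dimension). The 4096 × 4096 matrix size and rank of 8 below are illustrative assumptions, not values from any particular model or recipe.

```python
from __future__ import annotations


def lora_param_counts(d_in: int, d_out: int, rank: int) -> tuple[int, int]:
    """Return (full, adapter) parameter counts for one weight matrix.

    full:    parameters updated by full fine-tuning (d_out x d_in)
    adapter: parameters trained by LoRA, i.e. two factors of shape
             (d_out x rank) and (rank x d_in)
    """
    full = d_in * d_out
    adapter = rank * (d_in + d_out)
    return full, adapter


full, adapter = lora_param_counts(d_in=4096, d_out=4096, rank=8)
print(full)                       # 16777216
print(adapter)                    # 65536
print(f"{adapter / full:.2%}")    # 0.39%
```

Under these assumptions the adapter trains well under 1% of the parameters in that matrix, which is where the savings in compute and storage come from.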

LoRA fine-tuning a 7B model typically fits on a single A100 GPU and takes 2–8 hours of compute, depending on dataset size. QLoRA cuts the memory needed for the base weights by roughly 75% relative to 16-bit precision, making fine-tuning feasible on consumer GPUs. The LoRA adapters are small files (tens of MB) that can be loaded on top of the same base model, which makes it practical to maintain multiple fine-tuned variants for different tasks.
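The ~75% figure and the "tens of MB" adapter size can both be checked with back-of-envelope arithmetic. The sketch below counts weight storage only and ignores activations, gradients, optimiser state, and quantisation overhead, so real VRAM usage will be higher. The adapter configuration (32 layers, two adapted 4096 × 4096 matrices per layer, rank 16, 16-bit storage) is an illustrative assumption, not a recommended recipe.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Memory for storing the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


def adapter_size_mb(layers: int, matrices_per_layer: int,
                    d_in: int, d_out: int, rank: int,
                    bytes_per_param: int = 2) -> float:
    """Approximate on-disk size of a LoRA adapter, in megabytes (1 MB = 1e6 bytes)."""
    params = layers * matrices_per_layer * rank * (d_in + d_out)
    return params * bytes_per_param / 1e6


fp16_gb = weight_memory_gb(7e9, 16)    # 7B parameters at 16 bits each
int4_gb = weight_memory_gb(7e9, 4)     # the same weights at 4 bits each
reduction = 1 - int4_gb / fp16_gb

print(fp16_gb, int4_gb, reduction)     # 14.0 3.5 0.75

# Hypothetical adapter: 32 layers, 2 adapted matrices per layer, rank 16.
size_mb = adapter_size_mb(layers=32, matrices_per_layer=2,
                          d_in=4096, d_out=4096, rank=16)
print(round(size_mb, 1))               # 16.8
```

Adapting more matrices per layer or raising the rank scales the adapter size linearly, which is why adapter files range from a few MB to a few hundred MB in practice.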

Evaluation and Deployment

Fine-tuned models must be rigorously evaluated before deployment. Split your dataset into training (80%), validation (10%), and held-out test (10%) sets. Evaluate the fine-tuned model against the base model on your test set using both automated metrics (task-specific accuracy measures) and human evaluation (does the output look right to a domain expert?).
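The 80/10/10 split described above can be sketched with the standard library alone. Shuffling with a fixed seed keeps the split reproducible, so the test set stays the same across training runs. The `split_dataset` name and its defaults are assumptions for illustration.

```python
import random


def split_dataset(examples, seed=0, train_frac=0.8, val_frac=0.1):
    """Shuffle reproducibly, then split into train / validation / test.

    Whatever remains after the train and validation slices becomes the
    held-out test set, so no example is dropped or used twice.
    """
    items = list(examples)
    random.Random(seed).shuffle(items)  # seeded: same split on every run
    n_train = int(len(items) * train_frac)
    n_val = int(len(items) * val_frac)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


train, val, test = split_dataset(range(1000))
print(len(train), len(val), len(test))  # 800 100 100
```

A purely random split is only appropriate when examples are independent. If several examples come from the same source document or customer, splitting by source instead helps keep near-duplicates out of the test set.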

Watch for overfitting: a model that performs excellently on its training data but poorly on inputs it has not seen. If test scores are strong but production performance is weak, the test set is probably too similar to the training examples or unrepresentative of production traffic, so ensure it contains inputs genuinely different from training examples. Before production deployment, compare fine-tuned model performance against a well-optimised prompt with the base model; the performance gap needs to justify the added complexity of maintaining a fine-tuned model.
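Two of these checks are easy to automate: confirming that no test input also appears in the training set, and measuring the gap between the prompted baseline and the fine-tuned model on the same test set. The sketch below assumes exact-match scoring and uses hypothetical helper names. It only catches verbatim overlap after normalisation, not paraphrases or near-duplicates.

```python
def normalise(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def leaked_test_inputs(train_inputs, test_inputs):
    """Return test inputs that also occur (after normalisation) in training."""
    train_set = {normalise(t) for t in train_inputs}
    return [t for t in test_inputs if normalise(t) in train_set]


def accuracy(predictions, references):
    """Fraction of predictions that exactly match their reference."""
    if len(predictions) != len(references) or not references:
        raise ValueError("predictions and references must be non-empty and equal length")
    return sum(p == r for p, r in zip(predictions, references)) / len(references)


train_inputs = ["Classify: great product", "Classify: arrived broken"]
test_inputs = ["classify:  great product", "Classify: slow delivery"]
print(leaked_test_inputs(train_inputs, test_inputs))  # ['classify:  great product']

# Hypothetical outputs from both systems on the same four test items.
references = ["pos", "neg", "neg", "pos"]
baseline_preds = ["pos", "neg", "pos", "neg"]    # well-optimised prompt, base model
finetuned_preds = ["pos", "neg", "neg", "neg"]   # fine-tuned model
gap = accuracy(finetuned_preds, references) - accuracy(baseline_preds, references)
print(gap)  # 0.25
```

The labels and predictions here are made up to show the mechanics. With a real test set, a small gap is a signal that the prompted baseline may be the simpler system to maintain.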