Fine tune Gemma 3, Qwen3, Llama 4, Phi 4 and Mistral Small with Unsloth and Transformers

Transcribed Jun 15, 2026 Watch on YouTube ↗

Advanced 20 min read For: Machine learning engineers and data scientists with experience in fine-tuning LLMs.

24.7K

Views

762

Likes

65

Comments

8

Dislikes

3.3%

📈 Moderate

AI Summary

This video provides a comprehensive guide to fine-tuning open-source LLMs like Gemma 3, Qwen 3, Llama 4, Phi 4, and Mistral Small. It compares Unsloth and Transformers libraries, demonstrates fast evaluation with VLLM, and offers practical tips on hyperparameters and data preparation. The tutorial includes a live demo with real troubleshooting.

Chapters

1 Introduction and Why Fine-Tune 00:00 2 Data Preparation Overview 03:21 3 Unsloth vs Transformers 06:26 4 Fast Evaluation with VLLM 11:08 5 Model Selection and Fine-Tuning Tips 12:27 6 Live Demo: Baseline Evaluation 16:03 7 Live Demo: Fine-Tuning with Unsloth 38:40 8 Post Fine-Tuning Evaluation and Conclusion 63:39

[00:00]

Video Overview

Explains fine-tuning of latest open-source models, pros/cons of Unsloth vs Transformers, fast evaluation using VLLM, and hyperparameter tuning.

[01:36]

Why Fine-Tune?

Fine-tuning is a last resort after prompt engineering and retrieval. It improves answer structure, tool calling, accuracy beyond retrieval, and domain-specific reasoning.

[03:21]

Data Preparation Types

Two types: continued pre-training (raw data, difficult) and post-training (Q&A pairs, recommended for small data). Synthetic data generation using LLMs is advised.

[05:17]

Fine-Tuning Gets Trickier

Stronger models are harder to fine-tune; risk of regressing performance if data doesn't match model's reasoning style.

[06:26]

Unsloth vs Transformers

Unsloth is 2x faster, unified multimodal loading, but single GPU only. Transformers supports multi-GPU, more documentation, and easier access to advanced features.

[11:08]

Fast Evaluation with VLLM

Use VLLM for inference during evaluation; it's much faster than Transformers/Unsloth. Requires reloading the model after fine-tuning.

[12:27]

Which Model to Fine-Tune

Mistral Small (Apache 2, strong) is top recommendation. Gemma 3 (custom license) and Llama 4 (large, custom license) are alternatives. Qwen 3 (Apache 2, strong but censorship/backdoor risks).

[14:32]

General Fine-Tuning Tips

Spend 80-90% time on data prep. Define two eval sets (representative and verbatim copy) to measure overfitting. Inspect chat template for unwanted elements like dates.

[16:03]

Scripts and Setup

Scripts available at Trelis.com advanced fine-tuning repo. Three scripts: VLLM+Unsloth, VLLM+Transformers, pure Transformers. Uses RunPod with H100 GPU.

[19:57]

Baseline Evaluation

Evaluates Phi-4-mini on Touch Rugby QA dataset. Baseline score: ~5.3 correct out of 32 (multiple runs).

[38:40]

Fine-Tuning with Unsloth

Sets up LoRA adapters (rank 32), trains attention and MLP modules, uses custom scheduler (constant then linear decay). Training loss and eval loss decrease.

[59:52]

Post Fine-Tuning Evaluation

After fine-tuning with Transformers (since Unsloth model had VLLM compatibility issues), score improved to ~7.3 correct, showing positive effect.

Fine-tuning can improve model performance on domain-specific tasks, but requires careful data preparation and hyperparameter tuning. Unsloth offers speed and ease, while Transformers provides flexibility and broader compatibility.

Clickbait Check

95% Legit

"Title accurately describes the content: covers multiple models, both Unsloth and Transformers, and includes a detailed demo."

Mentioned in this Video

Unsloth

tool

Transformers

tool

VLLM

tool

RunPod

tool

Gemini API

service

Advanced Fine-Tuning Repo

link

Tutorial Checklist

1 16:03 Access scripts from Trelis.com advanced fine-tuning repo.

2 18:01 Set up a GPU pod on RunPod using a one-click template (e.g., H100).

3 18:50 Upload the Unsloth or Transformers notebook and requirements file to the pod.

4 20:04 Install dependencies: VLLM, Unsloth, etc. Restart kernel after installs.

5 21:53 Log into Hugging Face using a token.

6 22:09 Set model slug (e.g., 'microsoft/Phi-4-mini-instruct') and dataset (Trelis/touch-rugby-comprehensive).

7 23:40 Set up judge LLM (e.g., Gemini Flash) with API key for evaluation.

8 24:33 Load and inspect dataset; set test mode to false for full eval.

9 25:35 Load base model with VLLM for inference evaluation.

10 27:17 Run baseline evaluation: generate answers with VLLM, judge with Gemini, compute accuracy.

11 38:40 Switch to fine-tuning: uninstall VLLM, install Unsloth, restart kernel.

12 40:08 Set fine-tuning parameters: model, max_seq_length=8000, load_in_4bit=False (use 16-bit).

13 41:02 Load model with Unsloth's FastLanguageModel, print padding side and model architecture.

14 43:20 Inspect matrix dimensions to set LoRA alpha (e.g., sqrt(3000) ≈ 55).

15 45:56 Create PEFT model with LoRA: rank=32, target modules (e.g., q_proj, o_proj, gate_proj, up_proj, down_proj), use_rslora=True.

16 49:06 Load fine-tuning dataset and format prompts with chat template.

17 50:16 Set training arguments: batch_size=4, gradient_accumulation_steps=4, epochs=2, learning_rate=1e-4, custom scheduler (constant then linear decay).

18 53:22 Initialize trainer with model, tokenizer, datasets, and formatting function.

19 56:22 Configure loss masking to train only on assistant responses (completion tokens).

20 58:46 Start training; monitor loss curves via TensorBoard.

21 60:42 Save and push fine-tuned model to Hugging Face Hub.

22 63:39 Switch back to evaluation: uninstall Unsloth, reinstall VLLM, restart kernel.

23 64:09 Load fine-tuned model with VLLM (use model slug from Hub).

24 64:23 Run post-fine-tuning evaluation on the same eval dataset; compare accuracy.

Study Flashcards (10)

What are the two main libraries compared for fine-tuning in this video?

easy Click to reveal answer

Unsloth and Transformers.