TubeSum ← Transcribe a video

Fine tune Gemma 3, Qwen3, Llama 4, Phi 4 and Mistral Small with Unsloth and Transformers

Transcribed Jun 15, 2026 Watch on YouTube ↗
Advanced 20 min read For: Machine learning engineers and data scientists with experience in fine-tuning LLMs.
24.7K
Views
762
Likes
65
Comments
8
Dislikes
3.3%
📈 Moderate

AI Summary

This video provides a comprehensive guide to fine-tuning open-source LLMs like Gemma 3, Qwen 3, Llama 4, Phi 4, and Mistral Small. It compares Unsloth and Transformers libraries, demonstrates fast evaluation with VLLM, and offers practical tips on hyperparameters and data preparation. The tutorial includes a live demo with real troubleshooting.

[00:00]
Video Overview

Explains fine-tuning of latest open-source models, pros/cons of Unsloth vs Transformers, fast evaluation using VLLM, and hyperparameter tuning.

[01:36]
Why Fine-Tune?

Fine-tuning is a last resort after prompt engineering and retrieval. It improves answer structure, tool calling, accuracy beyond retrieval, and domain-specific reasoning.

[03:21]
Data Preparation Types

Two types: continued pre-training (raw data, difficult) and post-training (Q&A pairs, recommended for small data). Synthetic data generation using LLMs is advised.

[05:17]
Fine-Tuning Gets Trickier

Stronger models are harder to fine-tune; risk of regressing performance if data doesn't match model's reasoning style.

[06:26]
Unsloth vs Transformers

Unsloth is 2x faster, unified multimodal loading, but single GPU only. Transformers supports multi-GPU, more documentation, and easier access to advanced features.

[11:08]
Fast Evaluation with VLLM

Use VLLM for inference during evaluation; it's much faster than Transformers/Unsloth. Requires reloading the model after fine-tuning.

[12:27]
Which Model to Fine-Tune

Mistral Small (Apache 2, strong) is top recommendation. Gemma 3 (custom license) and Llama 4 (large, custom license) are alternatives. Qwen 3 (Apache 2, strong but censorship/backdoor risks).

[14:32]
General Fine-Tuning Tips

Spend 80-90% time on data prep. Define two eval sets (representative and verbatim copy) to measure overfitting. Inspect chat template for unwanted elements like dates.

[16:03]
Scripts and Setup

Scripts available at Trelis.com advanced fine-tuning repo. Three scripts: VLLM+Unsloth, VLLM+Transformers, pure Transformers. Uses RunPod with H100 GPU.

[19:57]
Baseline Evaluation

Evaluates Phi-4-mini on Touch Rugby QA dataset. Baseline score: ~5.3 correct out of 32 (multiple runs).

[38:40]
Fine-Tuning with Unsloth

Sets up LoRA adapters (rank 32), trains attention and MLP modules, uses custom scheduler (constant then linear decay). Training loss and eval loss decrease.

[59:52]
Post Fine-Tuning Evaluation

After fine-tuning with Transformers (since Unsloth model had VLLM compatibility issues), score improved to ~7.3 correct, showing positive effect.

Fine-tuning can improve model performance on domain-specific tasks, but requires careful data preparation and hyperparameter tuning. Unsloth offers speed and ease, while Transformers provides flexibility and broader compatibility.

Clickbait Check

95% Legit

"Title accurately describes the content: covers multiple models, both Unsloth and Transformers, and includes a detailed demo."

Mentioned in this Video

Tutorial Checklist

1 16:03 Access scripts from Trelis.com advanced fine-tuning repo.
2 18:01 Set up a GPU pod on RunPod using a one-click template (e.g., H100).
3 18:50 Upload the Unsloth or Transformers notebook and requirements file to the pod.
4 20:04 Install dependencies: VLLM, Unsloth, etc. Restart kernel after installs.
5 21:53 Log into Hugging Face using a token.
6 22:09 Set model slug (e.g., 'microsoft/Phi-4-mini-instruct') and dataset (Trelis/touch-rugby-comprehensive).
7 23:40 Set up judge LLM (e.g., Gemini Flash) with API key for evaluation.
8 24:33 Load and inspect dataset; set test mode to false for full eval.
9 25:35 Load base model with VLLM for inference evaluation.
10 27:17 Run baseline evaluation: generate answers with VLLM, judge with Gemini, compute accuracy.
11 38:40 Switch to fine-tuning: uninstall VLLM, install Unsloth, restart kernel.
12 40:08 Set fine-tuning parameters: model, max_seq_length=8000, load_in_4bit=False (use 16-bit).
13 41:02 Load model with Unsloth's FastLanguageModel, print padding side and model architecture.
14 43:20 Inspect matrix dimensions to set LoRA alpha (e.g., sqrt(3000) ≈ 55).
15 45:56 Create PEFT model with LoRA: rank=32, target modules (e.g., q_proj, o_proj, gate_proj, up_proj, down_proj), use_rslora=True.
16 49:06 Load fine-tuning dataset and format prompts with chat template.
17 50:16 Set training arguments: batch_size=4, gradient_accumulation_steps=4, epochs=2, learning_rate=1e-4, custom scheduler (constant then linear decay).
18 53:22 Initialize trainer with model, tokenizer, datasets, and formatting function.
19 56:22 Configure loss masking to train only on assistant responses (completion tokens).
20 58:46 Start training; monitor loss curves via TensorBoard.
21 60:42 Save and push fine-tuned model to Hugging Face Hub.
22 63:39 Switch back to evaluation: uninstall Unsloth, reinstall VLLM, restart kernel.
23 64:09 Load fine-tuned model with VLLM (use model slug from Hub).
24 64:23 Run post-fine-tuning evaluation on the same eval dataset; compare accuracy.

Study Flashcards (10)

What are the two main libraries compared for fine-tuning in this video?

easy Click to reveal answer

Unsloth and Transformers.

06:26

How much faster is Unsloth compared to Transformers for fine-tuning?

easy Click to reveal answer

Two times faster.

07:09

What is a key limitation of Unsloth regarding GPU support?

easy Click to reveal answer

Unsloth is single GPU only.

07:47

Why is VLLM recommended for evaluation instead of Transformers or Unsloth?

medium Click to reveal answer

VLLM supports continuous batching, making inference significantly faster.

11:16

What are the two types of training data sets mentioned?

medium Click to reveal answer

Continued pre-training (raw data) and post-training (Q&A pairs).

03:34

What is the recommended LoRA alpha value based on matrix dimensions?

hard Click to reveal answer

The square root of the smallest matrix dimension.

44:10

What is the purpose of using two evaluation data sets (representative and verbatim copy)?

medium Click to reveal answer

To measure overfitting by comparing performance on unseen data vs. training data.

14:44

Which model is recommended as the top choice for fine-tuning and why?

medium Click to reveal answer

Mistral Small, because it has an Apache 2 license and strong evaluation performance.

12:30

What issue can occur when fine-tuning a reasoning model without reasoning data?

hard Click to reveal answer

The model's performance may regress below its original reasoning performance.

05:36

What is the recommended bit precision for fine-tuning according to the presenter?

easy Click to reveal answer

16-bit (not 4-bit).

10:05

💡 Key Takeaways

⚖️

Fine-tuning as last resort

Emphasizes that fine-tuning should only be used after prompt engineering and retrieval, a key principle for efficient ML workflows.

01:36
💡

Fine-tuning gets trickier with stronger models

Highlights a nuanced challenge: stronger models are harder to fine-tune without regressing performance.

05:17
📊

Unsloth is 2x faster

Quantifies the speed advantage of Unsloth over Transformers, a practical consideration for practitioners.

07:09
🔧

VLLM for fast evaluation

Introduces a practical technique to speed up evaluation using VLLM's continuous batching.

11:16
🔧

Two eval sets to measure overfitting

Provides a concrete method to detect overfitting by comparing performance on representative and verbatim eval sets.

14:44

✂️ Creator Tools: Viral Hooks

AI-generated clip ideas for Shorts based on the transcript

Fine-tune 5 top open-source models now!

34s

Immediately lists the hottest models (Gemma 3, Qwen3, Llama 4, Phi 4, Mistral Small) and promises a practical guide, hooking AI enthusiasts.

▶ Play Clip

Unsloth vs Transformers: The truth

60s

Directly compares two popular fine-tuning libraries, revealing speed gains and limitations, which sparks debate among practitioners.

▶ Play Clip

Why fine-tuning is a last resort

50s

Challenges the common assumption that fine-tuning is always the answer, offering a contrarian perspective that resonates with experienced ML engineers.

▶ Play Clip

Model selection: Mistral vs Gemma vs Qwen

50s

Provides a ranked list of models with licensing and performance insights, helping viewers make a critical decision in their projects.

▶ Play Clip

Live fine-tuning demo with Unsloth

50s

Shows a real-time walkthrough of fine-tuning Phi-4, including troubleshooting and evaluation, which is highly educational for hands-on learners.

▶ Play Clip

[00:00] I'm going to explain how to fine-tune the latest open-source models, all the way from

[00:03] Gemma 3, Qwen 3, Llama 4, Phi 4, and Mistral Small.

[00:08] I'll explain the pros and cons of using Unsloth versus using Transformers.

[00:13] I'll explain also how to do fast evaluations using VLLM, separate to the fine-tuning.

[00:19] And then I'll go through in detail some of the techniques around how to set the hyperparameters

[00:24] to get the best results.

[00:26] For a video overview, I'll briefly describe why you might want to fine-tune.

[00:30] Hopefully you've got a sense already if you're watching this video, but I'll recap on that.

[00:35] I'll very briefly describe how to prepare data.

[00:38] You probably should be spending 90% of your time on data preparation, and I have a lot of

[00:42] videos covering that.

[00:44] I will link them here, but it's not going to be the focus of this video.

[00:47] Then I'll talk about Unsloth, which is a wrapper on Transformers.

[00:51] That brings significant improvements, but I'll talk about the pros and cons of using Unsloth

[00:55] versus Transformers for fine-tuning.

[00:57] Then I'll talk a little about running fast evaluations.

[01:01] You should be evaluating your performance before you fine-tune and afterwards and during as well.

[01:06] And I'll show you how you can speed that up a lot by using VLLM in the same notebook as

[01:11] you're doing the fine-tuning.

[01:12] Then I'll talk about which model to fine-tune out of the available open-source ones, considering

[01:17] factors like the license, licenses, and performance.

[01:21] I'll give a few general fine-tuning tips before then going to a live demo of fine-tuning

[01:27] using Unsloth and Transformer notebooks that I've put together in the advanced fine-tuning

[01:32] repo.

[01:34] So why should you fine-tune?

[01:36] Generally, this is ideally a last resort.

[01:40] You have tried out doing prompt engineering.

[01:42] You have included retrieval in your techniques, and you still need to improve performance.

[01:47] Often, that means one of a few things.

[01:49] You need to improve your answer structure and format.

[01:52] For example, you want the model to take a certain approach by maybe recapping on some of the

[01:57] information, processing it, and maybe reasoning over it, and then giving a final structured

[02:02] answer.

[02:03] Sometimes, that means you have tool calling.

[02:07] So you want to have a structured response that is going to call an assistive tool.

[02:12] So these are classic cases where fine-tuning can make sense because it gets the model to

[02:16] very consistently respond in a certain format.

[02:19] Now, there are two others you might consider as well.

[02:21] One is if you want to improve accuracy beyond just using a retrieval method.

[02:25] Back in one of my earlier videos, I show how you can combine retrieval with fine-tuning and

[02:32] get the best possible approach.

[02:33] So you can improve performance beyond just a retrieval approach only.

[02:37] And last of all, this is more modern, late 2024.

[02:41] You can try to fine-tune for specific reasoning within a domain.

[02:47] That's like GRPO, Group Relative Policy Optimization, and other related techniques.

[02:52] I have a series of videos from earlier in 2025 that cover that, and I'll let you try them

[02:57] out there.

[02:57] For today, though, we're going to focus not so much on reasoning, although I think I will

[03:02] make another reasoning video soon.

[03:04] We're going to focus on general fine-tuning to enhance knowledge.

[03:09] But a lot of the same principles apply if you wanted to fine-tune on structure, although

[03:13] I'll point you for more details on fine-tuning for JSON responses or function calling to this

[03:19] video right here.

[03:21] I'm not going to talk too much about data preparation, but I will point you to, well,

[03:27] I'll give you a few tips and point you to the key videos.

[03:29] Basically, there are two types of training and two types of data sets correspondingly at a

[03:34] very simple level.

[03:35] There is what's called continued pre-training with raw data.

[03:37] This is where you take, say, magazine content, newsletter articles, or books, and you feed that

[03:44] in without much pre-processing.

[03:46] I mean, you'll clean it up, but you're not going to change it into Q&A pairs.

[03:50] This continued pre-training here, typically this is difficult to do on top of an existing

[03:56] model because it can tend to undo the instruction type training that a model has.

[04:01] So often, unless you have a very large amount of data and you're willing to do continued

[04:06] pre-training followed by post-training, then it's probably not recommended to try and do

[04:11] the continued pre-training.

[04:12] Instead, what generally is recommended, particularly with smaller amounts of data, like, say,

[04:16] up to maybe 100,000 words or even a million, you do post-training on question and answer

[04:24] type data sets.

[04:25] And these are often data sets you synthetically create using documents and using LLMs to create

[04:31] questions and answers from those documents.

[04:32] Now, to prepare a synthetic data set, it's recommended to use a large language model.

[04:38] You want to generate not just questions and answers, but ideally questions, evaluation

[04:43] criteria, so what the criteria are for correct answers, and then high-quality answers.

[04:48] I have a video that came out recently around how you can prepare data this way.

[04:52] If you further want the answers to involve reasoning or to involve chain of thought, you probably

[04:58] need to augment them further.

[05:00] And I do have a video showing how, with these augmented answers, you can get up to very high

[05:06] levels of accuracy with fine-tuned models.

[05:08] And that's true whether you're fine-tuning open source or using APIs for fine-tuning models

[05:12] like OpenAI's API.

[05:14] Now, one caveat.

[05:17] I don't say this with 100% confidence, maybe 75% confidence.

[05:23] My sense is that as models get stronger, it's harder to fine-tune.

[05:26] And that's because the model starts off from a very good point.

[05:30] And in fine-tuning, you are risking damaging the model in some ways.

[05:35] I'll give you a simple example.

[05:36] If you take a reasoning model, it might perform quite well.

[05:39] And if you train it without reasoning data, you could just drag the performance below what

[05:44] the reasoning performance was, even though you're adding the right content to it.

[05:48] So I think in some ways, fine-tuning is getting a bit trickier.

[05:51] And if you are going to try and fine-tune for reasoning, and I mean reasoning not in a, say,

[05:59] technical application where you can use GRPO or even SFT like I described in one of my previous

[06:05] videos.

[06:06] Let's see this one here.

[06:08] But if you're trying to use reasoning in kind of a verbal domain, that's going to be tricky.

[06:13] It's something I want to cover maybe in a later video, but just kind of watch out, because

[06:17] if you don't have proper reasoning data sets developed, you do risk maybe regressing performance.

[06:23] Okay, so I'm going to talk about the libraries.

[06:26] There are quite a few libraries out there.

[06:28] Some of the most common are at least used at the smaller scale are Unsloth and Transformers.

[06:35] There are also libraries like Axotal, TorchTune.

[06:39] Those are two other examples.

[06:43] Maybe I'll go through those at some point.

[06:45] But the one that I've been using most often on this channel is either Transformers or Unsloth.

[06:49] Unsloth is effectively a wrapper.

[06:51] I mean, I don't mean that in a negative way.

[06:54] It's brought a lot of improvements to the Transformers library in terms of speed and just ease of use.

[06:59] So I want to just recap those here.

[07:01] I'm going to show you notebooks that do Unsloth only and then Transformers only, but it's worth

[07:07] appreciating the differences.

[07:09] So Unsloth generally is two times faster than Transformers, and that's because of a variety of tricks.

[07:16] It's not just one thing.

[07:17] It's like the accumulation of about five or six tricks that result in faster fine-tuning.

[07:22] Also, Unsloth provides a unified function for loading multimodal models.

[07:28] So if you have a model like Gemma 3, which is multimodal in sizes larger than 4B,

[07:33] you need to change the function that you use to import that model if you're going to use Transformers,

[07:38] whereas with Unsloth, it's just unified how you load that model.

[07:41] So that makes it a lot easier to have one script that supports a lot of different models.

[07:44] Unsloth for now is single GPU only.

[07:47] So if you have a very large model, you might need to use Transformers instead, which does

[07:53] support multi-GPU.

[07:54] By default, Transformers will use model parallel, so it will chunk up the model by layers, which

[08:01] means your GPUs are not all active when you're fine-tuning because you will use one GPU to

[08:06] process these layers, then the next GPU, then the next.

[08:09] That might sound inefficient, and it is, but it's quite simple and robust, so I wouldn't

[08:15] rule that out if you need to tune a larger model.

[08:18] Just use that simple approach.

[08:20] Now, you can take improved approaches called fully sharded using a fully sharded data parallel

[08:27] approach.

[08:28] That's where you split the model across GPUs, but you split the matrices essentially so that

[08:34] all GPUs are being used more or less at once.

[08:37] And you can find a video on fully sharded data parallel on this channel if you just look up

[08:42] FSDP.

[08:43] Something else about Unsloth, because it's a wrapper, sometimes there are issues where Transformers

[08:52] will move ahead, and maybe Unsloth does not quite support that.

[08:55] Also, if you're trying to use AI to help you code, there will tend to be more documentation

[09:00] on the advanced features in Transformers, and because everything is being wrapped by Unsloth,

[09:06] it's sometimes harder to access the features by working through Unsloth.

[09:09] So basically what I'm saying is if you're trying to use some more obscure functionality in Transformers,

[09:15] it may be easier to use Transformers than to try and use Unsloth and have to figure out how Unsloth has wrapped that.

[09:21] For now also, JAMA 3 is still broken in the sense that the configuration file won't allow you to run

[09:28] inference on VLLM, so if you do fine-tune a JAMA 3 model on Unsloth, I'll point you to an issue where

[09:34] there's potentially a fix.

[09:36] I expect that it will be fixed at some point, but something to keep a note on.

[09:41] And also, you should note that GPUs are getting very big, like a B200, which you can rent on RunPod now,

[09:47] I think, for $8 an hour.

[09:48] It's 192 gigabytes in VRAM.

[09:52] So if you have a model, you can probably fit a model that's up to 150 billion parameters in 8-bits.

[09:59] Unsloth allows you to fine-tune in 8-bits now, not just in 4-bits.

[10:03] I don't recommend 4-bit.

[10:05] I know it's very popular doing Qlora,

[10:07] but I have found, particularly because of merging back adapters, you can see small differences in

[10:13] performance that are hard to predict.

[10:15] So I generally recommend fine-tuning in 16-bits.

[10:18] 8-bit is probably quite good as well.

[10:21] So yeah, if you have a model that's 16-bits, you can train up to probably 80 or 70 billion parameters

[10:28] on a B200 with Unsloth, or in 8-bits, probably even something larger.

[10:33] So this is not even that much of a limitation.

[10:35] You're not going to be able to fine-tune DeepSeek using Unsloth.

[10:39] Now, DeepSeek anyway is probably hard to fine-tune, even with Transformers.

[10:43] But yeah, this is a limitation.

[10:47] If you wanted to fine-tune, let's say, Llama 4 and the Maverick version, you're not going to fit that on a single GPU.

[10:53] But you could fit the Scout version in a single GPU in 8-bits, because it's 100 billion parameters, roughly.

[10:59] And it's about one byte per parameter, so in 8-bits.

[11:03] So that would be fitting onto that GPU.

[11:06] Just a note on running evaluations.

[11:08] So you want to run evaluations before you fine-tune and after, and ideally during, to see if your fine-tuning is working.

[11:16] You can just run inference using Transformers or Unsloth, but neither of these libraries are designed for inference, so they don't do what's called continuous batching.

[11:25] You can send in a batch of tokens, so you have to size the batch so you don't run out of memory.

[11:30] You can slightly automate that, actually, through Transformers by having a test for the right batch size, and it will reduce it if it's too large.

[11:38] But the inference is just not optimal in terms of the back-end, and it will be significantly slower than using something like VLLM or SG-Lang.

[11:47] So I recommend, and that's why the scripts I'll show you today, they have two parts.

[11:50] They've got an inference part or an evaluation part that's run with VLLM, and then they've got a fine-tuning part that's either Transformers or Unsloth.

[11:58] And it's much faster to use VLLM.

[12:00] The drawback is you have to reload the model in VLLM after you've fine-tuned in Unsloth or Transformers.

[12:06] So there's kind of this trade-off.

[12:08] If you have a very big model, it can take a bit of time to load.

[12:11] But again, if you're only on one GPU, this is probably not going to be a big constraint, and reloading the model should be fairly fast,

[12:18] given it's already going to be on your disk from the fine-tuning.

[12:22] Now for some questions on which model to fine-tune.

[12:27] And I've listed them in a tentative order of preference.

[12:30] Mistral, I think, is Mistral small.

[12:33] It's less than 30 billion parameters.

[12:36] It's an Apache 2 license.

[12:38] And it tends to be strong in evaluations, just I've heard from customers as well,

[12:43] when I see results across the different models they've tried.

[12:46] So it would be one of my top recommendations.

[12:48] Gemma 3 is a very strong model as well, but the license is custom.

[12:53] So if you're at a bigger company where there's sign-off on the general open-source licenses,

[12:58] but there needs to be review of custom licenses, this is maybe adding a little bit more friction.

[13:03] And I've added some notes here.

[13:05] You can just click on these links to get more info on the licenses.

[13:09] Llama 4 from Microsoft is a permissive license.

[13:12] It also allows for reasoning.

[13:14] These top two models don't.

[13:15] So if you want to fine-tune for reasoning, maybe 5.4 is a good option.

[13:18] Llama 4 is a custom license, and it's also very large.

[13:22] The Scout model is 100 billion parameters, and I think it's unnecessarily big for the quality it provides.

[13:28] The quality is probably not much better than Gemma 27B, Gemma 3, or potentially even Mistral.

[13:35] So I recommend probably just using Mistral or Gemma 3 over Llama 4.

[13:40] Qwen 3 is a very strong model.

[13:43] I think it is probably stronger than all of these models here, and it's Apache 2.

[13:48] But you do have the issues that come with using Qwen or DeepSeek models,

[13:53] that there is strong censorship of the models, and also there is a backdoor risk with any language model here.

[14:00] But you have to weigh that in the context of where and how it was developed.

[14:04] These models are increasingly being used to control agents,

[14:08] and that provides an extra angle by which you can have danger if the model is controlling your agent in a malicious way

[14:15] and making tool calls that you don't want to be made.

[14:18] So Qwen 3, very strong, perhaps very good choice if you want to fine-tune for reasoning,

[14:22] but you do have to be careful of the censorship and the backdoor issues.

[14:26] Now, just a few general fine-tuning tips before I move to a demo.

[14:32] Spend 80% or 90% plus of your time on data preparation.

[14:37] That probably means watching some of the other videos.

[14:40] Second of all, define two evaluation data sets.

[14:44] One is a representative data set that's not in your training set.

[14:48] I explained in my data prep video how you can rephrase certain questions

[14:52] in order to make sure they're not verbatim in your training and your eval set.

[14:56] But I do recommend including a verbatim copy of some of your training data set,

[15:02] because by including a verbatim copy and also a version, a data set that's not in your training set,

[15:09] you can start to measure the difference in performance between these two and assess whether you are overfitting.

[15:16] Measuring overfitting is another reason to use the eval set during training.

[15:20] And then one kind of random tip here is do inspect the chat template being used when you're fine-tuning.

[15:32] Some chat templates have got the date included with them.

[15:36] And if they have the date, you probably don't want to be fine-tuning on today's date for all of your examples.

[15:42] So you may want to remove that when you're doing the fine-tuning.

[15:47] So other things that can appear unexpectedly may also adversely affect the quality.

[15:52] Okay, so that's it for the theory portion.

[15:57] I'm going to move now and show you how to fine-tune.

[16:03] And if you want to find the scripts I'm going to show you,

[16:06] they're available at Trelis.com and then advanced fine-tuning.

[16:11] And I've actually refreshed the repository.

[16:14] Let me show you here.

[16:16] I've just cloned it over using Windsurf.

[16:19] Historically, the fine-tuning repo was organized according to branches.

[16:23] So different branches would have different content.

[16:26] And those branches are still there.

[16:28] For example, if you want synthetic scripts,

[16:31] scripts for making a synthetic data,

[16:32] synthetic branch, distillation,

[16:34] low VRAM, full fine-tuning,

[16:36] retrieval, rag,

[16:38] using Wikipedia data for fine-tuning.

[16:42] There are a large variety of scripts and they're still there in different branches.

[16:44] But going forward, I'm going to leave all of the scripts within the main branch here.

[16:49] And I've started to create a clean folder for data prep.

[16:53] That was the most recent video.

[16:54] And now I've got a clean folder here for fine-tuning.

[16:57] And this is in VLLM unsloth,

[17:01] but I'm going to merge it into main.

[17:02] So you'll be able to find it there after the video.

[17:04] Now, there are three fine-tuning scripts that I have prepared.

[17:10] One is VLLM unsloth.

[17:13] It uses VLLM for evaluation and then unsloth for fine-tuning.

[17:16] This one here is VLLM transformers.

[17:19] So it uses VLLM for eval and transformers for fine-tuning.

[17:24] And then this one here is pure transformers.

[17:26] So it will just use transformers both for evaluation and for fine-tuning.

[17:32] And it will automatically set the right batch size for evaluation.

[17:35] But the evaluation is quite a bit slower than if you're using VLLM.

[17:41] So I'm going to show you a fine-tune using the unsloth script here.

[17:47] The transformer script more or less follows it.

[17:50] And when I go through this, I'll highlight a few points

[17:52] where there's a difference between using unsloth and the fine-tuning.

[17:56] Now, to get going on a GPU, I'm going to use RunPod

[18:01] and this one-click template affiliate link I have here.

[18:04] If you want to exactly replicate the environment I'm using,

[18:08] you can use this and pick a GPU.

[18:11] Now, here, I said 192.

[18:14] That was wrong.

[18:14] Sorry, I meant, I guess, what I should have said was 180.

[18:18] So if you run a B200 for $7, it looks like, yeah, $8 an hour.

[18:23] Maybe I said $7.

[18:24] You can fit 180 gigabytes of VRAM.

[18:27] We're just going to run with a H100, which is 80 gigabytes.

[18:30] We won't be running quite as big a model.

[18:32] So I'll just say fine-tuning with unsloth as the name.

[18:36] And you can see here that we've got a CUDA 12.1

[18:41] and PyTorch 2.2 template.

[18:43] So I'll get this going.

[18:45] Okay, so once the pod has started,

[18:47] we're going to connect and open up Jupyter.

[18:50] And then I'm going to upload my notebook, the unsloth one.

[18:55] Now, just when I'm uploading unsloth here,

[18:58] notice that I can also upload a requirements file.

[19:02] So you can either install the latest version of the dependencies

[19:05] or you can install from the requirements file

[19:08] if you want to ensure reproducibility.

[19:10] Okay, so I apologize for hurting the eyes of those who are sensitive to white light.

[19:16] I have swapped this over to dark mode.

[19:19] And we're going to start off with an evaluation.

[19:23] Throughout this video, we're going to use a HuggingFace data set.

[19:28] Well, it's a Trelis data set called Touch Rugby and Comprehensive.

[19:34] It's the QA data set I generated in a recent video.

[19:37] And it consists of a series of questions and answers

[19:42] and then evaluation criteria for marking a given answer correct.

[19:48] And these were all generated using, I think, the Gemini Pro 2.5 model

[19:54] based on a document with the Touch Rugby rules.

[19:57] So the first thing I'm going to do is I'm going to run some installations here.

[20:04] Sometimes if you've run the fine tuning first with unsloth,

[20:09] you might need to uninstall unsloth.

[20:11] But since we're starting fresh here, we don't need to uninstall unsloth.

[20:16] We'll just run ahead and do the installs, the most important of which is VLLM here.

[20:20] Now, I'm just going to train one model.

[20:23] I'm actually going to train, I think, the Phi Mini Instruct model, Phi 4.

[20:27] It's a new model I haven't trained.

[20:29] So I'm kind of curious what the results will be and see if we hit any issues.

[20:33] But I will give some commentary as we go about some of the other models.

[20:38] So my first comment here is if you're using the QEN model,

[20:42] if you want to fine tune and disable reasoning,

[20:45] using VLLM, you do need the latest install from source of VLLM.

[20:50] It does take quite a bit of time to install.

[20:52] So just a heads up, if you're going to use QEN 3 and you want to disable reasoning,

[20:57] you need to install VLLM from source for now.

[21:02] Right. So while that's installing, we'll move on here.

[21:04] Do restart the kernel after the installs are done.

[21:08] This is where I was saving my requirements to a TXT file.

[21:12] You could, if you wish, rather than running the installs here above,

[21:16] you could have done UVPIP install dash or requirements VLLM slot like this.

[21:23] VLLM on slot and then dash system.

[21:26] Now the dash system means that we're installing onto the system.

[21:30] We're not in a virtual environment.

[21:31] And this makes sense because when we started the Docker image on this GPU,

[21:36] we already had CUDA and PyTorch installed and we want to make use of those.

[21:40] We don't want everything to be packaged in a VNV.

[21:42] Okay.

[21:44] So this is still installing installs are done.

[21:48] So I'm going to restart the kernel and now I'm going to log into HuggingFace.

[21:53] So I'm going to, I won't add it as a Git credential.

[21:57] I will get a token and I've got a token and we're logged in now.

[22:03] Now I can save that.

[22:05] At least close it.

[22:07] And the model that we're going to evaluate.

[22:09] So we'll evaluate the base model.

[22:11] Then we'll do fine tuning and then we'll evaluate again.

[22:13] So the model I want to test is going to be a base model.

[22:18] And I'm going to set model slug equal to the phi model.

[22:23] And I've got the phi model up over here.

[22:26] And yeah, the dark mode is pretty poor on Safari.

[22:29] So I'll paste it in.

[22:31] And the data set is the touch rugby data set.

[22:35] The training split is called train.

[22:37] The eval split is called eval.

[22:39] There is a mirror eval split.

[22:41] This is literally a subset of the training set.

[22:44] So we can test over fitting.

[22:45] I talked about that in the data set prep video.

[22:48] And within this data set, there's a column for a question.

[22:50] There's a column for evaluation criteria, which we'll use for grading.

[22:54] And there's a column for answer, which we'll use for fine tuning.

[22:57] So the answer is for fine tuning.

[22:58] The evaluation criteria is for grading.

[23:02] And for setting up the judge, we're going to use Gemini Flash, Gemini 2.0.

[23:07] We're going to provide a long context length.

[23:09] It can be shorter, but if you're using reasoning, you need more length.

[23:12] And we'll use a low temperature for the grading here.

[23:16] This allows for fast weight downloads.

[23:18] And this makes sure that we download the weights into the disk.

[23:22] So we want, when we're on RunPod, we want the weights to go on the volume.

[23:26] We don't want it to go on the container.

[23:28] The volume here is about 500 gigabytes.

[23:30] The container is only, I think, maybe 10.

[23:33] So I'll run this.

[23:34] That just sets those variables.

[23:38] And now we're going to set up the judge.

[23:40] Now, it's asking me for an API key for Gemini, which makes sense because we're using the Flash model.

[23:46] So I'm just going to go over to AI Studio.

[23:52] I'll do this off screen.

[23:54] And I'm going to create an API key.

[23:56] So I'll paste in that API key.

[23:59] And that's set now.

[24:01] And just to briefly show you the judge, this is not yet the judge.

[24:04] It's just setting the API key.

[24:06] It's actually saving it to a local environment variables file so that if I rerun this cell, it won't force me to put in the API key again, which is nice.

[24:14] Here, we're setting up an OpenAI client.

[24:18] So we're hitting Gemini using an OpenAI client.

[24:22] And here we've just set up this function called chat that allows us to send messages in with a single message into Gemini.

[24:31] Okay, next we're going to prepare the data set.

[24:33] And when we prepare it, we can turn a test mode on.

[24:38] You'll notice that back here earlier when I defined the data set, there's a flag to set test equals to true.

[24:47] If you just want to look at a small sample of the data set, you can set test equals to true.

[24:51] And it will also print a lot more debug logs down below in the script.

[24:56] So I've got that set to false.

[24:58] And we're going to load the data set with the data set name.

[25:02] Check if there's train and eval splits.

[25:05] And yeah, if we're just inspecting, we're going to slice the data so that we're only going to take a few rows.

[25:13] So here we're potentially sampling if test is true.

[25:17] Otherwise, we're loading the full data set.

[25:19] And we're just printing out to make sure everything is in order here.

[25:22] So when I print the data and eval set, the train and eval set, there should be around, yeah, 244 training rows and then 32 in the eval split.

[25:32] Okay, now we're going to load the model to run inference.

[25:35] This is how you load with VLLM.

[25:37] You pass in the model slug, the GPU maximum memory utilization, the data type, and the max sequence length, which we set up earlier, and our sampling parameters.

[25:47] So here, when we're generating, we're going to use a temperature that's defined by temperature here, which I think I've set to 0.7.

[25:56] Top K of 40 and top P of 0.95.

[26:00] These are important because it makes sure that very low probability tokens are not accepted.

[26:04] You don't want tokens that are very low probability because they can throw your prompt or they can throw your completion off in an unexpected direction.

[26:11] So at this point, the model is going to be loaded.

[26:15] It'll be downloaded from Hugging Face first, and then the shards will be loaded onto the GPU.

[26:20] So this is typically where you hit problems if you're going to hit problems when you're trying to load a given model.

[26:25] So here you can see the smaller files have been loaded quickly.

[26:30] And now we're going to load the safe tensors.

[26:33] So you can see there's about 7 gigabytes.

[26:36] And notice here that we are using FlashInfer.

[26:40] FlashInfer is a relatively recent library.

[26:43] It's faster than FlashAttention.

[26:46] And if you have it installed, it will be used by VLLM.

[26:50] If you don't install it, I think it won't be used by default.

[26:54] So this model, Phi4mini, relatively small, about 3 to 4 billion parameters in size, pretty fast to load the weights.

[27:02] It's then going to create CUDA graphs, so it's going to calculate forward some paths for optimally doing the computations.

[27:08] This takes a bit of time, but then it makes the inference faster because you've pre-computed what's called the graph.

[27:15] Okay, so when the model is loaded, we can run evaluation.

[27:17] To run evaluation, we need to get answers, which is straightforward.

[27:21] We just pass the question to the VLLM model.

[27:23] But then we need to evaluate that.

[27:25] And to evaluate it, we're going to need a prompt.

[27:28] Here we have an expert evaluator tasked with determining if the answer satisfies the specified evaluation criteria.

[27:37] We're telling Gemini that it will receive a question, the evaluation criteria, and then the model's answer that needs to evaluate it.

[27:46] And it's just going to score a 1 or 0, so it's either right or wrong.

[27:49] And the prompt template is to pass in the question, the eval criteria, and the model answer.

[27:54] Okay, so this is pretty much it.

[27:58] We are going to define a helper.

[28:00] This is the evaluation result.

[28:01] We want the evaluator to first give a reason and then indicate whether the model is correct or not.

[28:07] We'll then extract using a regular expression the score, whether it's 1 or 0, to determine whether it's marked correct or not.

[28:14] And we also have this regular expression to extract any thinking.

[28:17] So if you're going to evaluate the answer, the shorter the answer, the easier to evaluate.

[28:23] And so if you include all of the thinking along with the answer, it's going to be harder to evaluate.

[28:27] And for that reason, if there is any thinking, we're going to strip the thinking here.

[28:31] Then we have a function to evaluate.

[28:35] And you can see it generates an answer.

[28:38] Sorry, this evaluate answer takes in a generated answer.

[28:44] It takes in a ground truth, an evaluation criteria in the question, and it's going to strip the reasoning and then pass it for judging here.

[28:52] Then we'll call the judge LLM and we'll get the response and parse that.

[28:59] And when we want to evaluate a model, here's where we need to create the messages with the problem, generate an answer, and then pass that answer in for evaluation.

[29:08] So that's the full loop.

[29:10] And just a note here that if you want to disable thinking, you can do that for QUEN3 models by passing in this parameter here.

[29:18] But you do need to install from source, at least for now.

[29:22] So we've defined that evaluation function, and we're just going to run it on a row.

[29:28] This is the fifth row of evaluation data.

[29:31] And it looks like my API key is invalid.

[29:35] So I'm going to go back up and I'm going to reset that API key.

[29:41] And to do that, I think I just need to search for the word reset.

[29:46] Okay.

[29:48] So, by the way, you can also select OpenAI for grading as well if you wish.

[29:52] But I'm going to use Gemini.

[29:54] And let's now put in my API key.

[29:59] I'm just going to get a new key here.

[30:02] I must have pasted in a wrong key.

[30:05] Okay.

[30:06] Let's try that.

[30:07] And we'll continue on down and see if that evaluates the model for us.

[30:12] And here, VLLM has failed, and that's because I need to reload the model.

[30:18] I shouldn't have rerun that.

[30:19] I should have just gone straight to the evaluation because the model was already loaded.

[30:24] So I hit a CUDA out of memory.

[30:26] See, VLLM will often use up the full memory for when it loads a model.

[30:31] It will pre-allocate that memory.

[30:33] So it's recommended don't rerun the model loading.

[30:37] Okay.

[30:38] So it is going to take a second now because we need to reload the model with VLLM.

[30:44] So, yeah, before I restart the kernel, I'm just going to set that reset back to false

[30:50] because I should have fixed my API key.

[30:54] And now I can run all the cells.

[30:56] So, yeah, the model is reloading.

[30:59] And I think sometimes the graph can be cached, so that improves the speed for loading as well.

[31:05] And you can see for this maximum sequence length, VLLM with this GPU is capable of concurrency of 56.

[31:12] So it will automate batching for us like that.

[31:15] And here we go.

[31:17] It's now evaluated.

[31:18] The question, what's the regulation about Touch Rugby participants covering the ball with their clothing?

[31:22] And here is a generated answer.

[31:24] And here is the evaluation criteria.

[31:26] And the judge has marked it correct because it's saying that intentionally covering the ball is a foul.

[31:34] So the final evaluation is one out of one correct.

[31:37] So this was just evaluating one question.

[31:40] But what we want to do is we want to evaluate a batch of questions.

[31:45] So we're going to use batching.

[31:46] We're going to have VLLM answer multiple questions in parallel.

[31:50] And then we're going to use threading to make parallel calls to Gemini's API.

[31:55] So that's what's happening here.

[31:56] We're batch evaluating the model by building a list of conversations.

[32:01] That's a list of all the different questions.

[32:04] We're passing that in to VLLM.

[32:07] We're passing it into model.chat here.

[32:09] Again, if you want to disable thinking, that's the line to include.

[32:13] And then we're going to judge using a thread pool executor that's going to make multiple parallel requests.

[32:19] And we'll get back the final score here.

[32:21] So you can now run a short test.

[32:24] You can set test to true.

[32:25] It'll just run two rows of the data set.

[32:28] And we can see how that does.

[32:31] And actually, sorry, it's running five rows.

[32:35] So that's why it's generated five responses.

[32:38] And then it's going to judge those.

[32:40] So here's a sample generation.

[32:43] In fact, it's just giving us the five generations first.

[32:45] And then it's giving us the judging of those five.

[32:48] And you can see the results coming out here.

[32:51] And it looks like we've gotten 40% correct.

[32:55] Now we're going to evaluate the full evaluation data set.

[33:00] But because we have temperature non-zero, this is not deterministic.

[33:04] So actually, when you evaluate, you want to evaluate multiple times on that same eval data set.

[33:10] Or maybe if you had a very large eval data set, you wouldn't have to do this.

[33:14] But because my data set is 32, you'll find some variance if you just run it once.

[33:18] So I recommend running it at least three times.

[33:20] And for that, I've got a function here that allows me to run it m number of times.

[33:26] So this is for running evaluations m number of times.

[33:30] And I'm going to just copy this here because I've run it previously on Gemma 3.

[33:36] Let's just create a new cell.

[33:38] And it's going to print out the data set name, the eval split name.

[33:43] So we're running the comprehensive data set, the eval split, and we're running the phi model.

[33:48] And we should be ready to go.

[33:51] Now, one thing I don't like here is I'm still in test mode.

[33:55] So I'm getting all of the logs here.

[33:57] So actually, what I need to do is set test equals false and run it again.

[34:02] And that will just suppress all of the detail logging.

[34:07] You can see here we're going to run on all of the prompts.

[34:11] And then it's going to evaluate those using the judge.

[34:16] Now, actually, you could increase the number of threads here.

[34:20] Gemini's API is able to take much more.

[34:23] You could probably even increase it up to 128 if you wanted.

[34:26] And I think I have an issue with the kernel because I can see my GPU memory is not working.

[34:33] That may be because I just stopped it.

[34:36] During processing.

[34:38] Sometimes if you stop VLLM midway, you just run into these issues.

[34:43] So maybe I should have let it run out.

[34:46] But I didn't feel like that because it was printing too much debugging.

[34:52] So we'll run it again here.

[34:53] And just while we're waiting for that running, I'm actually running the comprehensive data set as opposed to a manual data set that I curated.

[35:02] So this should actually be moved down.

[35:06] You can see evaluation is starting there, by the way.

[35:08] And I can probably just move it in this way here just by clicking the down button.

[35:15] And we can now run that full evaluation.

[35:17] Now, this is a section for a manual data set I curated.

[35:22] But we're running the comprehensive data set.

[35:24] You can see here I ran it previously on the Mistral model.

[35:27] And if I just paste in a copy of this, I'm adding in test equals false to make sure we don't have too much debug logs.

[35:37] And you can see now we're processing the prompts.

[35:39] So 32 prompts.

[35:41] And I think I actually should have put that with a lowercase test.

[35:49] Either way, we can probably wait for it to complete and then rerun it so it doesn't print out all of the logs here.

[35:56] So it's going to print out 32 of the answers.

[36:00] And here you can see it making those parallel calls to Gemini.

[36:06] And we're running basically 96 different prompts here because we're running this three times.

[36:11] So we're running the evaluation three different times.

[36:14] And you can see here the data set, the eval split, and then the model name that we're running.

[36:20] Now, I'm just going to run this again so it doesn't print the verbose logs by setting test equals false.

[36:26] And in the meantime, we can take a look at some of the results.

[36:28] So here with Mistral small, when I ran this three times, I got an average of 13 answers correct.

[36:36] Somewhere around 40% overall.

[36:38] I can show you also, I think I ran on, here's some archived results, on the Qwen 1.7B.

[36:48] So the Qwen 1.7B, I scored five.

[36:52] So about half the amount, five instead of 13.

[36:56] And I can show you also, let's see, do I have any other results down here?

[37:01] Gemma 3, 4B.

[37:04] And Gemma 3, 4B scores nine.

[37:07] So yeah, the small Qwen model scoring about four.

[37:10] That's including reasoning, by the way.

[37:12] Sorry, five.

[37:12] Gemma 4B scoring about nine.

[37:15] Mistral scoring about 13.

[37:18] And now let's go up and see how we score with the Phi model.

[37:21] I expect, I don't know, maybe something like the Gemma 4B.

[37:25] Somewhere around nine.

[37:26] Okay, we got six.

[37:29] Then the next one, we got four.

[37:31] So that just shows you the variance.

[37:32] And that's why it's valuable to run multiple times.

[37:35] And then the last one here, we've scored six.

[37:41] So on average, we're scoring five.

[37:43] So actually, this model, the Phi mini, is not much better than the Qwen 1.7B.

[37:50] So what this does is it gives us a baseline.

[37:53] We've got 5.3 correct.

[37:56] We're now going to run fine tuning.

[37:57] And we'll come back and run this again.

[37:59] And we're going to see if we get an improvement in the performance.

[38:03] Now, improving performance is not trivial.

[38:06] It's not obvious that we will improve just by doing this fine tuning.

[38:09] I have not done augmentation on this data set.

[38:12] It's a raw set of answers that were generated by Gemini Pro, which may not match, probably

[38:17] doesn't match the kind of logits or the probability pattern of the model that we're training here.

[38:21] So I could definitely do a better job of improving this data.

[38:25] So I'm not sure we're going to improve performance here by fine tuning.

[38:27] But at least I'll be able to show you how the scripts all work.

[38:30] So we'll move on down here.

[38:32] In fact, I'm going to minimize this section on evaluation.

[38:35] And we'll move to the fine tuning section.

[38:40] Now, for running fine tuning, we're going to need to use unsloth.

[38:44] And we're going to uninstall VLLM to make sure we don't have conflicts.

[38:48] If you want to speed up the fine tuning, you can install flash attention.

[38:53] But it does give issues with QUEN.

[38:55] So that's just a little warning.

[38:58] And if you want to use it, when you load the model, you need to add this here, attention implementation.

[39:02] So I'm just going to run this cell.

[39:05] I'm not going to install flash attention for now.

[39:07] If I'd restarted the kernel, it would have gotten rid of these warnings.

[39:11] That's OK, though.

[39:11] It's going to still correctly install.

[39:15] And I'm going to now restart my kernel.

[39:16] And we should have unsloths installed.

[39:19] Now, just two troubleshooting things.

[39:22] If you find that there's a conflict with Torch Vision, you do not need Torch Vision, I think, anymore with the latest version of Transformers or unslots.

[39:29] So you could just uninstall it.

[39:31] Also, if you have issues with unslots, cut cross entropy, unslots has a custom cross entropy that saves on memory, I think, maybe on compute.

[39:39] You can disable it if there are issues by running this line here.

[39:42] OK, so I've restarted the kernel.

[39:46] And I should still be logged in.

[39:48] Actually, it looks like maybe I'm not fully logged in.

[39:53] So I'm going to go across and get a token.

[39:56] And the reason I want to be logged in here is so that I can push models up to hub.

[40:01] Or access private models.

[40:03] Now, the model that we want to train is going to be the Phi model again.

[40:08] So I need to populate that.

[40:11] And, yeah, we can set the max sequence length to 8000.

[40:16] We're going to fine tune in 16 bits, which I recommend.

[40:19] But 8-bit, I think, actually is not bad.

[40:21] So if you wanted to save memory, you could do that.

[40:23] We're going to use this data set.

[40:25] And we're going to set the name of the question column, the criteria column, and the answer column.

[40:30] So we've done that right here.

[40:31] And if you wanted, you could use a different data set for eval by setting this here to the load data set right here.

[40:43] OK, so we're going to set these variables.

[40:48] I've got this little helper function.

[40:50] This is just a function to clear CUDA.

[40:52] If you've loaded a model and you want to reload a model, you can clear out the model that's there already.

[40:57] So we just create this little helper function.

[41:00] And now we're going to load the model.

[41:02] So I'll run this here.

[41:04] And I'm going to print out the padding side.

[41:07] Typically, for fine tuning, you will want to pad on the right-hand side.

[41:13] Or you'll want to use whatever the model's default is.

[41:16] But for inference, using VLLM, you typically want left padding.

[41:20] So just a note there.

[41:22] We're going to print this out and see what happens for phi.

[41:24] Now, one slot here is downloading the model, which I do not want.

[41:31] Because the model should actually already be downloaded.

[41:34] So what I need to do is potentially set this cache directory and retry.

[41:43] And again, it's downloading the model here.

[41:47] So I'm going to restart the kernel.

[41:49] And let's just check that we have the name of the model correct.

[41:53] 5-4.

[41:55] Yeah.

[41:55] It could be that Unsloth is downloading it from its own repo.

[42:03] Because Unsloth has got a version of all these models.

[42:07] So I'm not entirely sure.

[42:10] But I suspect that may be what's happening.

[42:12] If I look at the 5-4 mini model here.

[42:15] This is the original model.

[42:18] But if I copy this, there's probably an Unsloth version.

[42:22] Yeah.

[42:23] Unsloth has got this version here.

[42:25] And it may just be defaulting to this.

[42:28] So that's why it's actually downloading the model.

[42:30] Even though I've downloaded it already for VLLM.

[42:35] That's okay, though.

[42:37] We can keep going.

[42:38] Allow it to download.

[42:41] And then we'll see what padding side is default.

[42:44] And we'll also print out the model.

[42:45] Okay.

[42:47] So pretty fast.

[42:47] Yeah.

[42:48] It's not using a padding token.

[42:50] So actually, Unsloth is automatically setting it to the end of text token.

[42:54] And it's using the left-hand side for padding.

[42:57] Which should be fine.

[42:59] And you can see the model architecture here.

[43:01] 31 layers.

[43:01] The MLPs.

[43:03] The attention.

[43:04] And actually, the attention is fused.

[43:06] So the Q, K, and V are fused together.

[43:09] Unsloth might decide to unfuse those.

[43:12] I'm not sure.

[43:13] We'll see what happens.

[43:15] So this here is a function just for me to inspect the size of these different modules.

[43:20] This is relevant because larger matrices, you should train more slowly, and this affects how we put on the adapters.

[43:28] And yeah, we actually need to adjust this code to work for phi because the layout is not the same.

[43:38] So if I go to chat GPT and I create a new conversation, I can say, update this code to support the phi architecture.

[43:51] And sorry, my screen is small there.

[43:54] I'm just going to paste in the phi architecture now.

[43:57] So I'll paste this in.

[43:59] And we'll print.

[44:02] Now, why am I going to this trouble to see the dimensions?

[44:06] I'm doing it because I want to know what I should set my LoRa Alpha to.

[44:10] And the LoRa Alpha should be the square root of the smallest matrix dimension.

[44:14] Basically, LoRa Alpha is kind of setting a bar for how you think about the relative training rate of the adapters, the LoRa adapters versus the main matrices.

[44:26] We're using LoRa, LoRa rank adapters.

[44:28] That means we're not going to fully fine tune all the weights.

[44:31] We're just going to put these little adapters that clip on to the main model, and we're going to fine tune those instead.

[44:36] But because they're smaller, you need to train them faster.

[44:38] The size of the adapters is determined by the rank.

[44:42] The larger the rank, the slower you want to train it.

[44:45] And that's actually scaled automatically when we set use RS LoRa.

[44:48] But you do need to set up this parameter that effectively is referencing the matrix size.

[44:55] Because it's all about the relative size of the adapters compared to that original matrix that we're going to freeze.

[45:01] Okay, so it's given us this unwrap function.

[45:05] And we will see if it works.

[45:09] So let's see here.

[45:12] Is it giving us both functions?

[45:13] Yes.

[45:15] So I think I can just paste that in.

[45:18] And if I run it.

[45:20] Yeah.

[45:21] So now we can see here.

[45:22] We've got this fused layer, which has got the smaller dimension of 3,000.

[45:30] And we've got the MLP, which has got 3,000.

[45:34] So actually what we want to do here is we want square root of 3,000.

[45:42] And I think that's going to work out.

[45:44] 32 is about square root of 1,000.

[45:46] So it should be about 1.7 times this.

[45:48] So something like, you know, 50 is going to be fine.

[45:51] And now we're going to get the parameter efficient fine-tuned model.

[45:56] This is where we create the adapters.

[45:58] So we're going to take the model, set the rank of 32.

[46:01] That should be fine.

[46:02] If you want more granularity, you can increase.

[46:05] And we are not going to fine-tune any vision layers.

[46:08] In fact, I don't know if PHY supports vision in any case.

[46:12] I don't think it does.

[46:14] I think it might just be text.

[46:16] But if you're loading something like JAMA 3, you want to set this false so you're not tuning it.

[46:21] Then we're going to decide to train the attention or the MLP modules.

[46:27] If you're training an MOE like JAMA, you typically do not train the MLP.

[46:31] You just train attention because MLP is going to be sparse because it will be a mixture of experts.

[46:36] So, yeah, that's basically your guidance.

[46:39] Additionally, if you want to train the embeddings, that's often relevant if you're trying to change.

[46:44] Here, for example, we've created a pad token or we've used an EOS end-of-sequence token.

[46:50] It's probably not necessary to train the embeddings.

[46:52] But if you redefine some new tokens or the purpose of those tokens, then you do need to train the embeddings.

[46:57] LoRa Alpha is passed here.

[46:59] We're going to use gradient checkpointing.

[47:02] That means we're not going to store everything on the forward pass.

[47:07] We're only going to recalculate when we do the backward pass.

[47:10] And that saves memory.

[47:11] We're not going to do full fine-tuning, although Unslot does support that.

[47:15] And we are going to automatically scale the learning rate of the adapters based on the size of the rank.

[47:21] So, we're applying now or we're creating these adapters.

[47:24] And then we're going to see how many trainable parameters we have.

[47:27] We have 0.4625.

[47:30] Now, these names here, the modules to save, should match what we have in the model up here.

[47:36] So, it should match LMHEAD and embed tokens.

[47:38] I think this is not actually setting them to trainable for phi because the embeddings are usually large and the trainable parameters would be much larger if we were actually training these.

[47:52] So, I don't think we're actually training the embeddings here.

[47:54] We're just training the lower adapters.

[47:57] I also want to see what modules are set as trainable.

[48:03] So, let's just ask here, give me a function to see what modules are set to trainable in the model.

[48:13] Because I suspect that because the QKV are fused, either Unsloth has to unfuse them or is not going to set them trainable.

[48:23] But I'm not sure about that.

[48:25] Maybe it will.

[48:27] So, let's add this here and run.

[48:32] And we do need to pass the model.

[48:34] Okay.

[48:35] So, yeah, it looks like the MLP layers are being trained.

[48:41] And that's pretty much it.

[48:45] Oh, the O projection.

[48:47] Yeah, that makes sense.

[48:48] But the QKV is not because it's fused.

[48:51] So, yeah, basically, Unsloth is not unfusing here.

[48:55] So, it means that we're only training one of the modules in the attention.

[48:58] We're not actually training the full thing.

[49:00] Okay.

[49:02] So, we've got the model.

[49:03] We've got the adapters.

[49:05] We're now going to load the data set.

[49:06] We're just loading here the fine-tuning data set.

[49:09] We can print out a sample question.

[49:11] We can print out a sample question from the eval data set.

[49:15] You can see here the training data.

[49:18] It's got a lot of columns.

[49:19] The ones we're using are the question and the answer for fine-tuning.

[49:24] And here's where we set that up into a prompt.

[49:26] So, the prompt is going to have a user message with the user content.

[49:30] And then it's going to have an assistant message with the assistant content.

[49:33] The user content is the question.

[49:36] The assistant is the answer.

[49:37] And here we're just going to format that as a template.

[49:40] So, yeah, this is what the FI templating looks like.

[49:45] We've got user.

[49:47] And then we've got the assistant.

[49:49] And notice here, we'll need this because actually we want to focus the training on the assistant response, not on the user question.

[49:57] So, later on, we're going to want to mark this here as the token for indicating the user response.

[50:04] And then we're going to want to mark this as the tokens or the string for the start of the assistant response.

[50:11] Okay.

[50:13] Now, we're going to start to set up the trainer.

[50:16] So, we need to set a training rate.

[50:18] We're going to try a batch size of 4.

[50:21] This model probably can fit a larger batch size because it's just 8B.

[50:26] I'm going to use 4 gradient accumulation steps.

[50:28] My virtual batch size is 32.

[50:30] We'll train for 2 epochs, one at a constant rate and one decaying with a cosine for some annealing.

[50:37] For a 3B model, we want about 1E minus 4 of a training rate.

[50:42] So, I'll put that in here.

[50:44] And we're going to define a current timestamp just for naming the model.

[50:52] And we're going to set up a run name based on the model name, the fine-tuning, data set name, the number of epochs, and the timestamp.

[50:58] We'll calculate the number of training steps, which is total number of rows of data, divided by the virtual batch size, which is the batch size times gradient accumulation, divided by the epochs.

[51:09] We'll warm up for 1% of the steps, and we'll anneal it for the last 50%.

[51:14] We'll print the virtual batch size, the total steps.

[51:17] And we're going to set this custom training scheduler.

[51:21] Basically, it's going to be constant here when we're below the start of annealing, and then it's going to follow, I think, either a linear or a cosine drop from there.

[51:30] Yeah, it looks like a linear drop then in learning rate down towards the end, which should hopefully smoothly bring us down towards a local minimum.

[51:38] Okay, so virtual batch size 32, 14 steps.

[51:42] Zero warm-up steps, because we don't have enough steps for 1% to be meaningful, and the annealing will start at step 5.

[51:50] Now, you could warm-up maybe a little bit more.

[51:52] For example, where do we have the warm-up?

[51:55] Yeah, 0.01.

[51:57] I mean, we could make it 0.05, and make it 5% of total steps.

[52:02] And now, we've still got 1.

[52:04] It looks like I would need to make it larger.

[52:07] I'd need to make it even, maybe 0.1.

[52:09] Yeah, so now we've got 1 training step.

[52:13] I don't like that, though, because I think 10% is too much, generally, for training, or for warm-up.

[52:17] So, I'm just going to leave it, and we'll have no warm-up.

[52:19] That's fine.

[52:20] Okay, so now we're going to set up some of the arguments.

[52:24] We'll pass the training batch size.

[52:25] We're going to use that same batch size for the evaluation batch size.

[52:29] Gradient accumulation steps.

[52:31] Epochs.

[52:32] We're going to log every number of steps.

[52:34] In fact, we're going to log every 5% of steps, but no less than every single step.

[52:41] No less than every 10 steps.

[52:43] Sorry, no more than every 10 steps, I think.

[52:46] Let's see.

[52:46] Minimum.

[52:47] Minimum of 10.

[52:49] Yeah.

[52:49] No less than every 10 steps.

[52:51] We're going to evaluate based on steps every 10 steps, but no more than every single step.

[52:58] And what else here?

[53:01] Yeah, we're going to use gradient checkpointing.

[53:03] We'll use re-entrancy.

[53:04] This allows you to speed up.

[53:07] It speeds up the calculations.

[53:08] It's a bit more complicated.

[53:09] So, sometimes can give errors, sometimes on QEN models, although it worked for me with

[53:14] QEN on QEN 3.

[53:14] So, we're going to leave that to 3, to true rather.

[53:18] And yeah, we'll pass in the max sequence length.

[53:22] And now we'll pass all of these parameters into the trainer.

[53:26] They're going to be passed in here, along with the training and eval set, the model and tokenizer,

[53:30] and the formatting function that we defined up earlier.

[53:32] Now, here's an example of where transformers is different.

[53:36] I should have showed you above, but I will in a second.

[53:38] Normally, what you would do to set the optimizer is you would set the scheduler here,

[53:44] and you would set it equal to the optimizer and the scheduler.

[53:49] But that is not possible to do with unsloth, so you have to actually retrospectively set the optimizer here and also make sure that it doesn't get overwritten by later steps of unsloth.

[54:02] So, this is necessary because I'm using a custom scheduler, so I have to make sure that scheduler is actually being applied here.

[54:09] I can maybe show you very quickly if I go to my windsurf and I look at the transformers code.

[54:18] When I go to the trainer and the optimizer, you can see it's just being passed in here.

[54:23] Optimizers equals optimizer scheduler.

[54:25] That won't work with unsloth.

[54:26] While we're at it, just one difference here as well.

[54:29] When we're loading the model with transformers, we will load it like this with auto model for causal,

[54:36] but this will not work for loading a multimodal model, so it won't work for JAMA 3.

[54:40] It might work for JAMA 3.4b because that's text only, but for the larger it won't work.

[54:46] It won't work as well for, I think, Mistral.

[54:49] You need to use a different and specific way to load it, whereas unsloth has wrapped things in a way that you can use the same loading for every model.

[54:58] So, if we look at unsloth here, when we load the model, pretty much you can pass any model and it's going to load correctly.

[55:06] just by using, let's see, this fast language model, so it actually supports multimodal models, so that's quite a nice feature.

[55:14] One other difference as well is when we do the PEFT, there's a slight difference in loading the PEFT, the parameter efficient fine-tuned model.

[55:25] Here you use fast language model, get PEFT model for unsloth, and you have these kind of wrappers that allow you at a high level

[55:32] to control vision versus language and control attention versus MLP.

[55:36] Whereas if you're looking at transformers, when you get the PEFT model, it's a little bit more raw, which is kind of beneficial because you can target specific modules to turn on.

[55:47] This is the get PEFT model, it's not fast language model, and I think also maybe this might work in the case of FI if you're using transformers.

[55:55] Okay, so we have loaded, have we loaded, yeah, we've loaded the model, we've defined all of our training.

[56:03] I need to make sure I run these cells because I'm going through explaining things to you without checking I've run everything.

[56:11] So we've now defined the trainer, and we have one more thing we need to do, which is we need to define how we're going to train on completions only.

[56:22] And we need to define this for FI, so I'm going to create this, and I want to pick FI here.

[56:35] And I want to put in the correct start and end of the chat.

[56:39] And to help me here, I can just print out the chat template.

[56:43] And I can see it's actually a bit hard to read exactly what I need to include there.

[56:52] It's probably more helpful if I inspect some of the templated text, which says user.

[56:59] So I'm going to copy all of this here, and I'm going to paste.

[57:04] And this here is going to be my instruction.

[57:08] I'll give it a second, I think I've lost connection to RunPod.

[57:12] I may just need to rerun some of the cells.

[57:14] And then the end portion is this here.

[57:20] Okay, so I'm just going to save this, and I'll download a copy just in case, so I don't want to lose my work here.

[57:28] And I think everything will be fine.

[57:31] I might just need to rerun a few of the cells.

[57:34] I can actually just restart the kernel, go back to the start of the fine-tuning here, and run all of these cells.

[57:42] And the reason why I'm doing this chat completion setup is because I just want to train on the completion part.

[57:50] And it's very nicely illustrated.

[57:52] When you run these two cells, it's going to show me an example from the training set, example 0.

[57:57] It'll show me the full training row as it's going to be passed into the trainer.

[58:02] And then this is just going to show me which part we're going to train on.

[58:05] So the loss will only be calculated for those for the assistant tokens.

[58:08] And this is typically the recommended way to train.

[58:11] So yeah, this is the full being passed in.

[58:15] And you can see that all of this part here is masked when we're training.

[58:22] So we're only going to train the loss on this last portion.

[58:24] And it looks like everything is printing fine here.

[58:29] It doesn't print out the end of text token, which I think is fine.

[58:35] It's important that it does have at least one end of sequence token being generated here.

[58:40] Okay.

[58:42] So next, we're going to start the training.

[58:46] We'll print out the stats.

[58:48] We will make sure we have the data sets and start the training.

[58:52] It looks like we've an error.

[58:54] So what's happening here?

[58:58] So I'm not entirely sure what the issue is here.

[59:01] I can check with chat GPT.

[59:03] My inclination is that we may turn off the re-entrancy possibly.

[59:10] Well, let's see.

[59:11] And possibly using all three, by the way, would be a better idea.

[59:16] Yeah, it wants us to disable Torch Compile.

[59:18] I could just try disabling the Torch Compile, but it's not obvious where exactly I would disable that.

[59:28] Yeah, because I'd have to get into the unsloth code if I want to do that.

[59:33] So yeah, we might have an issue here.

[59:37] Can I somehow disable compiling like this?

[59:42] Yeah, I'm not entirely sure.

[59:44] It may be hallucinating here.

[59:46] But let's go back.

[59:48] And when the model is imported, we need to make sure we do this right at the start of the script.

[59:53] So when we import the OS, we're going to disable if trying to tune phi for mini.

[1:00:03] I'll restart the kernel and try to run this.

[1:00:11] I'll comment that out so it doesn't spam everyone.

[1:00:14] And we'll see if this works.

[1:00:16] Maybe it won't, and we'll just go to another model.

[1:00:18] But yeah, this is an example where you might want to use the Transformers script to get things to work.

[1:00:25] Okay, so that did work.

[1:00:27] We managed to disable compile.

[1:00:31] And now we're training.

[1:00:32] And the loss is looking good.

[1:00:35] You see the training loss is falling.

[1:00:36] The validation loss is falling as well.

[1:00:38] So everything is looking pretty good here.

[1:00:40] And we're going to save this.

[1:00:42] Let me get a model name that I want to save with.

[1:00:47] I want a better run name than what we have.

[1:00:52] So it's going to save it like this.

[1:00:55] TouchRugby.

[1:00:59] And let's just put that in as a name here.

[1:01:02] Okay, so we've got phi4.

[1:01:05] We'll name it.

[1:01:06] And yeah, I mean, we could push it to hub.

[1:01:09] Why don't we do that?

[1:01:11] See if this works.

[1:01:13] And run this.

[1:01:15] Org is not defined.

[1:01:17] I don't know why I uncopied that there.

[1:01:19] There we go.

[1:01:22] And print the run name.

[1:01:24] And in the meantime, let's check out the logging.

[1:01:27] Let's check the logs.

[1:01:29] So to check the logs, it's easiest to connect via SSH.

[1:01:32] I'll just copy this here.

[1:01:34] Go over to Windsurf.

[1:01:35] And then in my terminal, I'm going to SSH.

[1:01:39] I actually need, this is my SSH file, which is in, if you go to .ssh directory on your computer, you should find one.

[1:01:50] But you need to create an SSH key and put the public key into RunPod.

[1:01:56] And then you should be able to connect.

[1:01:57] And once you've connected, you can then start up TensorBoard.

[1:02:05] So yeah, pip install UV, install TensorBoard, move to the workspace, and then run in order to see the logs.

[1:02:14] And if this works and it's up and running, we should then be able to access it via the RunPod URL.

[1:02:21] Yeah, so it's up and running.

[1:02:23] I think I could click this, could I?

[1:02:25] Don't think this is going to bring me to the right page, though, because this will just bring me to localhost, but that's not going to be accessible.

[1:02:34] I need instead to go to my RunPod, pod ID here.

[1:02:38] And actually, the address I need to go to depends on the pod ID, which is going to be this.

[1:02:45] So paste here, copy.

[1:02:48] And now I can check TensorBoard.

[1:02:53] So basically, I'm porting in, because RunPod allows me to port into this.

[1:02:58] And the run we just did, two of these failed.

[1:03:01] So I'll just show the ones that passed.

[1:03:04] And the eval loss looks beautiful.

[1:03:06] Take off smoothing.

[1:03:07] It's falling.

[1:03:09] And you can see it's kind of asymptoting.

[1:03:11] So we're kind of getting down to the best point there.

[1:03:12] The gradient norm is a bit high at the start, but it's good.

[1:03:15] It's below 1 then.

[1:03:16] The learning rate is flat, then declining.

[1:03:19] This looks good.

[1:03:20] And the training loss is flat and declining.

[1:03:22] So this is all excellent.

[1:03:23] So everything looks great in terms of these curves here.

[1:03:28] And now if we go back, we should have by now pushed the model.

[1:03:34] So the model's been pushed.

[1:03:35] That's excellent.

[1:03:35] And it's now time to inference this model.

[1:03:39] So I'm going to go all the way back up to the script.

[1:03:42] And I'm going to restart the kernel.

[1:03:45] And I'll close down this fine-tuning section and reopen the val.

[1:03:50] And I'm going to run these installs.

[1:03:52] It should be fast.

[1:03:53] But actually, we should uninstall unslot this time.

[1:03:56] Because it can cause some conflicts.

[1:03:59] So let's uninstall that.

[1:04:04] Make sure we're logged in to HuggingFace.

[1:04:06] And this time, we're going to use the fine-tuned model.

[1:04:09] Model slug.

[1:04:10] And fine-tune.

[1:04:13] Set up the judge.

[1:04:16] I won't have to re-enter my key because it's saved to environment variables.

[1:04:19] Load the model.

[1:04:21] Set up evaluation.

[1:04:23] Run that eval.

[1:04:24] We'll inspect that later.

[1:04:25] Run batching.

[1:04:27] And yeah.

[1:04:29] I'm just going to put test equals false here so that we don't accidentally leave testing on.

[1:04:35] And it looks like we have an error.

[1:04:38] So something must have failed earlier here.

[1:04:41] Yeah.

[1:04:43] That actually needs to be Trelis.

[1:04:47] So my org ID, when I put the model in, should be this.

[1:04:54] The other thing I should maybe have done is saved it locally so I wouldn't have to re-download.

[1:04:59] But that's okay.

[1:05:00] So yeah.

[1:05:02] It's going to probably have to download the model.

[1:05:03] Which is a bit of a duplication, but that's all right.

[1:05:07] Yeah.

[1:05:08] And when you do install, you need to restart the kernel.

[1:05:13] So I forgot about that.

[1:05:14] So yeah.

[1:05:15] I re-ran the install.

[1:05:16] Then you restart the kernel.

[1:05:18] And that's why it's a bit tedious swapping between VLLM and Unswell Author Transformers.

[1:05:23] But on the other hand, the eval is going to be really fast.

[1:05:26] And yeah.

[1:05:27] We're running with VLLM here.

[1:05:29] But we have an issue.

[1:05:30] So I'm just going to copy that code.

[1:05:34] Go back to the old trustee here.

[1:05:37] And see what it says.

[1:05:39] Again, I'm using 4.0 here.

[1:05:41] I should probably be using 0.3.

[1:05:43] But let's see what happens.

[1:05:44] Yeah.

[1:05:46] So I'm not going to be able to override it like this.

[1:05:49] And this is probably the same issue that is happening with Gemma.

[1:05:53] With Gemma 3.

[1:05:54] Basically, the configuration of the model is not matching what VLLM expects.

[1:06:02] And so what we can do is go back to find the model name, which is this one here.

[1:06:09] Find it on Hugging Face.

[1:06:12] And then also, if I go to Hugging Face.

[1:06:16] And then, if I go to the configuration file.

[1:06:18] And then, if I go to the configuration file, I'm going to go to the configuration file.

[1:06:22] And then, if I go to the configuration file, I'm going to go to the configuration file.

[1:06:26] And I'm going to go to the configuration file.

[1:06:26] So, yeah, that's the configuration file.

[1:06:27] And let's see the configuration file here in the FI original.

[1:06:34] And does this look the same?

[1:06:39] Okay.

[1:06:41] Check out the right models, which I do.

[1:06:43] So, everything here looks very similar.

[1:06:49] Interestingly, the tokenizer is a bit different.

[1:06:51] And the auto model config is a little bit different.

[1:06:57] So, the auto map is maybe a little bit different.

[1:07:00] And, yeah, the architecture is different as well.

[1:07:03] So, I wonder if I take this here.

[1:07:06] And if I paste that in.

[1:07:10] If that's going to help.

[1:07:12] So, I'll copy this.

[1:07:14] Edit.

[1:07:15] Just check in case there's much different down here.

[1:07:19] Yeah, there's also this.

[1:07:20] Okay, that looks similar.

[1:07:23] This looks similar as well.

[1:07:26] So, let's just take this.

[1:07:28] Edit the file.

[1:07:31] And replace from here.

[1:07:33] And I'm just going to copy this over.

[1:07:36] I'm going to save that just in my notes.

[1:07:39] Just in case I want to re-inject it later.

[1:07:43] But for now, let's just match what the original is.

[1:07:49] And this is the original we want to match.

[1:07:51] And we'll commit those changes.

[1:07:53] Okay.

[1:07:55] So, let's see if that does anything.

[1:07:58] It may or may not.

[1:07:59] We may not get to the bottom of this.

[1:08:00] Restart the kernel.

[1:08:03] And let's try and run it.

[1:08:06] And as I said, it's not a guarantee at all that this is going to work.

[1:08:11] If it doesn't work, I'll show you eval with another model.

[1:08:14] In fact, you've already seen eval.

[1:08:16] So, this is really just a question of whether we can get it working.

[1:08:20] It doesn't look good here.

[1:08:22] Yeah.

[1:08:22] It doesn't look good here.

[1:08:23] So, this is an example of where you may decide it's worth running with transformers.

[1:08:33] And what I'll do is I'll just save this here.

[1:08:37] I'm going to rename it.

[1:08:41] And I'll put it as 5.4 mini.

[1:08:43] And I'll download this.

[1:08:46] And I will put it, for those who have access to the repo, I'll save it so that you are able to take a look.

[1:08:55] I'll put it into the fine-tuning folder here.

[1:08:57] And I'll push it up.

[1:08:58] But, for now, what I will just, I'll just quickly show you using the transformers script.

[1:09:04] Because transformers should work here, given it's a text-only model.

[1:09:08] So, if I go to fine-tuning, fine-tune.

[1:09:11] And if I upload the transformer script here.

[1:09:14] We can probably run a pretty fast fine-tune.

[1:09:18] Starting off with the fine-tune.

[1:09:22] We won't even run the eval first.

[1:09:24] We'll just run the eval afterwards.

[1:09:26] So, let's just run through, very quickly, the transformer script.

[1:09:34] And we will make use of variables where we have to, like this.

[1:09:41] So, we'll use this as the base model.

[1:09:45] No, we're not going to use the fine-tuned one.

[1:09:47] We're going to use this one here.

[1:09:49] So, the 5.4 mini instruct.

[1:09:53] Everything else is pretty much the same.

[1:09:55] We're going to load the model.

[1:09:58] I don't actually think I want to set the padding to right.

[1:10:04] I'm just going to print tokenizer padding side.

[1:10:07] And yeah, we may actually need to manually set the token here because onslaught normally does this.

[1:10:16] And that's probably pad token is equal to a tokenizer eos token.

[1:10:23] And we'll print tokenizer pad token.

[1:10:28] So, yeah.

[1:10:31] I'm actually going to clear the GPU and reload it.

[1:10:34] And yeah, you can see that's interesting.

[1:10:39] So, onslaught seems to be setting the padding side to left.

[1:10:43] We can check it for when we run it here.

[1:10:46] This is evaluation.

[1:10:48] But let's just check in the fine-tuning.

[1:10:51] Yeah.

[1:10:52] Onslaught is setting it to left, whereas the default is actually right.

[1:11:00] So, I'm going to leave it to right.

[1:11:01] Print the padding token.

[1:11:04] I have set the pad token here because I think if I print the pad token, let's just do this.

[1:11:12] Pad token before setting to eos.

[1:11:17] And yeah, I'll just reload it once more.

[1:11:19] So, pad token before setting to eos seems to be the eos token.

[1:11:25] And that implies we don't need to do this.

[1:11:29] So, yeah.

[1:11:30] I'm not sure what approach onslaught was taken, but it seems like there is actually a pad token.

[1:11:37] And so, we don't need to set it.

[1:11:39] Okay, fine.

[1:11:40] We'll print the model here.

[1:11:43] Unwrap to base.

[1:11:45] We actually need to update that because we want to be able to unwrap the phi model.

[1:11:50] And yep, we've got QKV.

[1:11:55] And we're going to increase the lower alpha here a little bit because these are fairly large matrices.

[1:12:01] And yeah, we're going to target the Q.

[1:12:04] We're going to target the O.

[1:12:06] We're not going to target QKV.

[1:12:08] We're just going to target O.

[1:12:09] And we are going to try and target gate upproj and downproj.

[1:12:15] So, we want to add in here gate upproj.

[1:12:20] And we want downproj as well.

[1:12:26] So, something like this.

[1:12:28] Now, we should check just to make sure that we are training everything possible.

[1:12:32] Yeah, gate upproj.

[1:12:34] This is probably also a combination, which I don't love.

[1:12:40] Let's see when we fine-tuned here.

[1:12:42] Oproj, downproj.

[1:12:44] Yeah, it looks like onslaught is separating out the downproj, downprojection, which is good.

[1:12:51] Whereas, I'm not going to be.

[1:12:53] Yeah, the down, sorry, down.

[1:12:55] Yeah, it's not training gate up.

[1:12:56] So, we're actually only training limited numbers of layers.

[1:13:00] So, when I train, select here what to train, gate upproj is not going to be trainable because that's fused.

[1:13:10] Well, it might be, but that's not what I'm going to do.

[1:13:14] And, yeah, so I'm just going to train these ones for phi.

[1:13:17] And let's see if, yeah, so you can see the difference.

[1:13:21] This is actually working here in Transformers, so it is training the embeddings, and that's why we're training 24% of the parameters.

[1:13:28] So, yeah, in onslaught, this doesn't seem to be actually working for now.

[1:13:32] Okay, we load the datasets, the formatting function, and yes.

[1:13:40] So, there's a difference here.

[1:13:43] In onslaught, you'll remember that after we set the trainer, we then adjusted the mask so that we only train on completion tokens.

[1:13:52] And with Transformers, the way I have it set up, I'm actually doing that beforehand.

[1:13:56] And you can see I need to copy over this code here so that we can select phi.

[1:14:05] And now I have a mistake here.

[1:14:10] So, I need to possibly reload my training data.

[1:14:14] Yeah, so this is equivalent.

[1:14:17] You can see I'm masking everything up until the end of the user response, and then I want to keep and train on the response here.

[1:14:30] So, everything looks good here.

[1:14:35] It's just that I'm tokenizing and masking my dataset before I pass it into the trainer, whereas with onslaught,

[1:14:44] you can tocalize afterwards because onslaught has got a built-in function.

[1:14:48] So, this is essentially equivalent.

[1:14:52] We're just going to train on this portion here.

[1:14:54] And now I need to make a few adjustments.

[1:14:57] Training batch size of 8, 4, 2 epochs, training rate of 1, e-4.

[1:15:06] And we will run with the valuation.

[1:15:10] Everything here is the same.

[1:15:11] Technically, this notebook supports distillation as well, although I haven't run it recently.

[1:15:16] You can check out the distillation video if you want.

[1:15:18] And we're going to now move to training.

[1:15:21] That would be interesting.

[1:15:23] Yeah, we don't have any issue with Torch Compile.

[1:15:25] I don't know if that's because Transformers doesn't use Torch Compile.

[1:15:30] But everything looks to be training fine here.

[1:15:34] Probably our logging, we should be logging more frequently instead of just logging every five steps.

[1:15:41] But that's okay.

[1:15:42] It won't make a difference to the results.

[1:15:45] And we are going to push this model up to Hub and hope for some better results than what we got with Unsloth.

[1:15:51] So, I'm going to copy this run name here, paste it in here, comment.

[1:15:57] And we are going to merge the model, save it, and push it up.

[1:16:07] And, by the way, this time we are saving it locally, so I should be able to just run the model locally.

[1:16:13] And while that's working, I will just get ready to run a valuation here by putting in my model name.

[1:16:20] So, the dataset we're going to run is this, or rather the model we're going to run is this one.

[1:16:26] And I need to be careful not to run that until it's actually pushed.

[1:16:31] Okay, I'll just give it a moment to push that model to Hub.

[1:16:36] Yeah, just while we're waiting for that, we can check out TensorBoard and refresh.

[1:16:40] And you can see the two runs here.

[1:16:42] And, yeah, it's interesting.

[1:16:45] The training and learning rate are the same.

[1:16:49] The grad norm is a little bit higher for transformers.

[1:16:52] And the eval loss, we're not logging as frequently, so it's hard to exactly compare.

[1:17:00] But it does look like the eval loss is a little bit higher for using transformers.

[1:17:06] And that's not something I would expect.

[1:17:09] Because I would think that they are basically doing the same thing.

[1:17:15] And I think we have the same batch size.

[1:17:16] Possibly, oh yeah, we're training embeddings.

[1:17:21] So, actually, this makes sense.

[1:17:23] Because we're training embeddings, we are able to control more parameters.

[1:17:26] And, yeah, you would expect that would maybe help to give a lower loss.

[1:17:32] But I guess it isn't helping here.

[1:17:33] It's hard to say too much because this is just one run.

[1:17:36] But that is the main difference between the two runs is that we're training the embeddings in one.

[1:17:41] And maybe it's better not to train the embeddings.

[1:17:43] It's maybe a tentative conclusion here.

[1:17:45] So, the model is pushed.

[1:17:46] We're going to restart the kernel.

[1:17:48] Kernel, restart kernel.

[1:17:50] And now we will close down the fine-tuning section.

[1:17:55] Scroll to the top of the evals.

[1:17:58] Make sure we run the installs.

[1:18:02] And then we're going to run all of the evaluations.

[1:18:06] I'll restart the kernel after doing the installs.

[1:18:10] And now proceed to evaluate.

[1:18:12] So, we're hoping to have more luck this time.

[1:18:15] Prepare the data set.

[1:18:18] And let's see if the model will load.

[1:18:20] If it does load, we should get out some answers.

[1:18:25] Okay, so, yeah, this time everything is loading correctly.

[1:18:28] So, we don't have the same issue as we did with unsloth.

[1:18:31] And you can see, unfortunately, I wasn't able to fix the configuration file for making the unslot trained model work with VLLM.

[1:18:38] But this eval is going to work.

[1:18:40] So, that's good news.

[1:18:41] We'll continue running these cells.

[1:18:44] Set up the batch evaluation.

[1:18:45] And then we will evaluate on this comprehensive Touch Rugby set.

[1:18:53] And let's see what happens here.

[1:18:57] Just make sure that test is equal to false.

[1:19:00] And we're trying to beat our baseline score, which was about 5 or 6, if I remember correctly.

[1:19:07] Let me just open up and see what it was we got.

[1:19:10] So, that's the Mistral small results.

[1:19:15] And we want the Phi results, which is this one.

[1:19:18] So, we've got to beat 5.33.

[1:19:21] As I said, I'm not sure we will, just because I haven't spent a lot of time on Dataprep and augmenting.

[1:19:28] So, we may or may not do better.

[1:19:30] And 7.

[1:19:32] Okay, so we're doing better.

[1:19:33] So, we're up from...

[1:19:35] That's just one, but let's see if we run it again.

[1:19:37] 8.

[1:19:38] 7.

[1:19:39] Okay.

[1:19:40] So, definitely the fine-tuning has improved.

[1:19:42] The results here.

[1:19:43] We've gone from 7.3 with a variance of 0.44 up from 5.3.

[1:19:49] So, we are seeing a positive effect here of fine-tuning.

[1:19:52] We're, of course, a long way off getting all of the answers correct.

[1:19:54] So, there's quite a bit of work to do in terms of improving this model's performance up to where we would want it.

[1:20:02] But you can see how we ran the eval, we ran the fine-tuning, and we ran the eval again, and managed to get everything working.

[1:20:10] So, I'm going to save the latest copy of this file here, and I'll download it, and it'll be uploaded to the fine-tuning folder for those of you who want to run with the PHY model.

[1:20:23] All right, folks, that's it.

[1:20:25] I ended up doing quite a detailed and realistic run-through of the problems you go into when you're trying to fine-tune.

[1:20:32] Hopefully, you have more appreciation for transformers versus unsloth, and also the benefits of doing the VLLM approach for evaluation.

[1:20:40] It really is a lot faster than having to wait for results if you're going to run inference with unsloth or transformers.

[1:20:46] The scripts are in the advanced fine-tuning repo.

[1:20:49] You can purchase access by going to trials.com forward slash advanced dash fine-tuning.

[1:20:54] And I do plan on building on this further to help fine-tune for reasoning, particularly in verbal, which are non-quantitative type applications, where it's more difficult to do the reasoning type fine-tuning.

[1:21:06] As usual, let me know if you have any questions below in the comments.

[1:21:10] Cheers, folks.

⚡ Saved you time reading this? Transcribe any YouTube video for free — no signup needed.