LLAMA-3 🦙: EASIET WAY To FINE-TUNE ON YOUR DATA 🙌

0h 15m video Transcribed Jun 30, 2026

101.3K

Views

1.6K

Likes

102

Comments

65

Dislikes

1.6%

📊 Average

✂️ Creator Tools: Viral Hooks

AI-generated clip ideas for Shorts based on the transcript

Fine-Tune LLaMA 3: Your Own Model!

45s

High curiosity and educational value: viewers want to know how to customize a powerful open-source model.

▶ Play Clip

Unsloth: 30x Faster Training?

60s

Controversial speed claim sparks interest and debate among AI enthusiasts.

▶ Play Clip

Format Your Data Like a Pro

60s

Practical, step-by-step guidance on data formatting is highly valuable for learners.

▶ Play Clip

Training on a T4 GPU: Only 6GB VRAM!

60s

Impressive optimization demo shows fine-tuning is accessible even with limited hardware.

▶ Play Clip

Save & Use Your Fine-Tuned Model

60s

Clear, actionable instructions for deployment satisfy viewers' desire to implement what they learned.

▶ Play Clip

Full Transcript

Download .txt Download .md

[00:00] Lama tree is an amazing open weight model, but you know what's better than Lama tree. Your own fine tune version of Lama tree. If you want to fine tune Lama tree and all your own dataset, you have a number of options.

[00:13] For example, you can do that using auto train. If you want more advanced features, you can use X a lot. Lama factory is another amazing option. And then you have onslaught, which promise up to 30 times faster training on their paid

[00:30] version. I'll be creating a series of videos on how to fine tune Lama tree using this different tools, but for this video, we are going to start with onslaught. Okay, so we're going to be using their official notebook.

[00:45] It's probably one of the best out there because it covers everything and 10 in a very user-friendly way. You can just walk you through the notebook. You can actually fine tune a lot of other models, not just Lama tree.

[01:00] And we're going to look at some of the options. Okay, so first you need to install all the required packages. So you can run this on your local machine as well. It doesn't have to be a good collab book, but for that, you'll need to have an Nvidia chip.

[01:14] I don't think it has support for Apple Silicon yet. So essentially here, we're just cloning the GitHub repo of onslaught and then depending on the type of hardware you have, it will install different types of packages.

[01:29] Okay, now next we need to just set up some training parameters. So we first need to import the fast language model class from onslaught.

[01:42] Then you need to set up your max sequence length, Lama tree out of the box supports up to 8,000 tokens, but the dataset that we're using is relatively short text. So we are just using 2048 tokens.

[01:56] Now the data types, this will automatically detect if you set it to no and we are going to be using 4-bit quantization. Now under the hood onslaught uses Lora adopters to do efficient fine tuning.

[02:10] So there are two options, either you can use the version that is available on onslaught's hugging phase repo. So here they have already loaded a few models including the latest Lama tree, the Gemma model,

[02:24] the mistro 7b, these have already the Lora adopters merged to the model. So if you are using one of these, you don't need to do anything else, but if you, let's

[02:37] say, want to use one of the models from hugging phase, then you need to provide your hugging phase token ID in case if it's a gated model. So for example, for the Meta Lama 3 version, you will actually need to accept the terms of

[02:51] services and then you can just provide that. But if you use another hugging phase model, you will actually need to add Lora adopters to that model and I'll show you how to do that.

[03:03] Okay, so this section is only if you actually need to add your Lora adopters. As I said, the onslaught version already has the Lora adopters merged with the model, so you don't really need to do this step.

[03:16] But if you are using another model, just you need to define these different parameters or uncomment this section of the code and it will work pretty fine. Okay, now since you are bringing in your own data for training, so you actually need

[03:31] to format your training set. So for this example, they're using a clean version of the original Alpaca data set. So let's try to understand how this data set is structured. Okay, so the data set has three different columns.

[03:46] First one is the instruction, so this is basically the instruction going into the model, then the corresponding user input and then the output from the model. So if you were to structure your data, it needs to be structured in exactly the same way.

[04:00] You need to have instructions, then another column for input and then a column for output. Now, in this case, if you notice that in some of the cases, the input is missing, which is fine, because the instruction just tells the model what exactly the output is supposed

[04:15] to be. Right, so again, if you are formatting your own data set, make sure to follow this structure. Okay, so the rest is very simple. First we need to download the data.

[04:27] So in this case, we are downloading the data set from hugging phase. And after that, we need to map it to this format. Right, so basically, this is going to be a text string, which will take the instruction

[04:39] section of the data set and put it here. So there's going to be these special tokens for instruction, then special tokens for input. Your input will follow these special tokens and then response from the model will follow the

[04:55] special tokens for response. And this is exactly what we are doing in here. Okay, so again, we are just creating a single column where we are transforming these three different columns into this format.

[05:10] And this is one of the crucial parts. Now this is the standard alpaca data set. There are some other prompt templates as well. For example, one of the more famous ones right now is chat ml.

[05:22] It was introduced by OpenAI, so you can structure those in here if you want, right? But make sure to properly format your functions or your input examples, because that's going to be fed into the LLM for training.

[05:36] Alright, so once we do that, next we are going to set up an SFT trainer, so supervised fine tuning trainer from hugging phase. This is based on the hugging phase transformer library.

[05:48] So the SFT trainer is going to accept. The model object, this is the unstocked specific model object. Then the corresponding tokenizer, then the data set, and then we need to tell which column

[06:02] to use, right? So everything is structured in this text column. That's why I need to specify which column to use. Then max sequence length, right? Some other parameters for controlling how the training is going to be performed, including

[06:16] what optimizes to use, what is going to be the VEDDK learning rate schedule, right? So basically if you want to change your learning rate as the training evolves, this is

[06:28] actually a really good idea. There are multiple options that you can use, right? And then you need to define your output directory where you're going to store the model. If you have seen some of my previous videos on training and fine tuning other things,

[06:42] you're probably familiar with most of these options. Now one of the places where unstocked shines compared to the other packages for training is it's optimized memory usage as well as speed.

[06:56] So in this case, if you see the GPU that we are using is a T4 GPU, which is a free GPU available on Google Colab, and this is just using about six gigabytes of VRAM, whereas we

[07:11] have a total of 15 gigabytes of VRAM. Now this does goes up during training, but it's actually very well optimized if you look at it. All right, so our training object is set.

[07:24] So we need to just call the trainer train function on the trainer object, right? And here we can see that in the initial steps, the training was starts decreasing, right? And it gradually decreases.

[07:36] There are some jumps here and there, but it has a pretty nice decrease. So that means that the model is learning. Now we could play around with the learning rate as well, along with the batch size that will actually help us converge it easier.

[07:51] Another thing which I wanted to actually highlight was this thing. So you're not even running this for a whole epoch. So we're not showing actually the data, the whole data set, we're just using a smaller subset

[08:03] of it. So we're just showing it a max of 60 steps, right? So you definitely, if you want the model to learn better, you want to run it at least for an epoch or two, or at least more steps in here, right?

[08:15] But for this quick example, we just want to see whether the training actually learns anything or not, right? And it kind of shows that there is some learning that is going on.

[08:29] Okay, so let's look at some of the stats. So it took about eight minutes for training, but we ran it for just 60 steps. If you want the model to learn for longer, and I think actually learned from the data,

[08:42] you definitely want to run it for longer, right? Now the peak memory that was reserved for this training run is around nine gigabytes out of the 15, and it only used about four gigabytes during training.

[08:57] So this is pretty impressive, right? Okay. So let's say once you train the model, how would you do inference? So un-slot offers a very simple interface for that. So you just need to use the fast language model class from un-slot.

[09:12] And then you provide your model that you just train and tell it that you want to do inference on top of it, right? Then we will need to tokenize our input. Now, since we were using the Alpaca format, we also need to tokenize it using the Alpaca format.

[09:28] So the first input is going to become instruction. The second input is going to be the actual input to the model. And then the model is supposed to generate responses and we move everything to the GPU.

[09:41] So that we can use the available GPU cores to generate a response, right? And then the rest is very simple to what you do with hugging his transformer package. We call the generate function, provide the tokenized inputs.

[09:55] Then how many max number of sequence tokens that you want to generate, right? And whether you want to use cashier or not, right? So for this input, continue the FABU 90 sequence. And we are providing one, one, two, three, five, eight.

[10:10] So here's what the model actually receives. Below is an instruction that describes a task paired with an input that provides further context, right response that are properly completes the request.

[10:22] So this is basically the system instruction that is going in. Then here's the actual instruction, continue the FABU 90 sequence. Here's the input and here's the model response. Now, to be frank, the model without even training could generate the similar output to what we are providing in here.

[10:41] But this does shows that it is actually following this alpaca format. So that means it is actually learning something during training. Now you do, you can do the same thing if you want to stream the text, but in that case,

[10:54] you will just need to use the text streamer class. And then if you run this, it will generate a streaming response. Okay, so once the model is trained, you definitely want to save it somewhere.

[11:06] So you have a couple of options either you can push it to a hugging face hub, or you can save it locally. In both these cases, it will just save the lower adopters, not will merge, it's not going to merge with the model.

[11:23] So if you were to push it to hub, then you can use model.push to hub, but in that case, you need to provide your hugging face token. Okay, so now if you want to load the lower adopters,

[11:36] we just saved for infants, then just set this to true. And this will basically load the lower adopters and will merge it with your model. And then you can start using that for infants.

[11:48] So for example, here, although like we are actually using the model that was just trained, the lower adopters already merged to it. Here was another input, what is famous, what is a famous tall tower in Paris?

[12:01] So again, this is kind of the system instruction that goes in there. Here is the extra instruction that the user was just asking. And after that, the model is generous, the response with states. One of the most famous tall towers in Paris is the Eiffel Tower.

[12:16] And then it kind of goes into the lot of details of how this was constructed, right? Now a really nice thing about onslaught is that you don't actually need unslaught to do infants.

[12:30] You can use a number of other options. So for example, once the model is trained, you can use the auto model for causal at them. This is basically the path to version to actually do infants.

[12:43] So just like what we would do based on a base hugging face transform model, right? You can use exactly the same classes, but according to the unslaught authors, they say that it's going to be much slower.

[12:58] If you use this class, rather than using the unslaught, a specific class. So you definitely want to make sure that you use onslaught for infants as well. But if you want to let's say do infants using VLLM, it does have support for that.

[13:14] But you can save it in float 16 directly. And that way you will be able to use the model with VLLM. Similarly, another amazing feature is that you can directly convert the model to GGUF

[13:28] for using with Lama CPP or Olamma. And it's way easy, right? So all you need to do is just save the tokenizer. And then when you're saving the model, you need to define the quantization method.

[13:41] So here, for example, the quantization method is 16 bits. If you don't define any quantization method, but you ask it to save it as GGUF file by default, it will be saving it in 8 bit.

[13:53] And you can also define a full bit train. So this was a quick rundown of how you can train or find the latest Lama 3 model on your own data set. Using the amazing unslaught package.

[14:08] If you haven't seen it before, I'll highly recommend to actually check it out. It is one of the best option if you are constrained on GPU. Because even in this case, it was just using under 60% of the resources

[14:22] that are available on a TV, a T4 fee instance. So this is pretty amazing implementation. They had to write the kernels themselves to optimize it. And I think more and more optimizations are coming in.

[14:35] Now, if you are interested in no code platforms, I'll recommend to use auto train. I'll be actually making another video on that. That is another option if you don't want to look at all the code and try to run e2block individually.

[14:52] But it's great to see that on day zero, not only we have different packages that supports how to do inference on Lama 3, but you can also fine tune them.

[15:04] I hope you found this video useful. If you're running to any issues or you have any questions, make sure to put them in the comment section below. Thanks for watching and as always, see you in the next one.