TubeSum ← Transcribe a video

Fine-Tuning Llama 3 on a Custom Dataset: Training LLM for a RAG Q&A Use Case on a Single GPU

Transcribed Jun 16, 2026 Watch on YouTube ↗
Intermediate 12 min read For: Machine learning engineers and data scientists with basic experience in transformers and fine-tuning, interested in practical LLM customization.
55.2K
Views
1.1K
Likes
40
Comments
46
Dislikes
2.0%
📊 Average

AI Summary

This video demonstrates how to fine-tune a Llama 3 8B instruct model on a custom financial dataset for a RAG Q&A application, all on a single GPU. The process covers dataset preparation, LoRA adapter setup, training, evaluation, and model deployment to Hugging Face Hub.

[0:55]
Fine-tuning process overview

Steps include building a dataset from custom prompts, evaluating base model performance, setting up a LoRA adapter, training, evaluating on a test set, merging adapters, and pushing to Hugging Face Hub.

[2:21]
Dataset: Financial Q&A 10K

Uses the Financial Q&A 10K dataset from Hugging Face, containing ~7,000 examples with question, context, and answer columns. The dataset is financial in nature.

[2:58]
Base model: Llama 3 8B Instruct

Uses Meta AI's Llama 3 8B instruct model, quantized to 4-bit for single GPU training. It has an 8K token context length and a built-in chat template.

[3:56]
Hardware and libraries

Fine-tuning performed on a T4 GPU (16GB VRAM). Libraries used: PyTorch, Transformers, Datasets, Accelerate, bitsandbytes, PEFT, TRL (SFTTrainer), and evaluate.

[5:30]
Model loading and tokenizer setup

Model loaded in 4-bit using NF4 quantization. A padding token is added to the tokenizer to prevent generation issues during training.

[8:19]
Custom dataset creation

Converts the Hugging Face dataset to a DataFrame, formats examples using the chat template, counts tokens, filters out examples >512 tokens, and splits into train/validation/test sets.

[14:33]
Base model evaluation before fine-tuning

Tests the base model on 100 test examples. The base model is verbose and often produces bullet points or extra text, not matching the concise answers in the dataset.

[17:36]
Data collator for completion-only loss

Uses a data collator that masks input tokens with -100 so loss is calculated only on the generated completion, not the prompt. This speeds up training and improves results.

[19:05]
LoRA configuration

Targets all linear layers in the Llama architecture (query, key, value, MLP). LoRA rank=32, alpha=16. Only 1.34% of parameters (~84 million) are trained.

[23:08]
Training configuration

Max tokens=512, 1 epoch, batch size=2 with gradient accumulation steps=4 (effective batch size=8). Uses 8-bit AdamW optimizer, evaluates every 20% of training, warmup ratio=10%.

[26:32]
Model saving and merging

After training, the LoRA adapter is merged with the base model on a P100 GPU (T4 memory insufficient). The merged model is pushed to Hugging Face Hub as 'Llama-3-8B-Instruct-Finance-R'.

[28:45]
Comparison: fine-tuned vs base model

The fine-tuned model produces much shorter, more concise answers that match the dataset format. The base model remains verbose and adds unnecessary formatting.

Fine-tuning Llama 3 8B on a custom financial dataset significantly improves response quality, making it more concise and aligned with the desired format. The process is feasible on a single GPU using LoRA and 4-bit quantization.

Mentioned in this Video

Tutorial Checklist

1 0:55 Build a dataset from custom prompts (e.g., JSON file) and transform into Hugging Face dataset.
2 1:14 Choose and evaluate the initial performance of the base model (Llama 3 8B Instruct).
3 1:23 Set up a LoRA adapter for parameter-efficient fine-tuning.
4 1:46 Train the model and monitor the training process (approx. 2 hours on a single GPU).
5 1:58 Evaluate the fine-tuned model on a previously created test set.
6 2:06 Merge the LoRA adapter with the base model and push the merged model to Hugging Face Hub.
7 3:56 Install required libraries: torch, transformers, datasets, accelerate, bitsandbytes, peft, trl, evaluate.
8 5:30 Load the base model with 4-bit quantization (NF4) and add a padding token to the tokenizer.
9 8:19 Convert the dataset to a DataFrame, format examples using the chat template, count tokens, filter out examples >512 tokens, and split into train/validation/test sets.
10 17:36 Use a data collator that masks input tokens with -100 to calculate loss only on the completion.
11 19:05 Configure LoRA: target all linear layers, rank=32, alpha=16.
12 23:08 Set training arguments: max tokens=512, 1 epoch, batch size=2, gradient accumulation steps=4, 8-bit AdamW optimizer, evaluate every 20% of training, warmup ratio=10%.
13 26:32 Save the LoRA adapter, merge with base model on a P100 GPU, and push to Hugging Face Hub with max shard size of 5GB.

Study Flashcards (15)

What is the name of the dataset used for fine-tuning?

easy Click to reveal answer

Financial Q&A 10K

2:21

How many examples does the Financial Q&A 10K dataset contain?

easy Click to reveal answer

Roughly 7,000 examples

2:21

What base model is fine-tuned in the video?

easy Click to reveal answer

Llama 3 8B Instruct by Meta AI

2:58

What is the context length of the Llama 3 8B Instruct model?

medium Click to reveal answer

8K tokens

3:21

What GPU is used for fine-tuning?

easy Click to reveal answer

T4 GPU (16GB VRAM)

3:56

What quantization format is used to load the model?

medium Click to reveal answer

4-bit NF4

5:30

Why is a padding token added to the tokenizer?

hard Click to reveal answer

To prevent the model from repeating text or generating garbled output during training.

6:45

What is the maximum number of tokens used for training?

medium Click to reveal answer

512 tokens

23:08

What percentage of parameters are trained using LoRA?

hard Click to reveal answer

1.34% (approximately 84 million parameters)

21:42

What are the LoRA rank and alpha values used?

medium Click to reveal answer

Rank=32, Alpha=16

20:22

How many epochs is the model trained for?

easy Click to reveal answer

1 epoch

23:28

What optimizer is used during training?

medium Click to reveal answer

8-bit AdamW (with paged memory)

24:09

How is the loss calculated during training?

hard Click to reveal answer

Only on the completion tokens (not the prompt), using a data collator that masks input tokens with -100.

17:36

What is the name of the fine-tuned model pushed to Hugging Face Hub?

medium Click to reveal answer

Llama-3-8B-Instruct-Finance-R

27:28

What is the maximum shard size used when pushing the model?

medium Click to reveal answer

5 GB

27:24

💡 Key Takeaways

🔧

Complete fine-tuning pipeline

Provides a step-by-step blueprint for fine-tuning a large language model on custom data, from dataset creation to deployment.

0:55
🔧

Completion-only loss calculation

Explains a key technique to mask input tokens so loss is only computed on generated output, improving training efficiency and model quality.

17:36
📊

LoRA parameter efficiency

Demonstrates that only 1.34% of parameters need to be trained to achieve significant performance gains, making fine-tuning accessible on consumer hardware.

21:42
💡

Qualitative comparison of fine-tuned vs base model

Shows concrete examples where the fine-tuned model produces concise, accurate answers while the base model remains verbose and off-format.

28:45

✂️ Creator Tools: Viral Hooks

AI-generated clip ideas for Shorts based on the transcript

No viral clips found for this video, or they are still being generated.

[00:00] what can you do to improve the

[00:02] performance of your watch language model

[00:05] for your specific use case hey everyone

[00:08] my name is Vin and in this video we're

[00:10] going to see how you can find you a

[00:12] watch language model on a custom data

[00:14] set here we're going to be using W 38b

[00:18] instruct model and we are going to be

[00:20] fine-tuning it for a rock application

[00:22] for financial data let's get started if

[00:25] you want to follow along there is a

[00:27] complete text tutorial that is available

[00:29] for m expert Pro subscribers and it is

[00:32] right under the boot camp and then fine

[00:34] tuning W3 L for R here you can find the

[00:38] complete text tutorial along with the

[00:40] source code and explanations on each of

[00:43] the steps that we're going to do along

[00:45] with a link to a Google clap notebook so

[00:48] if you want to support my work please

[00:50] consider subscribing for M expert pro

[00:52] thank you here is the process that we're

[00:55] going to go through in order to find you

[00:57] our W 3 model for our specific task

[01:01] first we're going to be building a data

[01:04] set that is based on custom prompts

[01:07] provided from a Json file that I'm going

[01:09] to show you how you can transform into

[01:12] hugging phase data set then we're going

[01:14] to be choosing and evaluating the

[01:16] initial performance of the base model in

[01:18] our case this is going to be the W 38b

[01:21] instruct model then we're going to be

[01:23] setting up an adapter and in our case

[01:26] this is going to be a war adapter that

[01:29] we're going to be using using in order

[01:30] to tune on top of the original W 3 Model

[01:35] since the W 3 Model is quite large and

[01:38] probably you're not going to be able to

[01:40] do a fine-tuning of the complete model

[01:42] on a single GPU then we are going to be

[01:46] continuing with training and monitoring

[01:48] the training process I'm going to show

[01:50] you the results that I got and this

[01:52] model was trained in roughly 2 hours for

[01:55] a single ook then we're going to be

[01:58] creating an evaluation on a previously

[02:01] created test set and based on this

[02:03] evaluation we're going to be merging the

[02:06] based model that we have and we're going

[02:08] to be pushing the model to H face Hub

[02:11] and I'm going to show you uh some

[02:14] examples on how the trained model is

[02:16] comparing the predictions to the

[02:18] untrained model the data set that we're

[02:21] going to be using is available on the

[02:23] huging face data sets it is called

[02:25] Financial Q&A 10K and here you can find

[02:29] roughly 7 ,000 examples that are

[02:33] essentially paired with a question

[02:35] context and an answer these are the

[02:38] columns that we're going to be using of

[02:39] course uh you can infer from the name

[02:42] that this is actually a financial data

[02:44] set and uh you can see that uh the two

[02:47] additional coms are filing and then

[02:49] ticker we are not going to be using

[02:51] those but we are going to be uh

[02:53] deploying the question answer and the

[02:55] context the base model that we're going

[02:58] to be using is the original

[03:00] wama 38b instruct model by meta AI which

[03:03] is also available on the H face models

[03:06] repository and this model is going to be

[03:09] a we're going to be able to put this

[03:11] model on a single GPU with a

[03:13] quantization to four bit parameters and

[03:16] I'm going to show you how to do that

[03:17] into the co notebook other than that a

[03:21] thing that you should know about this

[03:22] model is that it has a Contex length of

[03:25] 8K tokens which will be quite more than

[03:28] we need in order to find tun for our

[03:30] specific data set and this model has to

[03:33] be one of the better open models that

[03:36] you can use uh at least today so we're

[03:39] going to be fine-tuning this another

[03:41] bonus of this model is that it has a

[03:43] chat template which uh is provided by

[03:46] the tokenizer as you can see here and we

[03:49] are going to be using this chat template

[03:51] in order to further fine-tune this model

[03:54] I have the Google clap notebook now

[03:56] opened and as you can see first I'm

[03:58] starting with showing you that the

[04:00] actual GPU that I've used during this

[04:03] fine tuning was a T4 I'm going to show

[04:06] you how we can fit the model on the T4

[04:08] GPU in a bit and here I'm installing

[04:10] pretty much the latest versions of the

[04:12] torch Library the Transformers Library

[04:15] data set since we're going to be

[04:16] downloading the data set from the

[04:18] hanging face repository then the

[04:20] accelerate library and bits and bites

[04:22] which we're going to be using for the

[04:23] quantization of the model then uh for

[04:26] the war setup we're going to be using

[04:28] the P Library then we're going to be

[04:31] using the TRL sft trainer or supervised

[04:34] fine tuning uh trainer that is provided

[04:37] by this labrary and then the covered

[04:39] Library which I'm going to show you why

[04:40] we're going to be using in a bit uh we

[04:43] have a lot of imports and most of those

[04:45] are based on the fact that I'm going to

[04:48] show you a couple of uh plots uh but the

[04:51] more important thing here is that I'm

[04:54] seeding the uh torch and the numai and

[04:58] the random uh from the python with a

[05:01] seat and then I'm specifying a p token

[05:04] I'm going to show you how you can apply

[05:06] the P token to the tokenizer since the

[05:09] tokenizer at least for w 38b instruct

[05:12] model doesn't come with a PO token

[05:14] included so we're going to be doing just

[05:17] that in a bit and then uh I'm going to

[05:20] be having a a constant for the original

[05:24] model and then the new model that I'm

[05:25] going to show you how you can push to

[05:27] the Hang face Repository

[05:30] so first I'm going to start with uh

[05:32] creating the configuration for the model

[05:34] itself and here you can see that I have

[05:37] something very basic I'm loading the

[05:39] model into 4 bit and I'm using the new

[05:43] word nf4 uh format for the Quant type of

[05:47] the 4bit model and uh here I'm saying

[05:50] that the compute type which are we're

[05:52] going to be using for the computational

[05:54] part of this model is going to be a

[05:56] binary for 16 uh other than that this is

[05:59] pretty much a very standard

[06:02] configuration for wading the model into

[06:04] 4bit format uh next I'm going to show

[06:07] you that uh we are actually downloading

[06:10] the original tokenizer from The Meta

[06:12] repository and I'm adding a p token

[06:16] which is going to be this P token

[06:18] constant right here and I'm setting the

[06:20] Ping side to the right this is just in

[06:22] case uh if this is not set already and

[06:26] then I'm loading the model from the

[06:28] quantization and then after I've

[06:31] downloaded or loaded the model you'll

[06:34] see that I'm actually expanding or

[06:36] resizing the token in Bings for this

[06:38] model based on the length of the

[06:39] tokenizer since we've added a new token

[06:42] right here now why I do that uh from

[06:45] what I found if you're training with

[06:48] more than one training example per batch

[06:52] I've seen that usually the embeddings or

[06:56] the tokenizer is getting scrumbled and

[06:59] it appears that the at least the was and

[07:02] the responses don't get very good and

[07:05] what I found is that the models continue

[07:07] to Jumble or try to speak a lot and

[07:10] repeat some of the sentences if I set

[07:13] this padding token it appears that uh

[07:16] the model is actually stopping to

[07:17] generate itself as it should and uh this

[07:21] actually helped me to consider that also

[07:24] I would like to know that I've tried to

[07:26] actually fine tune the base model uh

[07:29] that that is the model that didn't uh

[07:31] include any instruct fine tuning and on

[07:34] that model also without the P token uh

[07:37] it appears that uh this model continues

[07:39] to uh repeat the text uh forever and

[07:42] ever so if you have another solution to

[07:44] this problem please let me know down

[07:46] into the to the comments of this video

[07:49] uh and you'll see that we're downloading

[07:51] the model uh you can see that the model

[07:53] was able to be wed successfully and this

[07:56] is the config you can see that we are

[07:59] actually only adding the quantization

[08:01] config right

[08:03] here uh other than that uh I'm showing

[08:06] you the beginning of SE of sequence

[08:09] token the end of sequence token and the

[08:12] new P token that we've added those are

[08:15] already into our tokenizer okay so I'm

[08:19] going to continue with the original data

[08:21] set and here I'm going to show you how

[08:24] you can essentially create your own

[08:26] custom data set so you don't have to

[08:28] rely on on some pre-processed data set

[08:31] and for example you can have a data

[08:33] frame or Json and from that uh you can

[08:36] actually create your own custom hugging

[08:39] phase data set so I'm going to start by

[08:42] downloading the original hugging face

[08:45] data set and I'm going to convert it

[08:46] into a data frame I'm going to see a

[08:49] couple of examples right here these are

[08:51] the columns that we have originally and

[08:54] the first thing is as I've already told

[08:56] you I'm going to convert this data set

[08:58] into a data frame so this is something

[09:00] that you might have in the real world uh

[09:03] for example a data frame or a CSV file

[09:05] or uh you can have some SQL or uh SQL

[09:10] database that you can convert into a CSV

[09:12] file or a data frame and from here we're

[09:15] going to be building our custom data set

[09:17] and this is how uh I'm going to do this

[09:21] so first uh something that I really like

[09:24] to do is to check whether or not this

[09:27] data set contains any new values since

[09:29] this will probably W up our gradients

[09:33] during training and our was is not going

[09:35] to be very happy with that so I see that

[09:38] pretty much uh everything is here we

[09:41] have 7,000 examples and then after this

[09:44] is complete I'm going to be building

[09:46] this function called format example in

[09:49] which I'm going to be using the question

[09:52] the answer and the context for a

[09:54] specific question along with this very

[09:57] simple system prompt on top of that I'm

[10:00] going to be calling apply chat template

[10:03] and I don't want this to get tokenized

[10:05] so in order to get these messages and

[10:09] run through this I'm going to show you

[10:12] that this is going to be running through

[10:15] every example and I'm going to be adding

[10:18] a new com com text to our data

[10:21] frame and then I'm going to continue

[10:23] with counting the actual tokens that our

[10:27] tokenizer is going to be doing in order

[10:30] to have their count into our final data

[10:33] frame and this is something that you

[10:35] might get uh for example here is a data

[10:39] frame or a sample of the first couple of

[10:42] examples five to be exact and you can

[10:44] see the question the context the answer

[10:46] and now we have the text along with a

[10:49] token count for each text I'm going to

[10:52] show you why we're going to be using

[10:54] this but let's see a simple example or

[10:57] the first example that we get

[10:59] from the text here you can see that the

[11:02] tokenizer has added all of the specific

[11:05] tokens that are actually included within

[11:08] the template you can see the system

[11:10] prompt then you can see that uh the

[11:14] question is actually

[11:17] here sorry this is uh the system prompt

[11:20] then this is the question from our

[11:22] specific case and then this is the

[11:24] context provided here between these

[11:27] triple digs uh this is ending right here

[11:31] and then we have a answer from the

[11:34] assistant so this is going to be the

[11:37] answer from our data set and then we

[11:39] have end of sequence ID token at the end

[11:42] so this is pretty much the format that

[11:44] the model is going to be receiving our

[11:46] texting and then I'm showing you a

[11:49] histogram or let's say a plot that tells

[11:53] how often tokens be between for example

[11:57] 100 uh Zer and 200 100 Etc tokens are

[12:02] relevant here and you can see our data

[12:04] set is heavily skewed towards uh 300 or

[12:08] less tokens right here which is a good

[12:11] thing since we want to reduce the number

[12:14] of tokens that we're going to be using

[12:16] in order to have a a faster training so

[12:20] this is a good for us and uh I'm going

[12:23] to be actually reducing the number of

[12:25] tokens under 512 and in our case we

[12:30] seeing that only three of the examples

[12:32] right here have more than 5 12 tokens so

[12:35] what I'm going to do is to actually

[12:37] remove those

[12:39] examples uh and then I'm going to sample

[12:42] uh 6,000 examples and based on that I'm

[12:46] going to be splitting those into a train

[12:48] validation and test sets so to continue

[12:51] with that I'm going to be using the

[12:53] train test split from the sk1 library

[12:56] I'm going to be first creating a train

[12:58] set and then the rest of the data set

[13:01] I'm going to be splitting that into a

[13:02] validation and test sets so these are

[13:05] the results that I have and from that

[13:07] I'm going to be saving roughly 4,000

[13:10] examples for training 500 for validation

[13:14] and4 testing and this essentially is

[13:18] going to be our data set that we're

[13:21] going to be building and I'm going to be

[13:23] using two Json on the data frame that we

[13:27] have uh I'm going to orient towards the

[13:29] records and I want this to be stored as

[13:32] Json wines or Json l so essentially what

[13:36] I'm going to do next is to get or W our

[13:40] custom data set that we've just created

[13:42] and this is essentially how you are

[13:45] going to be wading a Json file and this

[13:47] is the mapping between the Json files so

[13:51] what we have here is our own custom data

[13:53] set that we pre-processed enabled and

[13:55] created finally based on the Json and

[13:58] then uh at the was step we're actually

[14:00] loading our own custom data set so this

[14:03] is essentially the process that you need

[14:06] to follow in order to build a data set

[14:09] for fine-tuning your

[14:12] L next I'm going to show you that uh

[14:15] actually our data set is correctly split

[14:17] you can see the number of rows right

[14:19] here and I'm going to just be looking at

[14:22] another example of the text which is

[14:24] again a text with all of the tokens that

[14:27] are needed to be applied based on the

[14:29] chat template okay so next we're going

[14:33] to continue with testing the original

[14:36] model this is be before fine-tuning the

[14:40] base model that is I'm going to be

[14:42] creating this pipeline I'm going to be

[14:44] pipelining the model in the tokenizer

[14:47] this is for the text generation task and

[14:49] I want this to produce as much as uh

[14:52] 128 tokens at

[14:55] most so I'm going to be creating this

[14:58] helper function

[14:59] which essentially goes through the

[15:02] example right here and does the exact

[15:05] same thing that we've did before but it

[15:07] is actually removing the original or the

[15:10] uh final answer or the correct answer

[15:12] from The Prompt and this is actually the

[15:15] test prom that we're going to be

[15:16] building here is an example of that uh

[15:19] one important thing here to note is that

[15:21] I'm adding add generation prompt equal

[15:23] to true so this will actually add this

[15:28] part to the prompt

[15:30] uh which you don't have to do on your

[15:32] own and again the model is going to be

[15:35] promptly uh formatted

[15:38] promptly all right so this is the

[15:40] example right now and if I run the

[15:44] prompt through the pipeline you'll see

[15:47] that this is the original answer and

[15:50] this is the prediction for our model you

[15:53] can also see that this took us roughly

[15:55] 10 seconds uh in order to produce the

[16:00] uh prediction which is quite slow at

[16:02] least on this GPU but yeah the GPU is

[16:05] quite slow as well

[16:08] so oh this is the first example let's

[16:10] see another

[16:12] one uh how did the company Net earnings

[16:16] amount to in fisal 2022 net earnings

[16:19] were 17.1 billion in fisal 2022 so

[16:23] relatively straightforward question in a

[16:26] context let's see uh you you can see

[16:29] that the answer was pretty simple uh but

[16:32] H 3 was quite verbos at least with the

[16:36] prompt of course uh if you play around

[16:38] with the prompt you might get better

[16:40] results uh but yeah probably uh with

[16:44] some fine tuning you get still better

[16:46] results another example let's see at the

[16:50] answer and very very both answer right

[16:54] here compared to the original very

[16:56] simple answer so uh I'm going to

[17:00] essentially get the 100 example in the

[17:03] test date sets and I'm going to be

[17:04] running the predictions throughout the

[17:08] uh pipeline that we have so we can

[17:10] compare the results at the end to the

[17:12] train model and of course this model is

[17:15] quite verbos I'm not sure if it is

[17:18] correct uh at all of the prompts but at

[17:21] least in my experience I'm not very

[17:24] happy with that and probably I would go

[17:27] with further tuning the model changing

[17:29] it all together uh tuning the prompts or

[17:32] completely fine-tuning it based on the

[17:34] performance that you

[17:36] require another thing that I'm going to

[17:39] show you is uh I've seen a lot of

[17:41] examples of fine-tuning those watch

[17:44] language models but most of the times

[17:47] the wor function was calculated on the

[17:50] complete generation of the text which is

[17:53] something that we don't really want

[17:55] since we want to only judge how well the

[18:00] performance of the generation is doing

[18:03] but not the performance of the already

[18:06] inputed text so what I'm going to do is

[18:09] to get the final token of the head and

[18:12] header ID let me show you this so this

[18:16] is this token right

[18:18] here and after that I'm going to be only

[18:22] uh looking at the was after this token

[18:26] so you can see that this data cator for

[18:29] completion only uh language modeling

[18:32] task is going to be essentially masking

[18:35] the tokens with minus 100 so this will

[18:39] not be calculated during the was so this

[18:41] will also speed up the calculation or

[18:44] the training process that you have and

[18:47] all of the rest tokens are going to be

[18:49] used for calculating the loss

[18:51] essentially so pretty neat trick uh if

[18:54] you want to essentially speed up or get

[18:57] even better results with this type of

[19:00] collator which is available from the

[19:02] Transformers library of course okay so

[19:05] we have the collator we have the DAT set

[19:07] let's see what we have for the model so

[19:11] what I do in order to choose which

[19:13] layers to Target with the War uh fine

[19:17] tuning is uh pretty much I'm going to be

[19:19] choosing each linear layer right here

[19:22] and I would say that the wama

[19:24] architecture is pretty straightforward

[19:26] with the wama decoder layer so I'm going

[19:29] to be using the query key value and then

[19:33] pretty much every linear layer that we

[19:36] have right here and for the MLP part

[19:39] this was the attention part of the

[19:41] architecture if you will and for the uh

[19:44] multilayer perceptron layer whatever uh

[19:48] I'm going to be essentially targeting

[19:50] again all of the layers that are of

[19:54] course linear as well so this is

[19:56] something that is coming from the origin

[19:59] War paper I believe and if I recall

[20:01] correctly they were specifying that you

[20:03] need to Target all the linear layers

[20:05] this is how they get the best results

[20:08] possible and in our case I'm going to

[20:12] specify this linear layers right here

[20:14] within the target modules and I'm going

[20:17] to be specifying the coal language

[20:19] modeling task along with a rank of the

[20:22] war config of 32 and War Alpha of 16 and

[20:27] if you're not familiar with the War

[20:29] fine-tuning uh there is a video on my

[20:31] channel that uh pretty much describes in

[20:33] a bit more detail how war is performing

[20:37] but essentially this is uh you can think

[20:39] of of creating a smaller model on top of

[20:42] the original model and this smaller

[20:44] model you're going to be essentially

[20:46] fine-tuning only the weights of this

[20:47] small model while freezing the lch model

[20:51] on the bottom of it and when a

[20:53] prediction comes uh the prediction is

[20:56] going to go through the original model

[20:58] and then it is going to go through your

[21:00] own fine tuned adapter on top of that so

[21:03] this is the way that I pretty much think

[21:06] of when thinking of War models and then

[21:09] I'm going to be preparing this model for

[21:12] kbit training since we are using

[21:13] quantization right here and then I'm

[21:16] going to be applying the war config on

[21:19] top of the model that we have which is

[21:22] again the original W 3 Model so how many

[21:25] parameters we actually going to train

[21:26] with uh you can see here that of course

[21:29] the model offers roughly uh all the

[21:33] parameters uh are roughly 8 billion

[21:35] parameters while we're going to be

[21:37] training only about

[21:42] 1.34% or roughly 84 million parameters

[21:47] on top of that and this is uh actually a

[21:51] very good Ru of temp if the model is

[21:54] watch enough think of like five six or

[21:57] more billion parameter models then

[21:59] probably 1% or even half% of the

[22:02] parameters uh depending on some

[22:05] experiments that you might do are going

[22:07] to be enough in order to train the model

[22:09] on your specific tasks of course this

[22:11] will depend on the DAT set and the

[22:13] complexity of the task that you're going

[22:15] to be doing but roughly 1 half% 1 and a

[22:20] half% is a good R of temp for larger

[22:24] LS and next I'm going to be wading the

[22:27] tensor board with this model I'm going

[22:29] to go through the training itself in a

[22:31] bit so I want to give a big shout out to

[22:35] Philip Schmidt and I'm going to link

[22:36] down his blog into the description of

[22:38] this video but more importantly he

[22:41] specified this part right here uh which

[22:43] is very important we don't want the

[22:46] tokenizer to add any special tokens and

[22:49] we don't want any additional separator

[22:51] tokens this is provided via the DAT set

[22:53] keyword arguments of the sft trainer uh

[22:57] and again this book post is very nice

[22:59] how to findun L in 2024 with hugging

[23:02] face so go and have a read on top of

[23:05] that so back to our config as you can

[23:08] see we have a lot of configuration here

[23:11] uh I'm specifying the maximum number of

[23:13] tokens uh

[23:15] 512 uh this is based of course on the uh

[23:20] experience that we got with the token

[23:22] counts the text field that we're going

[23:24] to be using is just going to be the text

[23:27] uh we're going to be training for a

[23:28] single Epoch probably it would be great

[23:31] to train for more uh and probably you'll

[23:34] get even better results for example two

[23:36] eox might be great so uh let me know if

[23:39] you train the model for two eox and let

[23:42] me know of the results so I'm going to

[23:44] be training on the T4 so this pretty

[23:47] much allows me to have uh two examples

[23:50] per batch uh I'm going to do the same

[23:53] thing for the evaluation and I'm

[23:55] accumulating for four this is actually

[23:58] for 4 * 2 so the gradient accumulation

[24:00] is going to be doing eight samples for

[24:03] the gradient update which is uh quite

[24:06] good at least on a single GPU uh I'm

[24:09] going to be using the special item with

[24:11] wayk fix page Optimizer that is uh I

[24:15] believe coming from the bits and B

[24:17] Library as well and this is for the 8bit

[24:20] optimization so this Optimizer is quite

[24:22] good it appears to be working quite well

[24:24] and quite fast on top of that uh next

[24:27] I'm going to be ass

[24:29] evaluating every uh 20% of the training

[24:32] process and uh running through the Valu

[24:35] U sorry the validation set I have a very

[24:38] small warning rate which appears to be

[24:40] working quite all right uh also I have a

[24:44] very small warm up ratio about 10% so

[24:46] during this time uh yeah actually this

[24:49] is quite redundant since I'm using a

[24:52] constant uh warning rate schedule but

[24:55] I've tried with linear it appears to be

[24:58] doing something but not that impressed

[25:01] with it and I want the responses or the

[25:03] results to be in a safe tensor format

[25:06] and these are the arguments that I'm

[25:08] going to be essentially getting from the

[25:09] Philip Schmid blog post that I've shown

[25:11] you and I'm seeding the training process

[25:15] itself not really sure if this is going

[25:17] to be completely reproducible for you

[25:20] but it appears to be doing something for

[25:22] the seating of the values at least uh

[25:24] when you have the correctly seated data

[25:27] set and then the training itself is

[25:30] quite straightforward I'm going to be

[25:31] passing the configuration the model the

[25:33] DAT set for training for the validation

[25:35] the tokenizer and the cleor uh which is

[25:38] again going to be calculating the was

[25:41] only on the parts that are going to get

[25:43] completed by the model and then uh you

[25:47] can see that I'm essentially calling the

[25:49] dot train method and this is the result

[25:53] from this you can see that the training

[25:56] is uh some somewhat junky if you will uh

[26:00] but it goes quite well the validation

[26:04] was on the other hand is also uh

[26:07] decreasing somewhat but it is quite

[26:11] slower in the decrease rate uh I recall

[26:14] that we have only 500 examples for the

[26:16] validation probably if you increase that

[26:18] to let's say 1,000 or 2,000 you will

[26:21] probably get a much smoother validation

[26:23] most and again if you train the model

[26:26] for a bit longer you probably get some

[26:29] more of better results as well okay so

[26:32] after this is complete I'm going to be

[26:34] saving the model into our uh loal

[26:39] repository or file system and after that

[26:43] I'm going to be essentially Waring the

[26:46] model uh again this is done on the

[26:50] another actually I did this on a p100

[26:53] since the GPU memory for the T4 wasn't

[26:56] enough to Lo the model without the

[26:59] tokenization or the quantization sorry

[27:02] and I essentially wed the model with the

[27:04] p 100 uh GPU applied the P adapter on

[27:09] top of that and then merged the model

[27:11] into a single model and what I did after

[27:14] that is to essentially upload the model

[27:17] and the tokenizer to the hugging face

[27:20] Hub and I wanted this to be split into

[27:24] maximum shite of 5 GB so this is the

[27:28] public model that is available on the H

[27:30] face models it is called W 38b instruct

[27:33] Finance R and here you can find the

[27:35] complete text tutorial or sample

[27:37] examples along with some of the

[27:39] predictions that I got from this model

[27:41] uh more importantly you can find the

[27:43] files you can see these are essentially

[27:46] the tensors with the sharts of 5 GB at

[27:49] most which is quite good along with the

[27:53] generation config and the tokenizer

[27:55] itself along with a sample of the

[27:57] predictions

[27:58] and then we also have the training

[28:00] metrics that are available for the

[28:03] tensor board and uh let's see what we

[28:06] got here I'm going to show you

[28:11] something so here you can look through

[28:14] the

[28:15] complete training process you can see

[28:18] that it took at least for the validation

[28:20] was uh hour and a half and it appears to

[28:24] be performing quite well again probably

[28:27] you're going to be uh quite happy with

[28:29] deploying

[28:30] this or earning this for a bit longer

[28:34] and this is uh the training course uh

[28:36] roughly hour and 40 minutes but again

[28:39] the complete training walk is available

[28:42] within the H face repository so we have

[28:45] the trend model and now I'm going to

[28:48] essentially W our data set once more and

[28:52] just for producibility of course and I'm

[28:55] going to be downloading the model from

[28:57] the huging face up I'm going to be

[28:59] applying the quantization that I did

[29:01] with the original model so we are going

[29:03] to be doing a completely Fair uh

[29:07] comparison between the base model and

[29:08] the finetune version of the model also

[29:11] I'm going to be uh getting the tokenizer

[29:14] from our own repository since it

[29:16] contains all the padding config Etc and

[29:20] this is going to be go aheading and

[29:22] getting all of the data for our model

[29:25] again I'm going to be creating a

[29:27] pipeline and in this pipeline I'm going

[29:29] to be seing or expecting at most 128

[29:33] tokens so this is again the first uh the

[29:37] first response that I got and this is

[29:40] now the prediction of the model uh I'm

[29:43] going to show you a couple of

[29:44] comparisons in a bit but this is now

[29:47] much more aligned with what we have in

[29:49] the original data set not these bullet

[29:52] list points that we got in the original

[29:55] uh next the answer from the prediction

[29:59] here again quite uh Compact and very

[30:02] like what we get in the data set here

[30:06] next I'm going to show you another

[30:08] example uh here you can see that our

[30:10] even our fun model is quite

[30:14] verbos yeah it did uh provide a lot of

[30:17] text but again uh the response is

[30:20] correct let's see how many examples we

[30:23] are going to be getting here and how

[30:26] we're going to compare those to the

[30:28] train prediction so this is the

[30:30] predictions data frame and I'm going to

[30:32] be essentially creating or adding those

[30:35] predictions of with the train

[30:38] model uh so I'm going to be taking a

[30:41] sample of 20 examples and we're going to

[30:44] go through some examples together uh the

[30:47] first example this is the train model

[30:49] and this is the untrained one uh you can

[30:51] see that we got a much better response

[30:53] from the train model at least based on

[30:56] our qualitive uh analysis again here the

[31:00] formatting and the words appear to be

[31:03] quite

[31:04] well matched to the ones that we have

[31:08] from the uh train model compared to the

[31:11] untrained

[31:13] model okay

[31:16] next uh you can see that the train model

[31:19] is actually providing a very short

[31:23] response compared to the answer in the

[31:25] data set uh on that case I'm not really

[31:29] sure if this is completely answering the

[31:31] question but at least it appears to be

[31:34] that our model is uh very biased towards

[31:37] shorter answers on some occasions of

[31:39] course uh okay so uh here another

[31:45] example mechanical engineering from

[31:47] University of California and from

[31:50] Stanford School Etc again this appears

[31:53] to be quite well

[31:55] written and this is uh let's say an

[31:59] additional word that I would not like to

[32:02] see into my rock system uh and this is

[32:06] the case when you don't fine tune at

[32:08] least you're prompt enough with those

[32:11] types of models something that we are

[32:12] not seeing into the fine tune model

[32:15] again a very good example based on our

[32:17] fine

[32:18] tuning uh another

[32:21] example where the unra model is adding a

[32:24] bit more verbosity and uh some

[32:26] formatting that is actually not

[32:30] needed concrete number here well the

[32:34] untrained one has a lot of verbosity

[32:37] yeah you can you can go through those

[32:38] examples and you'll probably be quite

[32:42] happy with the results that you get from

[32:44] the fine tuning and probably if you do

[32:46] some more fine tuning you'll be even

[32:48] happier with the results so this is it

[32:50] for this video we've seen how you can f

[32:52] tune a w 38b instruct model on a custom

[32:56] data set and we've seen how much better

[32:59] this model is performing based on our

[33:01] fine tuning compared to the base model

[33:04] so what do you think is this model

[33:06] performing much better or is it

[33:09] exceeding your expectations let me know

[33:11] down into the comments below thanks for

[33:13] watching guys please like share and

[33:15] subscribe also join the Discord channel

[33:18] that I'm going to link down into the

[33:19] description and I'm going to see you in

[33:21] the next one bye

⚡ Saved you time reading this? Transcribe any YouTube video for free — no signup needed.