AI Summary
This video demonstrates how to fine-tune a Llama 3 8B instruct model on a custom financial dataset for a RAG Q&A application, all on a single GPU. The process covers dataset preparation, LoRA adapter setup, training, evaluation, and model deployment to Hugging Face Hub.
Chapters
Steps include building a dataset from custom prompts, evaluating base model performance, setting up a LoRA adapter, training, evaluating on a test set, merging adapters, and pushing to Hugging Face Hub.
Uses the Financial Q&A 10K dataset from Hugging Face, containing ~7,000 examples with question, context, and answer columns. The dataset is financial in nature.
Uses Meta AI's Llama 3 8B instruct model, quantized to 4-bit for single GPU training. It has an 8K token context length and a built-in chat template.
Fine-tuning performed on a T4 GPU (16GB VRAM). Libraries used: PyTorch, Transformers, Datasets, Accelerate, bitsandbytes, PEFT, TRL (SFTTrainer), and evaluate.
Model loaded in 4-bit using NF4 quantization. A padding token is added to the tokenizer to prevent generation issues during training.
Converts the Hugging Face dataset to a DataFrame, formats examples using the chat template, counts tokens, filters out examples >512 tokens, and splits into train/validation/test sets.
Tests the base model on 100 test examples. The base model is verbose and often produces bullet points or extra text, not matching the concise answers in the dataset.
Uses a data collator that masks input tokens with -100 so loss is calculated only on the generated completion, not the prompt. This speeds up training and improves results.
Targets all linear layers in the Llama architecture (query, key, value, MLP). LoRA rank=32, alpha=16. Only 1.34% of parameters (~84 million) are trained.
Max tokens=512, 1 epoch, batch size=2 with gradient accumulation steps=4 (effective batch size=8). Uses 8-bit AdamW optimizer, evaluates every 20% of training, warmup ratio=10%.
After training, the LoRA adapter is merged with the base model on a P100 GPU (T4 memory insufficient). The merged model is pushed to Hugging Face Hub as 'Llama-3-8B-Instruct-Finance-R'.
The fine-tuned model produces much shorter, more concise answers that match the dataset format. The base model remains verbose and adds unnecessary formatting.
Fine-tuning Llama 3 8B on a custom financial dataset significantly improves response quality, making it more concise and aligned with the desired format. The process is feasible on a single GPU using LoRA and 4-bit quantization.
Mentioned in this Video
PyTorch
tool
Transformers
tool
Datasets
tool
Accelerate
tool
bitsandbytes
tool
PEFT
tool
TRL (SFTTrainer)
tool
evaluate
tool
TensorBoard
tool
Hugging Face Hub
service
Philip Schmidt
person
Blog post: How to fine-tune LLMs in 2024 with Hugging Face
link
Financial Q&A 10K
dataset
Llama-3-8B-Instruct-Finance-R
model
Tutorial Checklist
Study Flashcards (15)
What is the name of the dataset used for fine-tuning?
easy
Click to reveal answer
What is the name of the dataset used for fine-tuning?
Financial Q&A 10K
2:21
How many examples does the Financial Q&A 10K dataset contain?
easy
Click to reveal answer
How many examples does the Financial Q&A 10K dataset contain?
Roughly 7,000 examples
2:21
What base model is fine-tuned in the video?
easy
Click to reveal answer
What base model is fine-tuned in the video?
Llama 3 8B Instruct by Meta AI
2:58
What is the context length of the Llama 3 8B Instruct model?
medium
Click to reveal answer
What is the context length of the Llama 3 8B Instruct model?
8K tokens
3:21
What GPU is used for fine-tuning?
easy
Click to reveal answer
What GPU is used for fine-tuning?
T4 GPU (16GB VRAM)
3:56
What quantization format is used to load the model?
medium
Click to reveal answer
What quantization format is used to load the model?
4-bit NF4
5:30
Why is a padding token added to the tokenizer?
hard
Click to reveal answer
Why is a padding token added to the tokenizer?
To prevent the model from repeating text or generating garbled output during training.
6:45
What is the maximum number of tokens used for training?
medium
Click to reveal answer
What is the maximum number of tokens used for training?
512 tokens
23:08
What percentage of parameters are trained using LoRA?
hard
Click to reveal answer
What percentage of parameters are trained using LoRA?
1.34% (approximately 84 million parameters)
21:42
What are the LoRA rank and alpha values used?
medium
Click to reveal answer
What are the LoRA rank and alpha values used?
Rank=32, Alpha=16
20:22
How many epochs is the model trained for?
easy
Click to reveal answer
How many epochs is the model trained for?
1 epoch
23:28
What optimizer is used during training?
medium
Click to reveal answer
What optimizer is used during training?
8-bit AdamW (with paged memory)
24:09
How is the loss calculated during training?
hard
Click to reveal answer
How is the loss calculated during training?
Only on the completion tokens (not the prompt), using a data collator that masks input tokens with -100.
17:36
What is the name of the fine-tuned model pushed to Hugging Face Hub?
medium
Click to reveal answer
What is the name of the fine-tuned model pushed to Hugging Face Hub?
Llama-3-8B-Instruct-Finance-R
27:28
What is the maximum shard size used when pushing the model?
medium
Click to reveal answer
What is the maximum shard size used when pushing the model?
5 GB
27:24
💡 Key Takeaways
Complete fine-tuning pipeline
Provides a step-by-step blueprint for fine-tuning a large language model on custom data, from dataset creation to deployment.
0:55Completion-only loss calculation
Explains a key technique to mask input tokens so loss is only computed on generated output, improving training efficiency and model quality.
17:36LoRA parameter efficiency
Demonstrates that only 1.34% of parameters need to be trained to achieve significant performance gains, making fine-tuning accessible on consumer hardware.
21:42Qualitative comparison of fine-tuned vs base model
Shows concrete examples where the fine-tuned model produces concise, accurate answers while the base model remains verbose and off-format.
28:45Full Transcript
[00:00] what can you do to improve the
[00:02] performance of your watch language model
[00:05] for your specific use case hey everyone
[00:08] my name is Vin and in this video we're
[00:10] going to see how you can find you a
[00:12] watch language model on a custom data
[00:14] set here we're going to be using W 38b
[00:18] instruct model and we are going to be
[00:20] fine-tuning it for a rock application
[00:22] for financial data let's get started if
[00:25] you want to follow along there is a
[00:27] complete text tutorial that is available
[00:29] for m expert Pro subscribers and it is
[00:32] right under the boot camp and then fine
[00:34] tuning W3 L for R here you can find the
[00:38] complete text tutorial along with the
[00:40] source code and explanations on each of
[00:43] the steps that we're going to do along
[00:45] with a link to a Google clap notebook so
[00:48] if you want to support my work please
[00:50] consider subscribing for M expert pro
[00:52] thank you here is the process that we're
[00:55] going to go through in order to find you
[00:57] our W 3 model for our specific task
[01:01] first we're going to be building a data
[01:04] set that is based on custom prompts
[01:07] provided from a Json file that I'm going
[01:09] to show you how you can transform into
[01:12] hugging phase data set then we're going
[01:14] to be choosing and evaluating the
[01:16] initial performance of the base model in
[01:18] our case this is going to be the W 38b
[01:21] instruct model then we're going to be
[01:23] setting up an adapter and in our case
[01:26] this is going to be a war adapter that
[01:29] we're going to be using using in order
[01:30] to tune on top of the original W 3 Model
[01:35] since the W 3 Model is quite large and
[01:38] probably you're not going to be able to
[01:40] do a fine-tuning of the complete model
[01:42] on a single GPU then we are going to be
[01:46] continuing with training and monitoring
[01:48] the training process I'm going to show
[01:50] you the results that I got and this
[01:52] model was trained in roughly 2 hours for
[01:55] a single ook then we're going to be
[01:58] creating an evaluation on a previously
[02:01] created test set and based on this
[02:03] evaluation we're going to be merging the
[02:06] based model that we have and we're going
[02:08] to be pushing the model to H face Hub
[02:11] and I'm going to show you uh some
[02:14] examples on how the trained model is
[02:16] comparing the predictions to the
[02:18] untrained model the data set that we're
[02:21] going to be using is available on the
[02:23] huging face data sets it is called
[02:25] Financial Q&A 10K and here you can find
[02:29] roughly 7 ,000 examples that are
[02:33] essentially paired with a question
[02:35] context and an answer these are the
[02:38] columns that we're going to be using of
[02:39] course uh you can infer from the name
[02:42] that this is actually a financial data
[02:44] set and uh you can see that uh the two
[02:47] additional coms are filing and then
[02:49] ticker we are not going to be using
[02:51] those but we are going to be uh
[02:53] deploying the question answer and the
[02:55] context the base model that we're going
[02:58] to be using is the original
[03:00] wama 38b instruct model by meta AI which
[03:03] is also available on the H face models
[03:06] repository and this model is going to be
[03:09] a we're going to be able to put this
[03:11] model on a single GPU with a
[03:13] quantization to four bit parameters and
[03:16] I'm going to show you how to do that
[03:17] into the co notebook other than that a
[03:21] thing that you should know about this
[03:22] model is that it has a Contex length of
[03:25] 8K tokens which will be quite more than
[03:28] we need in order to find tun for our
[03:30] specific data set and this model has to
[03:33] be one of the better open models that
[03:36] you can use uh at least today so we're
[03:39] going to be fine-tuning this another
[03:41] bonus of this model is that it has a
[03:43] chat template which uh is provided by
[03:46] the tokenizer as you can see here and we
[03:49] are going to be using this chat template
[03:51] in order to further fine-tune this model
[03:54] I have the Google clap notebook now
[03:56] opened and as you can see first I'm
[03:58] starting with showing you that the
[04:00] actual GPU that I've used during this
[04:03] fine tuning was a T4 I'm going to show
[04:06] you how we can fit the model on the T4
[04:08] GPU in a bit and here I'm installing
[04:10] pretty much the latest versions of the
[04:12] torch Library the Transformers Library
[04:15] data set since we're going to be
[04:16] downloading the data set from the
[04:18] hanging face repository then the
[04:20] accelerate library and bits and bites
[04:22] which we're going to be using for the
[04:23] quantization of the model then uh for
[04:26] the war setup we're going to be using
[04:28] the P Library then we're going to be
[04:31] using the TRL sft trainer or supervised
[04:34] fine tuning uh trainer that is provided
[04:37] by this labrary and then the covered
[04:39] Library which I'm going to show you why
[04:40] we're going to be using in a bit uh we
[04:43] have a lot of imports and most of those
[04:45] are based on the fact that I'm going to
[04:48] show you a couple of uh plots uh but the
[04:51] more important thing here is that I'm
[04:54] seeding the uh torch and the numai and
[04:58] the random uh from the python with a
[05:01] seat and then I'm specifying a p token
[05:04] I'm going to show you how you can apply
[05:06] the P token to the tokenizer since the
[05:09] tokenizer at least for w 38b instruct
[05:12] model doesn't come with a PO token
[05:14] included so we're going to be doing just
[05:17] that in a bit and then uh I'm going to
[05:20] be having a a constant for the original
[05:24] model and then the new model that I'm
[05:25] going to show you how you can push to
[05:27] the Hang face Repository
[05:30] so first I'm going to start with uh
[05:32] creating the configuration for the model
[05:34] itself and here you can see that I have
[05:37] something very basic I'm loading the
[05:39] model into 4 bit and I'm using the new
[05:43] word nf4 uh format for the Quant type of
[05:47] the 4bit model and uh here I'm saying
[05:50] that the compute type which are we're
[05:52] going to be using for the computational
[05:54] part of this model is going to be a
[05:56] binary for 16 uh other than that this is
[05:59] pretty much a very standard
[06:02] configuration for wading the model into
[06:04] 4bit format uh next I'm going to show
[06:07] you that uh we are actually downloading
[06:10] the original tokenizer from The Meta
[06:12] repository and I'm adding a p token
[06:16] which is going to be this P token
[06:18] constant right here and I'm setting the
[06:20] Ping side to the right this is just in
[06:22] case uh if this is not set already and
[06:26] then I'm loading the model from the
[06:28] quantization and then after I've
[06:31] downloaded or loaded the model you'll
[06:34] see that I'm actually expanding or
[06:36] resizing the token in Bings for this
[06:38] model based on the length of the
[06:39] tokenizer since we've added a new token
[06:42] right here now why I do that uh from
[06:45] what I found if you're training with
[06:48] more than one training example per batch
[06:52] I've seen that usually the embeddings or
[06:56] the tokenizer is getting scrumbled and
[06:59] it appears that the at least the was and
[07:02] the responses don't get very good and
[07:05] what I found is that the models continue
[07:07] to Jumble or try to speak a lot and
[07:10] repeat some of the sentences if I set
[07:13] this padding token it appears that uh
[07:16] the model is actually stopping to
[07:17] generate itself as it should and uh this
[07:21] actually helped me to consider that also
[07:24] I would like to know that I've tried to
[07:26] actually fine tune the base model uh
[07:29] that that is the model that didn't uh
[07:31] include any instruct fine tuning and on
[07:34] that model also without the P token uh
[07:37] it appears that uh this model continues
[07:39] to uh repeat the text uh forever and
[07:42] ever so if you have another solution to
[07:44] this problem please let me know down
[07:46] into the to the comments of this video
[07:49] uh and you'll see that we're downloading
[07:51] the model uh you can see that the model
[07:53] was able to be wed successfully and this
[07:56] is the config you can see that we are
[07:59] actually only adding the quantization
[08:01] config right
[08:03] here uh other than that uh I'm showing
[08:06] you the beginning of SE of sequence
[08:09] token the end of sequence token and the
[08:12] new P token that we've added those are
[08:15] already into our tokenizer okay so I'm
[08:19] going to continue with the original data
[08:21] set and here I'm going to show you how
[08:24] you can essentially create your own
[08:26] custom data set so you don't have to
[08:28] rely on on some pre-processed data set
[08:31] and for example you can have a data
[08:33] frame or Json and from that uh you can
[08:36] actually create your own custom hugging
[08:39] phase data set so I'm going to start by
[08:42] downloading the original hugging face
[08:45] data set and I'm going to convert it
[08:46] into a data frame I'm going to see a
[08:49] couple of examples right here these are
[08:51] the columns that we have originally and
[08:54] the first thing is as I've already told
[08:56] you I'm going to convert this data set
[08:58] into a data frame so this is something
[09:00] that you might have in the real world uh
[09:03] for example a data frame or a CSV file
[09:05] or uh you can have some SQL or uh SQL
[09:10] database that you can convert into a CSV
[09:12] file or a data frame and from here we're
[09:15] going to be building our custom data set
[09:17] and this is how uh I'm going to do this
[09:21] so first uh something that I really like
[09:24] to do is to check whether or not this
[09:27] data set contains any new values since
[09:29] this will probably W up our gradients
[09:33] during training and our was is not going
[09:35] to be very happy with that so I see that
[09:38] pretty much uh everything is here we
[09:41] have 7,000 examples and then after this
[09:44] is complete I'm going to be building
[09:46] this function called format example in
[09:49] which I'm going to be using the question
[09:52] the answer and the context for a
[09:54] specific question along with this very
[09:57] simple system prompt on top of that I'm
[10:00] going to be calling apply chat template
[10:03] and I don't want this to get tokenized
[10:05] so in order to get these messages and
[10:09] run through this I'm going to show you
[10:12] that this is going to be running through
[10:15] every example and I'm going to be adding
[10:18] a new com com text to our data
[10:21] frame and then I'm going to continue
[10:23] with counting the actual tokens that our
[10:27] tokenizer is going to be doing in order
[10:30] to have their count into our final data
[10:33] frame and this is something that you
[10:35] might get uh for example here is a data
[10:39] frame or a sample of the first couple of
[10:42] examples five to be exact and you can
[10:44] see the question the context the answer
[10:46] and now we have the text along with a
[10:49] token count for each text I'm going to
[10:52] show you why we're going to be using
[10:54] this but let's see a simple example or
[10:57] the first example that we get
[10:59] from the text here you can see that the
[11:02] tokenizer has added all of the specific
[11:05] tokens that are actually included within
[11:08] the template you can see the system
[11:10] prompt then you can see that uh the
[11:14] question is actually
[11:17] here sorry this is uh the system prompt
[11:20] then this is the question from our
[11:22] specific case and then this is the
[11:24] context provided here between these
[11:27] triple digs uh this is ending right here
[11:31] and then we have a answer from the
[11:34] assistant so this is going to be the
[11:37] answer from our data set and then we
[11:39] have end of sequence ID token at the end
[11:42] so this is pretty much the format that
[11:44] the model is going to be receiving our
[11:46] texting and then I'm showing you a
[11:49] histogram or let's say a plot that tells
[11:53] how often tokens be between for example
[11:57] 100 uh Zer and 200 100 Etc tokens are
[12:02] relevant here and you can see our data
[12:04] set is heavily skewed towards uh 300 or
[12:08] less tokens right here which is a good
[12:11] thing since we want to reduce the number
[12:14] of tokens that we're going to be using
[12:16] in order to have a a faster training so
[12:20] this is a good for us and uh I'm going
[12:23] to be actually reducing the number of
[12:25] tokens under 512 and in our case we
[12:30] seeing that only three of the examples
[12:32] right here have more than 5 12 tokens so
[12:35] what I'm going to do is to actually
[12:37] remove those
[12:39] examples uh and then I'm going to sample
[12:42] uh 6,000 examples and based on that I'm
[12:46] going to be splitting those into a train
[12:48] validation and test sets so to continue
[12:51] with that I'm going to be using the
[12:53] train test split from the sk1 library
[12:56] I'm going to be first creating a train
[12:58] set and then the rest of the data set
[13:01] I'm going to be splitting that into a
[13:02] validation and test sets so these are
[13:05] the results that I have and from that
[13:07] I'm going to be saving roughly 4,000
[13:10] examples for training 500 for validation
[13:14] and4 testing and this essentially is
[13:18] going to be our data set that we're
[13:21] going to be building and I'm going to be
[13:23] using two Json on the data frame that we
[13:27] have uh I'm going to orient towards the
[13:29] records and I want this to be stored as
[13:32] Json wines or Json l so essentially what
[13:36] I'm going to do next is to get or W our
[13:40] custom data set that we've just created
[13:42] and this is essentially how you are
[13:45] going to be wading a Json file and this
[13:47] is the mapping between the Json files so
[13:51] what we have here is our own custom data
[13:53] set that we pre-processed enabled and
[13:55] created finally based on the Json and
[13:58] then uh at the was step we're actually
[14:00] loading our own custom data set so this
[14:03] is essentially the process that you need
[14:06] to follow in order to build a data set
[14:09] for fine-tuning your
[14:12] L next I'm going to show you that uh
[14:15] actually our data set is correctly split
[14:17] you can see the number of rows right
[14:19] here and I'm going to just be looking at
[14:22] another example of the text which is
[14:24] again a text with all of the tokens that
[14:27] are needed to be applied based on the
[14:29] chat template okay so next we're going
[14:33] to continue with testing the original
[14:36] model this is be before fine-tuning the
[14:40] base model that is I'm going to be
[14:42] creating this pipeline I'm going to be
[14:44] pipelining the model in the tokenizer
[14:47] this is for the text generation task and
[14:49] I want this to produce as much as uh
[14:52] 128 tokens at
[14:55] most so I'm going to be creating this
[14:58] helper function
[14:59] which essentially goes through the
[15:02] example right here and does the exact
[15:05] same thing that we've did before but it
[15:07] is actually removing the original or the
[15:10] uh final answer or the correct answer
[15:12] from The Prompt and this is actually the
[15:15] test prom that we're going to be
[15:16] building here is an example of that uh
[15:19] one important thing here to note is that
[15:21] I'm adding add generation prompt equal
[15:23] to true so this will actually add this
[15:28] part to the prompt
[15:30] uh which you don't have to do on your
[15:32] own and again the model is going to be
[15:35] promptly uh formatted
[15:38] promptly all right so this is the
[15:40] example right now and if I run the
[15:44] prompt through the pipeline you'll see
[15:47] that this is the original answer and
[15:50] this is the prediction for our model you
[15:53] can also see that this took us roughly
[15:55] 10 seconds uh in order to produce the
[16:00] uh prediction which is quite slow at
[16:02] least on this GPU but yeah the GPU is
[16:05] quite slow as well
[16:08] so oh this is the first example let's
[16:10] see another
[16:12] one uh how did the company Net earnings
[16:16] amount to in fisal 2022 net earnings
[16:19] were 17.1 billion in fisal 2022 so
[16:23] relatively straightforward question in a
[16:26] context let's see uh you you can see
[16:29] that the answer was pretty simple uh but
[16:32] H 3 was quite verbos at least with the
[16:36] prompt of course uh if you play around
[16:38] with the prompt you might get better
[16:40] results uh but yeah probably uh with
[16:44] some fine tuning you get still better
[16:46] results another example let's see at the
[16:50] answer and very very both answer right
[16:54] here compared to the original very
[16:56] simple answer so uh I'm going to
[17:00] essentially get the 100 example in the
[17:03] test date sets and I'm going to be
[17:04] running the predictions throughout the
[17:08] uh pipeline that we have so we can
[17:10] compare the results at the end to the
[17:12] train model and of course this model is
[17:15] quite verbos I'm not sure if it is
[17:18] correct uh at all of the prompts but at
[17:21] least in my experience I'm not very
[17:24] happy with that and probably I would go
[17:27] with further tuning the model changing
[17:29] it all together uh tuning the prompts or
[17:32] completely fine-tuning it based on the
[17:34] performance that you
[17:36] require another thing that I'm going to
[17:39] show you is uh I've seen a lot of
[17:41] examples of fine-tuning those watch
[17:44] language models but most of the times
[17:47] the wor function was calculated on the
[17:50] complete generation of the text which is
[17:53] something that we don't really want
[17:55] since we want to only judge how well the
[18:00] performance of the generation is doing
[18:03] but not the performance of the already
[18:06] inputed text so what I'm going to do is
[18:09] to get the final token of the head and
[18:12] header ID let me show you this so this
[18:16] is this token right
[18:18] here and after that I'm going to be only
[18:22] uh looking at the was after this token
[18:26] so you can see that this data cator for
[18:29] completion only uh language modeling
[18:32] task is going to be essentially masking
[18:35] the tokens with minus 100 so this will
[18:39] not be calculated during the was so this
[18:41] will also speed up the calculation or
[18:44] the training process that you have and
[18:47] all of the rest tokens are going to be
[18:49] used for calculating the loss
[18:51] essentially so pretty neat trick uh if
[18:54] you want to essentially speed up or get
[18:57] even better results with this type of
[19:00] collator which is available from the
[19:02] Transformers library of course okay so
[19:05] we have the collator we have the DAT set
[19:07] let's see what we have for the model so
[19:11] what I do in order to choose which
[19:13] layers to Target with the War uh fine
[19:17] tuning is uh pretty much I'm going to be
[19:19] choosing each linear layer right here
[19:22] and I would say that the wama
[19:24] architecture is pretty straightforward
[19:26] with the wama decoder layer so I'm going
[19:29] to be using the query key value and then
[19:33] pretty much every linear layer that we
[19:36] have right here and for the MLP part
[19:39] this was the attention part of the
[19:41] architecture if you will and for the uh
[19:44] multilayer perceptron layer whatever uh
[19:48] I'm going to be essentially targeting
[19:50] again all of the layers that are of
[19:54] course linear as well so this is
[19:56] something that is coming from the origin
[19:59] War paper I believe and if I recall
[20:01] correctly they were specifying that you
[20:03] need to Target all the linear layers
[20:05] this is how they get the best results
[20:08] possible and in our case I'm going to
[20:12] specify this linear layers right here
[20:14] within the target modules and I'm going
[20:17] to be specifying the coal language
[20:19] modeling task along with a rank of the
[20:22] war config of 32 and War Alpha of 16 and
[20:27] if you're not familiar with the War
[20:29] fine-tuning uh there is a video on my
[20:31] channel that uh pretty much describes in
[20:33] a bit more detail how war is performing
[20:37] but essentially this is uh you can think
[20:39] of of creating a smaller model on top of
[20:42] the original model and this smaller
[20:44] model you're going to be essentially
[20:46] fine-tuning only the weights of this
[20:47] small model while freezing the lch model
[20:51] on the bottom of it and when a
[20:53] prediction comes uh the prediction is
[20:56] going to go through the original model
[20:58] and then it is going to go through your
[21:00] own fine tuned adapter on top of that so
[21:03] this is the way that I pretty much think
[21:06] of when thinking of War models and then
[21:09] I'm going to be preparing this model for
[21:12] kbit training since we are using
[21:13] quantization right here and then I'm
[21:16] going to be applying the war config on
[21:19] top of the model that we have which is
[21:22] again the original W 3 Model so how many
[21:25] parameters we actually going to train
[21:26] with uh you can see here that of course
[21:29] the model offers roughly uh all the
[21:33] parameters uh are roughly 8 billion
[21:35] parameters while we're going to be
[21:37] training only about
[21:42] 1.34% or roughly 84 million parameters
[21:47] on top of that and this is uh actually a
[21:51] very good Ru of temp if the model is
[21:54] watch enough think of like five six or
[21:57] more billion parameter models then
[21:59] probably 1% or even half% of the
[22:02] parameters uh depending on some
[22:05] experiments that you might do are going
[22:07] to be enough in order to train the model
[22:09] on your specific tasks of course this
[22:11] will depend on the DAT set and the
[22:13] complexity of the task that you're going
[22:15] to be doing but roughly 1 half% 1 and a
[22:20] half% is a good R of temp for larger
[22:24] LS and next I'm going to be wading the
[22:27] tensor board with this model I'm going
[22:29] to go through the training itself in a
[22:31] bit so I want to give a big shout out to
[22:35] Philip Schmidt and I'm going to link
[22:36] down his blog into the description of
[22:38] this video but more importantly he
[22:41] specified this part right here uh which
[22:43] is very important we don't want the
[22:46] tokenizer to add any special tokens and
[22:49] we don't want any additional separator
[22:51] tokens this is provided via the DAT set
[22:53] keyword arguments of the sft trainer uh
[22:57] and again this book post is very nice
[22:59] how to findun L in 2024 with hugging
[23:02] face so go and have a read on top of
[23:05] that so back to our config as you can
[23:08] see we have a lot of configuration here
[23:11] uh I'm specifying the maximum number of
[23:13] tokens uh
[23:15] 512 uh this is based of course on the uh
[23:20] experience that we got with the token
[23:22] counts the text field that we're going
[23:24] to be using is just going to be the text
[23:27] uh we're going to be training for a
[23:28] single Epoch probably it would be great
[23:31] to train for more uh and probably you'll
[23:34] get even better results for example two
[23:36] eox might be great so uh let me know if
[23:39] you train the model for two eox and let
[23:42] me know of the results so I'm going to
[23:44] be training on the T4 so this pretty
[23:47] much allows me to have uh two examples
[23:50] per batch uh I'm going to do the same
[23:53] thing for the evaluation and I'm
[23:55] accumulating for four this is actually
[23:58] for 4 * 2 so the gradient accumulation
[24:00] is going to be doing eight samples for
[24:03] the gradient update which is uh quite
[24:06] good at least on a single GPU uh I'm
[24:09] going to be using the special item with
[24:11] wayk fix page Optimizer that is uh I
[24:15] believe coming from the bits and B
[24:17] Library as well and this is for the 8bit
[24:20] optimization so this Optimizer is quite
[24:22] good it appears to be working quite well
[24:24] and quite fast on top of that uh next
[24:27] I'm going to be ass
[24:29] evaluating every uh 20% of the training
[24:32] process and uh running through the Valu
[24:35] U sorry the validation set I have a very
[24:38] small warning rate which appears to be
[24:40] working quite all right uh also I have a
[24:44] very small warm up ratio about 10% so
[24:46] during this time uh yeah actually this
[24:49] is quite redundant since I'm using a
[24:52] constant uh warning rate schedule but
[24:55] I've tried with linear it appears to be
[24:58] doing something but not that impressed
[25:01] with it and I want the responses or the
[25:03] results to be in a safe tensor format
[25:06] and these are the arguments that I'm
[25:08] going to be essentially getting from the
[25:09] Philip Schmid blog post that I've shown
[25:11] you and I'm seeding the training process
[25:15] itself not really sure if this is going
[25:17] to be completely reproducible for you
[25:20] but it appears to be doing something for
[25:22] the seating of the values at least uh
[25:24] when you have the correctly seated data
[25:27] set and then the training itself is
[25:30] quite straightforward I'm going to be
[25:31] passing the configuration the model the
[25:33] DAT set for training for the validation
[25:35] the tokenizer and the cleor uh which is
[25:38] again going to be calculating the was
[25:41] only on the parts that are going to get
[25:43] completed by the model and then uh you
[25:47] can see that I'm essentially calling the
[25:49] dot train method and this is the result
[25:53] from this you can see that the training
[25:56] is uh some somewhat junky if you will uh
[26:00] but it goes quite well the validation
[26:04] was on the other hand is also uh
[26:07] decreasing somewhat but it is quite
[26:11] slower in the decrease rate uh I recall
[26:14] that we have only 500 examples for the
[26:16] validation probably if you increase that
[26:18] to let's say 1,000 or 2,000 you will
[26:21] probably get a much smoother validation
[26:23] most and again if you train the model
[26:26] for a bit longer you probably get some
[26:29] more of better results as well okay so
[26:32] after this is complete I'm going to be
[26:34] saving the model into our uh loal
[26:39] repository or file system and after that
[26:43] I'm going to be essentially Waring the
[26:46] model uh again this is done on the
[26:50] another actually I did this on a p100
[26:53] since the GPU memory for the T4 wasn't
[26:56] enough to Lo the model without the
[26:59] tokenization or the quantization sorry
[27:02] and I essentially wed the model with the
[27:04] p 100 uh GPU applied the P adapter on
[27:09] top of that and then merged the model
[27:11] into a single model and what I did after
[27:14] that is to essentially upload the model
[27:17] and the tokenizer to the hugging face
[27:20] Hub and I wanted this to be split into
[27:24] maximum shite of 5 GB so this is the
[27:28] public model that is available on the H
[27:30] face models it is called W 38b instruct
[27:33] Finance R and here you can find the
[27:35] complete text tutorial or sample
[27:37] examples along with some of the
[27:39] predictions that I got from this model
[27:41] uh more importantly you can find the
[27:43] files you can see these are essentially
[27:46] the tensors with the sharts of 5 GB at
[27:49] most which is quite good along with the
[27:53] generation config and the tokenizer
[27:55] itself along with a sample of the
[27:57] predictions
[27:58] and then we also have the training
[28:00] metrics that are available for the
[28:03] tensor board and uh let's see what we
[28:06] got here I'm going to show you
[28:11] something so here you can look through
[28:14] the
[28:15] complete training process you can see
[28:18] that it took at least for the validation
[28:20] was uh hour and a half and it appears to
[28:24] be performing quite well again probably
[28:27] you're going to be uh quite happy with
[28:29] deploying
[28:30] this or earning this for a bit longer
[28:34] and this is uh the training course uh
[28:36] roughly hour and 40 minutes but again
[28:39] the complete training walk is available
[28:42] within the H face repository so we have
[28:45] the trend model and now I'm going to
[28:48] essentially W our data set once more and
[28:52] just for producibility of course and I'm
[28:55] going to be downloading the model from
[28:57] the huging face up I'm going to be
[28:59] applying the quantization that I did
[29:01] with the original model so we are going
[29:03] to be doing a completely Fair uh
[29:07] comparison between the base model and
[29:08] the finetune version of the model also
[29:11] I'm going to be uh getting the tokenizer
[29:14] from our own repository since it
[29:16] contains all the padding config Etc and
[29:20] this is going to be go aheading and
[29:22] getting all of the data for our model
[29:25] again I'm going to be creating a
[29:27] pipeline and in this pipeline I'm going
[29:29] to be seing or expecting at most 128
[29:33] tokens so this is again the first uh the
[29:37] first response that I got and this is
[29:40] now the prediction of the model uh I'm
[29:43] going to show you a couple of
[29:44] comparisons in a bit but this is now
[29:47] much more aligned with what we have in
[29:49] the original data set not these bullet
[29:52] list points that we got in the original
[29:55] uh next the answer from the prediction
[29:59] here again quite uh Compact and very
[30:02] like what we get in the data set here
[30:06] next I'm going to show you another
[30:08] example uh here you can see that our
[30:10] even our fun model is quite
[30:14] verbos yeah it did uh provide a lot of
[30:17] text but again uh the response is
[30:20] correct let's see how many examples we
[30:23] are going to be getting here and how
[30:26] we're going to compare those to the
[30:28] train prediction so this is the
[30:30] predictions data frame and I'm going to
[30:32] be essentially creating or adding those
[30:35] predictions of with the train
[30:38] model uh so I'm going to be taking a
[30:41] sample of 20 examples and we're going to
[30:44] go through some examples together uh the
[30:47] first example this is the train model
[30:49] and this is the untrained one uh you can
[30:51] see that we got a much better response
[30:53] from the train model at least based on
[30:56] our qualitive uh analysis again here the
[31:00] formatting and the words appear to be
[31:03] quite
[31:04] well matched to the ones that we have
[31:08] from the uh train model compared to the
[31:11] untrained
[31:13] model okay
[31:16] next uh you can see that the train model
[31:19] is actually providing a very short
[31:23] response compared to the answer in the
[31:25] data set uh on that case I'm not really
[31:29] sure if this is completely answering the
[31:31] question but at least it appears to be
[31:34] that our model is uh very biased towards
[31:37] shorter answers on some occasions of
[31:39] course uh okay so uh here another
[31:45] example mechanical engineering from
[31:47] University of California and from
[31:50] Stanford School Etc again this appears
[31:53] to be quite well
[31:55] written and this is uh let's say an
[31:59] additional word that I would not like to
[32:02] see into my rock system uh and this is
[32:06] the case when you don't fine tune at
[32:08] least you're prompt enough with those
[32:11] types of models something that we are
[32:12] not seeing into the fine tune model
[32:15] again a very good example based on our
[32:17] fine
[32:18] tuning uh another
[32:21] example where the unra model is adding a
[32:24] bit more verbosity and uh some
[32:26] formatting that is actually not
[32:30] needed concrete number here well the
[32:34] untrained one has a lot of verbosity
[32:37] yeah you can you can go through those
[32:38] examples and you'll probably be quite
[32:42] happy with the results that you get from
[32:44] the fine tuning and probably if you do
[32:46] some more fine tuning you'll be even
[32:48] happier with the results so this is it
[32:50] for this video we've seen how you can f
[32:52] tune a w 38b instruct model on a custom
[32:56] data set and we've seen how much better
[32:59] this model is performing based on our
[33:01] fine tuning compared to the base model
[33:04] so what do you think is this model
[33:06] performing much better or is it
[33:09] exceeding your expectations let me know
[33:11] down into the comments below thanks for
[33:13] watching guys please like share and
[33:15] subscribe also join the Discord channel
[33:18] that I'm going to link down into the
[33:19] description and I'm going to see you in
[33:21] the next one bye