---
title: 'Fine-Tuning Llama 3 on a Custom Dataset: Training LLM for a RAG Q&A Use Case on a Single GPU'
source: 'https://youtube.com/watch?v=0XPZlR3_GgI'
video_id: '0XPZlR3_GgI'
date: 2026-06-16
duration_sec: 0
---

# Fine-Tuning Llama 3 on a Custom Dataset: Training LLM for a RAG Q&A Use Case on a Single GPU

> Source: [Fine-Tuning Llama 3 on a Custom Dataset: Training LLM for a RAG Q&A Use Case on a Single GPU](https://youtube.com/watch?v=0XPZlR3_GgI)

## Summary

This video demonstrates how to fine-tune a Llama 3 8B instruct model on a custom financial dataset for a RAG Q&A application, all on a single GPU. The process covers dataset preparation, LoRA adapter setup, training, evaluation, and model deployment to Hugging Face Hub.

### Key Points

- **Fine-tuning process overview** [0:55] — Steps include building a dataset from custom prompts, evaluating base model performance, setting up a LoRA adapter, training, evaluating on a test set, merging adapters, and pushing to Hugging Face Hub.
- **Dataset: Financial Q&A 10K** [2:21] — Uses the Financial Q&A 10K dataset from Hugging Face, containing ~7,000 examples with question, context, and answer columns. The two additional columns, filing and ticker, are not used.
- **Base model: Llama 3 8B Instruct** [2:58] — Uses Meta AI's Llama 3 8B instruct model, quantized to 4-bit for single GPU training. It has an 8K token context length and a built-in chat template.
- **Hardware and libraries** [3:56] — Fine-tuning performed on a T4 GPU (16GB VRAM). Libraries used: PyTorch, Transformers, Datasets, Accelerate, bitsandbytes, PEFT, TRL (SFTTrainer), and colored.
- **Model loading and tokenizer setup** [5:30] — Model loaded in 4-bit using NF4 quantization. A padding token is added to the tokenizer to prevent generation issues during training.
- **Custom dataset creation** [8:19] — Converts the Hugging Face dataset to a DataFrame, formats examples using the chat template, counts tokens, filters out examples >512 tokens, and splits into train/validation/test sets.
- **Base model evaluation before fine-tuning** [14:33] — Tests the base model on 100 test examples. The base model is verbose and often produces bullet points or extra text, not matching the concise answers in the dataset.
- **Data collator for completion-only loss** [17:36] — Uses a data collator that masks input tokens with -100 so loss is calculated only on the generated completion, not the prompt. This speeds up training and improves results.
- **LoRA configuration** [19:05] — Targets all linear layers in the Llama architecture (query, key, value, MLP). LoRA rank=32, alpha=16. Only 1.34% of parameters (~84 million) are trained.
- **Training configuration** [23:08] — Max tokens=512, 1 epoch, batch size=2 with gradient accumulation steps=4 (effective batch size=8). Uses 8-bit AdamW optimizer, evaluates every 20% of training, warmup ratio=10%.
- **Model saving and merging** [26:32] — After training, the LoRA adapter is merged with the base model on a P100 GPU (T4 memory was insufficient to load the unquantized model). The merged model is pushed to Hugging Face Hub as 'Llama-3-8B-Instruct-Finance-RAG'.
- **Comparison: fine-tuned vs base model** [28:45] — The fine-tuned model produces much shorter, more concise answers that match the dataset format. The base model remains verbose and adds unnecessary formatting.

### Conclusion

Fine-tuning Llama 3 8B on a custom financial dataset significantly improves response quality, making it more concise and aligned with the desired format. The process is feasible on a single GPU using LoRA and 4-bit quantization.

## Transcript

What can you do to improve the performance of your large language model for your specific use case? Hey everyone, my name is Venelin, and in this video we're going to see how you can fine-tune a large language model on a custom dataset. Here we're going to be using the Llama 3 8B Instruct model, and we are going to be fine-tuning it for a RAG application for financial data. Let's get started.

If you want to follow along, there is a complete text tutorial that is available for MLExpert Pro subscribers, and it is right under the Bootcamp and then "Fine-tuning Llama 3 LLM for RAG". Here you can find the complete text tutorial along with the source code and explanations on each of the steps that we're going to do, along with a link to a Google Colab notebook. So if you want to support my work, please consider subscribing to MLExpert Pro.
Thank you.

Here is the process that we're going to go through in order to fine-tune our Llama 3 model for our specific task. First, we're going to be building a dataset that is based on custom prompts provided from a JSON file, which I'm going to show you how to transform into a Hugging Face dataset. Then we're going to be choosing and evaluating the initial performance of the base model; in our case this is going to be the Llama 3 8B Instruct model. Then we're going to be setting up an adapter, and in our case this is going to be a LoRA adapter that we're going to be using in order to tune on top of the original Llama 3 model, since the Llama 3 model is quite large and you're probably not going to be able to do a fine-tuning of the complete model on a single GPU. Then we are going to be continuing with training and monitoring the training process. I'm going to show you the results that I got, and this model was trained in roughly 2 hours for a single epoch. Then we're going to be running an evaluation on a previously created test set, and based on this evaluation we're going to be merging the adapter into the base model that we have and pushing the model to the Hugging Face Hub. And I'm going to show you some examples of how the trained model's predictions compare to the untrained model's.

The dataset that we're going to be using is available on Hugging Face Datasets. It is called Financial Q&A 10K, and here you can find roughly 7,000 examples that are essentially paired with a question, a context, and an answer. These are the columns that we're going to be using. Of course, you can infer from the name that this is actually a financial dataset, and you can see that the two additional columns are filing and ticker. We are not going to be using those, but we are going to be employing the question, the answer, and the context.

The base model that we're going to be using is the original Llama 3 8B Instruct model by Meta AI, which is also available on the Hugging Face models repository. We're going to be able to put this model on a single GPU with quantization to 4-bit parameters, and I'm going to show you how to do that in the Colab notebook. Other than that, a thing that you should know about this model is that it has a context length of 8K tokens, which will be quite a bit more than we need in order to fine-tune on our specific dataset. And this model has to be one of the better open models that you can use, at least today, so we're going to be fine-tuning this. Another bonus of this model is that it has a chat template, which is provided by the tokenizer, as you can see here, and we are going to be using this chat template in order to further fine-tune this model.
I have the Google Colab notebook open now, and as you can see, first I'm starting with showing you that the actual GPU that I've used during this fine-tuning was a T4. I'm going to show you how we can fit the model on the T4 GPU in a bit. Here I'm installing pretty much the latest versions of the torch library, the Transformers library, and Datasets, since we're going to be downloading the dataset from the Hugging Face repository; then the Accelerate library, and bitsandbytes, which we're going to be using for the quantization of the model. Then, for the LoRA setup, we're going to be using the PEFT library. Then we're going to be using the TRL SFTTrainer, or supervised fine-tuning trainer, that is provided by this library, and then the colored library, which I'm going to show you why we're going to be using in a bit.

We have a lot of imports, and most of those are based on the fact that I'm going to show you a couple of plots, but the more important thing here is that I'm seeding torch, NumPy, and the random module from Python with a seed. Then I'm specifying a pad token. I'm going to show you how you can apply the pad token to the tokenizer, since the tokenizer, at least for the Llama 3 8B Instruct model, doesn't come with a pad token included, so we're going to be doing just that in a bit. And then I'm going to be having a constant for the original model and then the new model that I'm going to show you how you can push to the Hugging Face repository.
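The seeding step mentioned above can be sketched roughly as follows. This is a minimal sketch rather than the notebook's actual code: the helper name `seed_everything` and the seed value 42 are assumptions, and NumPy and PyTorch are only seeded if they happen to be installed.

```python
import random


def seed_everything(seed: int) -> None:
    """Seed Python's random module and, when available, NumPy and PyTorch."""
    random.seed(seed)
    try:
        import numpy as np

        np.random.seed(seed)
    except ImportError:
        pass  # NumPy is optional for this sketch
    try:
        import torch

        torch.manual_seed(seed)  # seeds the PyTorch random number generators
    except ImportError:
        pass  # PyTorch is optional for this sketch


SEED = 42  # assumed value; the video does not state which seed is used

# Re-seeding with the same value reproduces the same random draw.
seed_everything(SEED)
first_draw = random.random()
seed_everything(SEED)
second_draw = random.random()
print(first_draw == second_draw)
```

Seeding all three generators in one place keeps sampling, shuffling, and weight initialisation tied to a single constant, although it does not by itself guarantee bit-for-bit reproducible GPU training.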
So first I'm going to start with creating the configuration for the model itself, and here you can see that I have something very basic. I'm loading the model in 4-bit and I'm using the NormalFloat (NF4) format for the quant type of the 4-bit model, and here I'm saying that the compute type which we're going to be using for the computational part of this model is going to be bfloat16. Other than that, this is pretty much a very standard configuration for loading the model in 4-bit format.

Next, I'm going to show you that we are actually downloading the original tokenizer from the Meta repository, and I'm adding a pad token, which is going to be this pad token constant right here, and I'm setting the padding side to the right. This is just in case this is not set already. Then I'm loading the model with the quantization config, and after I've downloaded or loaded the model, you'll see that I'm actually expanding or resizing the token embeddings for this model based on the length of the tokenizer, since we've added a new token right here.

Now, why do I do that? From what I found, if you're training with more than one training example per batch, I've seen that usually the embeddings or the tokenizer is getting scrambled, and it appears that at least the loss and the responses don't get very good. What I found is that the models continue to jumble or try to speak a lot and repeat some of the sentences. If I set this padding token, it appears that the model is actually stopping generation as it should, and this actually helped me. Also, I would like to note that I've tried to fine-tune the base model, that is, the model that didn't include any instruct fine-tuning, and on that model too, without the pad token, it appears that the model continues to repeat the text forever and ever. So if you have another solution to this problem, please let me know down in the comments of this video.
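The loading steps described above (4-bit NF4 quantization, an added pad token, right padding, resized embeddings) might look roughly like the sketch below. It is not the notebook's code: the repository id, the pad token string `<|pad|>`, `device_map="auto"`, and the function name are assumptions, and the captions are unclear on the compute dtype, so bfloat16 is assumed. The heavy imports sit inside the function so the settings can be inspected without a GPU.

```python
MODEL_NAME = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed repository id
PAD_TOKEN = "<|pad|>"  # assumed string; the video only says a pad token is added

# Quantization settings named in the video; the dtype is stored as a string
# here and resolved to a torch dtype inside the loader.
QUANT_SETTINGS = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_compute_dtype": "bfloat16",
}


def load_quantized_model(model_name: str = MODEL_NAME):
    """Load the tokenizer and a 4-bit model, then add and register a pad token."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    settings = dict(QUANT_SETTINGS)
    settings["bnb_4bit_compute_dtype"] = getattr(torch, settings["bnb_4bit_compute_dtype"])
    quant_config = BitsAndBytesConfig(**settings)

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.add_special_tokens({"pad_token": PAD_TOKEN})
    tokenizer.padding_side = "right"

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=quant_config,
        device_map="auto",  # assumption: let Accelerate place the weights
    )
    # The vocabulary grew by one token, so the embedding matrix must grow too.
    model.resize_token_embeddings(len(tokenizer))
    return model, tokenizer
```

Calling `load_quantized_model()` requires a CUDA GPU, the bitsandbytes package, and access to the gated Llama 3 weights.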
And you'll see that we're downloading the model. You can see that the model was able to be loaded successfully, and this is the config; you can see that we are actually only adding the quantization config right here. Other than that, I'm showing you the beginning-of-sequence token, the end-of-sequence token, and the new pad token that we've added; those are already in our tokenizer.

Okay, so I'm going to continue with the original dataset, and here I'm going to show you how you can essentially create your own custom dataset, so you don't have to rely on some pre-processed dataset. For example, you can have a data frame or JSON, and from that you can actually create your own custom Hugging Face dataset. So I'm going to start by downloading the original Hugging Face dataset and I'm going to convert it into a data frame. I'm going to look at a couple of examples right here. These are the columns that we have originally, and the first thing is, as I've already told you, I'm going to convert this dataset into a data frame. This is something that you might have in the real world, for example a data frame or a CSV file, or you can have some SQL database that you can convert into a CSV file or a data frame. From here we're going to be building our custom dataset, and this is how I'm going to do this.
So first, something that I really like to do is to check whether or not this dataset contains any null values, since this will probably blow up our gradients during training and our loss is not going to be very happy with that. I see that pretty much everything is here; we have 7,000 examples. After this is complete, I'm going to be building this function called format_example, in which I'm going to be using the question, the answer, and the context for a specific question, along with this very simple system prompt. On top of that, I'm going to be calling apply_chat_template, and I don't want this to get tokenized. So, in order to get these messages and run through this, I'm going to show you that this is going to be running through every example, and I'm going to be adding a new column called text to our data frame.

Then I'm going to continue with counting the actual tokens that our tokenizer is going to produce, in order to have their count in our final data frame. This is something that you might get: for example, here is a data frame, or a sample of the first couple of examples, five to be exact, and you can see the question, the context, the answer, and now we have the text along with a token count for each text. I'm going to show you why we're going to be using this, but let's see a simple example, the first example that we get from the text. Here you can see that the tokenizer has added all of the special tokens that are actually included within the template. You can see the system prompt, then you can see that the question is actually here. Sorry, this is the system prompt, then this is the question from our specific case, and then this is the context, provided here between these triple ticks. This is ending right here, and then we have an answer from the assistant, so this is going to be the answer from our dataset, and then we have the end-of-sequence ID token at the end.
So this is pretty much the format in which the model is going to be receiving our text. Then I'm showing you a histogram, or let's say a plot, that tells how often token counts between, for example, zero and 100, 100 and 200, etc., occur here, and you can see our dataset is heavily skewed towards 300 or fewer tokens, which is a good thing, since we want to reduce the number of tokens that we're going to be using in order to have faster training. So this is good for us. I'm going to be actually keeping only examples under 512 tokens, and in our case we're seeing that only three of the examples right here have more than 512 tokens, so what I'm going to do is to actually remove those examples.

Then I'm going to sample 6,000 examples, and based on that I'm going to be splitting those into train, validation, and test sets. To continue with that, I'm going to be using train_test_split from the scikit-learn library. I'm going to be first creating a train set, and then the rest of the dataset I'm going to be splitting into validation and test sets. So these are the results that I have, and from that I'm going to be saving roughly 4,000 examples for training, 500 for validation, and [inaudible] for testing. This essentially is going to be the dataset that we're going to be building, and I'm going to be using to_json on the data frame that we have. I'm going to orient by records, and I want this to be stored as JSON Lines, or JSONL. So essentially what I'm going to do next is to get, or load, our custom dataset that we've just created, and this is essentially how you are going to be loading a JSON file, and this is the mapping between the JSON files.
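The filtering and JSON Lines steps described above can be illustrated with the standard library alone. This is a stand-in sketch, not the notebook's pandas code: the rows, their token counts, the file name, and the helper names are hypothetical, and only the 512-token threshold comes from the video.

```python
import json
import os
import tempfile

MAX_TOKENS = 512  # threshold mentioned in the video

# Hypothetical rows; in the notebook each row also carries question, context and answer.
rows = [
    {"text": "short example", "token_count": 120},
    {"text": "medium example", "token_count": 480},
    {"text": "overly long example", "token_count": 730},
]

# Keep only the examples that fit under the token limit.
kept = [row for row in rows if row["token_count"] < MAX_TOKENS]


def write_jsonl(path, records):
    """Write one JSON object per line (the JSON Lines layout)."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dictionaries."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


with tempfile.TemporaryDirectory() as tmp_dir:
    train_path = os.path.join(tmp_dir, "train.json")
    write_jsonl(train_path, kept)
    reloaded = read_jsonl(train_path)

print(len(rows), "rows before filtering,", len(reloaded), "after")
```

With pandas, `df.to_json(path, orient="records", lines=True)` writes the same layout, and `datasets.load_dataset("json", data_files={...})` can map one such file to each split.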
What we have here is our own custom dataset that we pre-processed and created, finally, based on the JSON, and then at the last step we're actually loading our own custom dataset. So this is essentially the process that you need to follow in order to build a dataset for fine-tuning your LLM.

Next, I'm going to show you that our dataset is actually correctly split; you can see the number of rows right here. I'm going to just be looking at another example of the text, which is again a text with all of the tokens that need to be applied based on the chat template.

Okay, so next we're going to continue with testing the original model, that is, the base model before fine-tuning. I'm going to be creating this pipeline; I'm going to be pipelining the model and the tokenizer. This is for the text generation task, and I want this to produce 128 tokens at most. I'm going to be creating this helper function, which essentially goes through the example right here and does the exact same thing that we did before, but it is actually removing the final answer, or the correct answer, from the prompt, and this is actually the test prompt that we're going to be building. Here is an example of that. One important thing to note here is that I'm setting add_generation_prompt equal to True, so this will actually add this part to the prompt, which you don't have to do on your own, and again the prompt is going to be properly formatted.

All right, so this is the example right now, and if I run the prompt through the pipeline, you'll see that this is the original answer and this is the prediction from our model. You can also see that this took us roughly 10 seconds in order to produce the prediction, which is quite slow, at least on this GPU, but yeah, the GPU is quite slow as well.
So this is the first example; let's see another one. "How much did the company's net earnings amount to in fiscal 2022?" "Net earnings were 17.1 billion in fiscal 2022." So, a relatively straightforward question and context. Let's see: you can see that the answer was pretty simple, but Llama 3 was quite verbose, at least with this prompt. Of course, if you play around with the prompt you might get better results, but probably with some fine-tuning you'll get still better results. Another example: let's look at the answer, and a very, very verbose answer right here compared to the original, very simple answer. So I'm going to essentially get 100 examples from the test dataset and I'm going to be running the predictions through the pipeline that we have, so we can compare the results at the end to the trained model. Of course, this model is quite verbose. I'm not sure if it is correct on all of the prompts, but at least in my experience I'm not very happy with that, and I would probably go with further tuning the model, changing it altogether, tuning the prompts, or completely fine-tuning it, based on the performance that you require.

Another thing that I'm going to show you: I've seen a lot of examples of fine-tuning those large language models, but most of the time the loss function was calculated on the complete generation of the text, which is something that we don't really want, since we want to only judge how well the generation is doing, not the performance on the already inputted text. So what I'm going to do is to get the final token of the end header ID. Let me show you this; so this is this token right here. After that, I'm going to be only looking at the loss after this token. You can see that this data collator for the completion-only language modeling task is going to be essentially masking the tokens with minus 100, so those will not be counted in the loss. This will also speed up the calculation, or the training process that you have, and all of the rest of the tokens are going to be used for calculating the loss.
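The masking idea can be shown in plain Python. TRL has shipped a collator of this kind as `DataCollatorForCompletionOnlyLM`; the function below is only a simplified re-implementation of the behaviour described in the video, with made-up token ids (7 stands in for the end-header token, 9 for the end-of-turn token) and a hypothetical function name.

```python
IGNORE_INDEX = -100  # label value that the loss function skips


def mask_prompt_tokens(input_ids, response_template_ids):
    """Return labels where everything up to and including the last
    occurrence of the response template is replaced by IGNORE_INDEX."""
    labels = list(input_ids)
    width = len(response_template_ids)
    start = None
    for index in range(len(input_ids) - width + 1):
        if input_ids[index:index + width] == response_template_ids:
            start = index  # keep overwriting so the last match wins
    if start is None:
        # No assistant turn found: nothing should contribute to the loss.
        return [IGNORE_INDEX] * len(labels)
    end = start + width
    labels[:end] = [IGNORE_INDEX] * end
    return labels


# Toy sequence: system turn, user turn, assistant turn (ids are hypothetical).
input_ids = [1, 6, 50, 7, 100, 101, 9, 6, 51, 7, 200, 201, 9, 6, 52, 7, 300, 301, 9]
labels = mask_prompt_tokens(input_ids, response_template_ids=[7])
print(labels)
```

Only the three tokens after the final header (300, 301, and the end-of-turn token 9) keep their ids, so the loss is computed on the assistant's answer alone.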
Essentially, it's a pretty neat trick if you want to speed up or get even better results with this type of collator, which is available from the Transformers library, of course. Okay, so we have the collator, we have the dataset; let's see what we have for the model.

So what I do in order to choose which layers to target with the LoRA fine-tuning is pretty much to choose each linear layer right here, and I would say that the Llama architecture is pretty straightforward with the Llama decoder layer. So I'm going to be using the query, key, value, and then pretty much every linear layer that we have right here. This was the attention part of the architecture, if you will, and for the MLP part, the multilayer perceptron, I'm going to be targeting again all of the layers, which are of course linear as well. This is something that is coming from the original LoRA paper, I believe, and if I recall correctly they were specifying that you need to target all the linear layers; this is how they get the best results possible. In our case, I'm going to specify these linear layers right here within the target modules, and I'm going to be specifying the causal language modeling task along with a rank for the LoRA config of 32 and a LoRA alpha of 16.

If you're not familiar with LoRA fine-tuning, there is a video on my channel that describes in a bit more detail how LoRA works, but essentially you can think of it as creating a smaller model on top of the original model, and you're going to be fine-tuning only the weights of this small model while freezing the large model underneath it. When a prediction comes, it is going to go through the original model and then through your own fine-tuned adapter on top of that. So this is the way that I pretty much think of LoRA models.
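As a rough check on how small that "smaller model" is, the adapter size for rank 32 on every linear projection can be worked out by hand. A LoRA adapter on a linear layer with `in_features` inputs and `out_features` outputs adds `rank * (in_features + out_features)` weights. The layer shapes and module names below are assumptions taken from the public Llama 3 8B configuration and the Hugging Face Llama implementation, not from the video; `build_lora_config` shows how the same settings could be passed to PEFT.

```python
LORA_RANK = 32
LORA_ALPHA = 16

# Assumed Llama 3 8B dimensions (hidden size, MLP size, key/value width, depth).
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 1024  # 8 key/value heads of size 128
NUM_LAYERS = 32

# (in_features, out_features) for each targeted linear layer in one decoder block.
LINEAR_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(in_features, out_features, rank):
    """Weights in the two low-rank matrices A (rank x in) and B (out x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(
    lora_param_count(n_in, n_out, LORA_RANK) for n_in, n_out in LINEAR_SHAPES.values()
)
total_trainable = per_layer * NUM_LAYERS
print(f"{total_trainable:,} trainable adapter parameters")


def build_lora_config():
    """Equivalent PEFT configuration (needs the peft package installed)."""
    from peft import LoraConfig

    return LoraConfig(
        r=LORA_RANK,
        lora_alpha=LORA_ALPHA,
        target_modules=list(LINEAR_SHAPES),
        task_type="CAUSAL_LM",
    )
```

Under these assumptions the count comes to 83,886,080, which lines up with the figure of roughly 84 million trainable parameters quoted in the video.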
I'm going to be preparing this model for k-bit training, since we are using quantization right here, and then I'm going to be applying the LoRA config on top of the model that we have, which is again the original Llama 3 model. So how many parameters are we actually going to train? You can see here that the model overall has roughly 8 billion parameters, while we're going to be training only about 1.34%, or roughly 84 million parameters, on top of that. This is actually a very good rule of thumb: if the model is large enough, think of five, six, or more billion parameter models, then probably 1% or even half a percent of the parameters, depending on some experiments that you might do, is going to be enough in order to train the model on your specific tasks. Of course, this will depend on the dataset and the complexity of the task that you're going to be doing, but roughly half a percent to one and a half percent is a good rule of thumb for larger LLMs.

Next, I'm going to be loading TensorBoard with this model; I'm going to go through the training itself in a bit. I want to give a big shout-out to Philipp Schmid, and I'm going to link his blog in the description of this video, but more importantly he specified this part right here, which is very important: we don't want the tokenizer to add any special tokens, and we don't want any additional separator tokens. This is provided via the dataset keyword arguments of the SFTTrainer. And again, this blog post is very nice, "How to Fine-Tune LLMs in 2024 with Hugging Face", so go and have a read of that.

So, back to our config. As you can see, we have a lot of configuration here. I'm specifying the maximum number of tokens, 512; this is based of course on the experience that we got with the token counts. The text field that we're going to be using is just going to be the text. We're going to be training for a single epoch. It would probably be great to train for more, and you'll probably get even better results; for example, two epochs might be great, so let me know if you train the model for two epochs, and let me know the results. I'm going to be training on the T4, so this pretty much allows me to have two examples per batch. I'm going to do the same thing for the evaluation, and I'm accumulating for four, so this is actually 4 times 2: the gradient accumulation is going to be doing eight samples per gradient update, which is quite good, at least on a single GPU. I'm going to be using the special AdamW with weight decay fix, the paged optimizer, which I believe is coming from the bitsandbytes library as well, and this is for the 8-bit optimization. This optimizer is quite good; it appears to be working quite well, and quite fast on top of that.
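The batch arithmetic above can be written out explicitly. The dictionary keys below follow Hugging Face `TrainingArguments` field names, and `"paged_adamw_8bit"` is assumed to be the optimizer setting that matches the description; the figure of 4,000 training examples is the approximate number quoted earlier in the video.

```python
import math

# Settings named in the video, keyed by the assumed TrainingArguments field names.
TRAINING_SETTINGS = {
    "num_train_epochs": 1,
    "per_device_train_batch_size": 2,
    "per_device_eval_batch_size": 2,
    "gradient_accumulation_steps": 4,
    "optim": "paged_adamw_8bit",  # assumed name of the paged 8-bit AdamW optimizer
}
TRAIN_EXAMPLES = 4000  # "roughly 4,000" training examples

# Samples that contribute to each optimizer update on a single GPU.
effective_batch_size = (
    TRAINING_SETTINGS["per_device_train_batch_size"]
    * TRAINING_SETTINGS["gradient_accumulation_steps"]
)

# Number of optimizer updates over the whole run.
optimizer_steps = (
    math.ceil(TRAIN_EXAMPLES / effective_batch_size)
    * TRAINING_SETTINGS["num_train_epochs"]
)

print(f"effective batch size: {effective_batch_size}")
print(f"optimizer steps: {optimizer_steps}")
```

Gradient accumulation trades wall-clock time for memory: the T4 only ever holds two examples at once, but each weight update still averages gradients over eight.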
I'm going to be evaluating every 20% of the training process, running through the validation set. I have a very small learning rate, which appears to be working quite all right. Also, I have a very small warm-up ratio, about 10%. Actually, this is quite redundant, since I'm using a constant learning rate schedule, but I've tried with linear; it appears to be doing something, but I'm not that impressed with it. I want the results to be in the safetensors format. These are the arguments that I'm essentially getting from the Philipp Schmid blog post that I've shown you, and I'm seeding the training process itself. I'm not really sure if this is going to be completely reproducible for you, but it appears to be doing something for the seeding of the values, at least when you have a correctly seeded dataset.

Then the training itself is quite straightforward. I'm going to be passing the configuration, the model, the datasets for training and validation, the tokenizer, and the collator, which is again going to be calculating the loss only on the parts that are going to get completed by the model. Then you can see that I'm essentially calling the .train() method, and this is the result. You can see that the training loss is somewhat jumpy, if you will, but it goes quite well. The validation loss, on the other hand, is also decreasing somewhat, but it is quite a bit slower in its rate of decrease. Recall that we have only 500 examples for validation; if you increase that to, let's say, 1,000 or 2,000, you will probably get a much smoother validation loss, and again, if you train the model for a bit longer, you'll probably get somewhat better results as well.
After this is complete, I'm going to be saving the model to our local repository, or file system, and after that I'm going to be essentially loading the model again. Actually, I did this on a P100, since the GPU memory of the T4 wasn't enough to load the model without the tokenization, or the quantization, sorry. I essentially loaded the model on the P100 GPU, applied the PEFT adapter on top of that, and then merged the model into a single model. What I did after that is to essentially upload the model and the tokenizer to the Hugging Face Hub, and I wanted this to be split into shards of at most 5 GB. So this is the public model that is available on Hugging Face models; it is called Llama-3-8B-Instruct-Finance-RAG, and here you can find the complete text tutorial, or sample examples, along with some of the predictions that I got from this model. More importantly, you can find the files: you can see these are essentially the tensors, with shards of 5 GB at most, which is quite good, along with the generation config and the tokenizer itself, along with a sample of the predictions.
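The merge-and-upload step described above might look roughly like the sketch below. It is a sketch under several assumptions rather than the notebook's code: the base repository id, the adapter directory, the target repository name, the float16 dtype, and the idea that the tokenizer was saved alongside the adapter are all hypothetical. Only the overall sequence (load unquantized, apply the adapter, merge, push with 5 GB shards) comes from the video.

```python
BASE_MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed repository id
ADAPTER_DIR = "path/to/saved-adapter"  # hypothetical local directory
HUB_REPO = "your-username/your-merged-model"  # hypothetical target repository
MAX_SHARD_SIZE = "5GB"  # shard limit mentioned in the video


def merge_and_push(base_model=BASE_MODEL, adapter_dir=ADAPTER_DIR, hub_repo=HUB_REPO):
    """Load the unquantized base model, merge the LoRA adapter, and upload."""
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Assumes the tokenizer (with its added pad token) was saved with the adapter.
    tokenizer = AutoTokenizer.from_pretrained(adapter_dir)

    model = AutoModelForCausalLM.from_pretrained(
        base_model,
        torch_dtype=torch.float16,  # assumed dtype for the unquantized load
        device_map="auto",
    )
    # Match the embedding size used during training before applying the adapter.
    model.resize_token_embeddings(len(tokenizer))

    model = PeftModel.from_pretrained(model, adapter_dir)
    merged = model.merge_and_unload()  # fold the LoRA weights into the base layers

    merged.push_to_hub(hub_repo, max_shard_size=MAX_SHARD_SIZE)
    tokenizer.push_to_hub(hub_repo)
    return merged
```

Running `merge_and_push()` needs enough GPU or CPU memory for the full-precision 8B weights and a logged-in Hugging Face account with write access to the target repository.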
And then we also have the training metrics that are available for TensorBoard. Let's see what we got here; I'm going to show you something. Here you can look through the complete training process. You can see that it took, at least for the validation loss, an hour and a half, and it appears to be performing quite well. Again, you're probably going to be quite happy with deploying this, or training this for a bit longer. And this is the training loss, roughly an hour and 40 minutes. But again, the complete training log is available within the Hugging Face repository.

So we have the trained model, and now I'm going to essentially load our dataset once more, just for reproducibility of course, and I'm going to be downloading the model from the Hugging Face Hub. I'm going to be applying the quantization that I did with the original model, so we are going to be doing a completely fair comparison between the base model and the fine-tuned version of the model. Also, I'm going to be getting the tokenizer from our own repository, since it contains all the padding config, etc., and this is going ahead and getting all of the data for our model.
Again, I'm going to be creating a pipeline, and in this pipeline I'm going to be setting, or expecting, at most 128 tokens. So this is again the first response that I got, and this is now the prediction of the model. I'm going to show you a couple of comparisons in a bit, but this is now much more aligned with what we have in the original dataset, not these bullet list points that we got in the original. Next, the answer from the prediction here: again quite compact and very much like what we get in the dataset here. Next, I'm going to show you another example. Here you can see that even our fine-tuned model is quite verbose. Yeah, it did provide a lot of text, but again, the response is correct.

Let's see how many examples we are going to be getting here and how we're going to compare those to the trained prediction. So this is the predictions data frame, and I'm going to be essentially adding the predictions from the trained model. I'm going to be taking a sample of 20 examples, and we're going to go through some examples together. The first example: this is the trained model and this is the untrained one. You can see that we got a much better response from the trained model, at least based on our qualitative analysis. Again, here the formatting and the words appear to be quite well matched in what we have from the trained model, compared to the untrained model. Okay.
Next, you can see that the trained model is actually providing a very short response compared to the answer in the dataset. In that case I'm not really sure if this is completely answering the question, but at least it appears that our model is very biased towards shorter answers, on some occasions of course. Okay, so here is another example: mechanical engineering from the University of California and from Stanford School, etc. Again, this appears to be quite well written, and this is, let's say, an additional word that I would not like to see in my RAG system. This is the case when you don't fine-tune, or at least don't prompt those types of models well enough, something that we are not seeing in the fine-tuned model. Again, a very good example based on our fine-tuning. Another example where the untrained model is adding a bit more verbosity and some formatting that is actually not needed. A concrete number here, while the untrained one has a lot of verbosity. Yeah, you can go through those examples, and you'll probably be quite happy with the results that you get from the fine-tuning, and probably if you do some more fine-tuning you'll be even happier with the results.

So this is it for this video. We've seen how you can fine-tune a Llama 3 8B Instruct model on a custom dataset, and we've seen how much better this model is performing based on our fine-tuning compared to the base model. So what do you think: is this model performing much better, or is it exceeding your expectations? Let me know down in the comments below. Thanks for watching, guys. Please like, share, and subscribe. Also, join the Discord channel that I'm going to link in the description, and I'm going to see you in the next one. Bye!
