What is fine-tuning? Plain English
45sExplains a complex AI concept in simple terms, making it accessible to a broad audience.
▶ Play ClipDavid Andre explains how to fine-tune Llama 3 for free using Google Colab and the Unsloth framework. He covers the basics of fine-tuning, data preparation, and step-by-step implementation to adapt the model for specific tasks.
Fine-tuning adapts a pre-trained LLM like Llama 3 to a specific task by adjusting a small portion of its parameters on a focused dataset.
Cost-effective (uses a GPU for hours instead of millions), improved performance on specific tasks, and data-efficient (works with as few as 300-500 entries).
Steps: prepare a tailored dataset, update pre-trained weights using optimization algorithms (only possible with open-weight models), then monitor and refine to prevent overfitting.
Customer service chatbots using proprietary transcripts, content generation in a specific writing style, and domain-specific analysis (e.g., legal or medical texts).
Uses a Google Colab notebook (created with Unsloth) to fine-tune Llama 3 8B for free on a T4 GPU. Steps include checking GPU, installing dependencies, loading the model, and configuring LoRA.
Uses the Alpaca dataset (50,000 rows) with instruction-input-output format. Custom datasets must follow the same structure. Suggests using LLMs to generate larger datasets from a few hand-crafted examples.
Trains for 60 steps (not a full epoch) for demonstration. For production, use multiple epochs and set max_steps to None. Training loss dropped from ~1.9 to ~0.8 in 8 minutes.
The model correctly answered prompts like listing prime numbers (1-50) and converting binary to decimal. Uses text streamer for token-by-token generation.
Save LoRA adapters locally or push to Hugging Face Hub. For inference, load adapters by setting a flag to true. Recommends using Unsloth for faster inference.
Quantize the model (e.g., Q4) for easier deployment on weaker hardware. Can be used with UIs like GPT4All or Oobabooga for easy chatting.
Fine-tuning Llama 3 is accessible and cost-effective, enabling anyone to adapt a powerful LLM to their specific needs using free tools like Google Colab and Unsloth.
"Title accurately reflects content: a practical guide to fine-tuning Llama 3 for custom use cases."
What is fine-tuning in the context of LLMs?
Adapting a pre-trained LLM to a specific task by adjusting a small portion of its parameters on a focused dataset.
00:15
What are three benefits of fine-tuning mentioned?
Cost-effectiveness, improved performance on specific tasks, and data efficiency.
00:52
What is the minimum dataset size suggested for fine-tuning?
300-500 entries.
01:29
What format does the Alpaca dataset use?
Instruction, input (optional), and output.
08:43
What is LoRA and why is it used?
LoRA (Low-Rank Adaptation) is a technique to efficiently update a fraction of parameters, enhancing training speed and reducing computation.
08:00
What does the EOS token do in fine-tuning?
It signals completion of a response, preventing the model from generating text indefinitely.
11:00
How many steps were used for training in the demonstration?
60 steps.
11:39
What was the training loss at the start and end of the 60-step run?
Started around 1.9, ended around 0.8.
14:09
How long did the 60-step training take on a T4 GPU?
Approximately 8 minutes (476 seconds).
14:44
What is the recommended way to save a fine-tuned model for sharing?
Push LoRA adapters to Hugging Face Hub using push_to_hub.
19:28
What is quantization and why is it used?
Quantization compresses the model to reduce memory usage, making it easier to run on weaker hardware.
22:40
Definition of Fine-Tuning
Provides a clear, plain-English explanation of a core concept.
00:15Cost-Effectiveness of Fine-Tuning
Highlights the dramatic cost savings compared to training from scratch.
00:52Customer Service Chatbot Use Case
Illustrates a practical application using proprietary data.
02:48Free Fine-Tuning with Google Colab
Demonstrates that fine-tuning is accessible without expensive hardware.
04:11Dataset Format Requirement
Specifies the exact JSON structure needed for custom datasets.
08:43Training Time and Memory Usage
Provides concrete metrics for a real fine-tuning run.
14:44Saving LoRA Adapters
Explains how to persist fine-tuning results efficiently.
19:28[00:00] my name is David Andre and in this video
[00:02] I'll teach you how to fine tune llama
[00:03] free so that it performs 10 times better
[00:06] for your specific use case let's start
[00:08] with what even is fine tuning and I made
[00:11] this explanation in plain English so
[00:13] that anybody can understand fine-tuning
[00:15] is adapting a pre-trained llm like gbd4
[00:18] or in this case Lama 3 to a specific
[00:21] task or domain it involves adjusting a
[00:24] small portion of the parameters on a
[00:27] more focused data set so you know when a
[00:29] new model releases what everybody needs
[00:31] to know is how many parameters it has we
[00:33] have llama 3 8B and always that number
[00:37] like 8B or 70b that's the number of
[00:40] parameters so we're adjusting just a
[00:41] small number of them to make it more
[00:43] focused on a specific thing fine tuning
[00:47] customizes the outputs to be more
[00:49] relevant and accurate for your use case
[00:52] here's the power of fine tuning cost
[00:54] Effectiveness it leverages the power of
[00:57] pre-trained llms which cost tens of
[00:59] millions of dollar if not hundreds of
[01:01] millions to train and we can just you
[01:04] know run a GPU for a few hours and fine
[01:07] tune something for I don't
[01:09] know like cents a few cents or few
[01:12] dollars at most which is just amazing it
[01:15] gives you improved performance because
[01:16] you can enhance the llm on your data set
[01:20] and improve accuracy for specific tasks
[01:23] and it it also is more data efficient
[01:25] you can achieve excellent results even
[01:27] with smaller data sets so you know maybe
[01:29] maybe even like 300 500 entries while
[01:34] you know llama 3 was trained on 15
[01:36] trillion tokens I don't know about you
[01:38] but I'm not have I don't have nearly as
[01:40] much data as Zak so that's why fine
[01:43] tuning is great for people like you and
[01:45] me so how does llm fine tuning actually
[01:49] work first you need to prepare your data
[01:51] set and this you know depending on how
[01:53] hardcore you want to go this can take
[01:55] anywhere from 20 minutes to a few hours
[01:58] to week
[02:00] potentially depends how far you want to
[02:02] take it so you create a smaller high
[02:04] quality data set tailored to your
[02:05] specific use case and label it
[02:08] appropriately which I'll teach you in a
[02:10] bit the pre-rain llms weights are
[02:12] updated incrementally using the
[02:14] optimization algorithms like grade in
[02:16] descent based on the new dat set so we
[02:19] can only fine-tune uh llms that we have
[02:22] access to the weights meaning open
[02:24] source open weights llms you cannot find
[02:26] you gbt 4 if you are not open AI open
[02:29] can do it obviously but me and you we
[02:32] probably don't have gp4 just laying on
[02:34] our
[02:35] computer then you Monitor and refine you
[02:38] evaluate the model's performance on a
[02:40] validation set preventing overfitting
[02:42] and guide adjustments now here are some
[02:45] real world use cases for fine tuning
[02:48] fine tuning and llm or customer service
[02:50] transcripts can create a chat bot like
[02:53] this one that can address issue in a way
[02:55] specific to the company so let's say you
[02:58] know you have a specific product very
[03:00] Niche that is not there is not much data
[03:03] about it on the internet and if somebody
[03:06] messages your customer support email you
[03:08] want your you know chatbot to respond in
[03:12] a specific way based on the information
[03:14] of your product and that data is
[03:15] proprietary it's private only you have
[03:18] it and you can find you an llm to
[03:21] respond based on that data so like
[03:24] technically if you have enough script
[03:26] you can find you an llm to respond like
[03:28] you and you know if you try sh GPT if
[03:32] you even give like sh GPT some writing
[03:35] and tell it continue in this writing
[03:36] style it's terrible so this is where
[03:38] fine tuning could be better tailored
[03:41] content generation so you can fine tune
[03:43] in llm on your posts and descriptions to
[03:45] create engaging summar or marketing copy
[03:47] again in your writing style tailored to
[03:50] your
[03:51] audience domain specific analysis so
[03:54] fine tuning llm on legal or medical text
[03:57] can make it much better for those
[03:59] specific Benchmark so and you might have
[04:01] a model that let's say it reaches 50 on
[04:04] some arbitrary Benchmark with fine
[04:06] tuning it can reach 70 or 80 now let's
[04:09] dive into how to actually implement this
[04:11] on Lama free so I created this Google
[04:14] collab well actually most of it was
[04:16] created by ansoff team A huge shout out
[04:18] to ansoff because they did all the heavy
[04:20] lifting so I'm going to also link their
[04:22] GitHub below now first off I added a
[04:25] component that's only available in April
[04:27] to the community so if you join during
[04:30] April you will get a personalized AI
[04:32] strategy to Future proof yourself and
[04:34] your business so if you want to be among
[04:36] people who are building the future if
[04:38] you want access to all the different
[04:40] courses modules and everything else in
[04:42] the community and to two we Rec calls
[04:45] then consider joining and especially if
[04:47] you want me to give you a personalized
[04:50] AI strategy to Future PR yourself so if
[04:53] that's interesting to you make sure to
[04:54] join the community it's the first link
[04:56] in the description now let's find youe
[04:58] Lama free shall we so first thing we
[05:01] check the GPU version available in the
[05:03] environment and install specific
[05:05] dependencies that are compatible with
[05:07] the detected GPU to prevent conflict so
[05:09] this is uh this cell by the way if you
[05:11] don't know how uh Google collab Works
[05:14] which is you know the software I'm using
[05:16] right now it's super simple it's
[05:18] basically um splitting the code into
[05:21] cells it's called The jupyter Notebook
[05:23] but it's like much more easier to see
[05:25] you can add text you can add graphics
[05:27] and it's great for like tutorials and
[05:29] explaining right so if you never use
[05:31] this it's great because it's free and
[05:32] Google actually gives you a GPU so you
[05:34] can use this T4 GPU to train this model
[05:38] for free and if you want faster you can
[05:41] obviously upgrade it right so I'm going
[05:43] to link this collab below the video as
[05:45] well so we run this cell which does what
[05:48] I just explained the next cell we need
[05:51] to prepare to load a range of quantied
[05:54] language models including the new 15
[05:56] trillion lvfree model so trained on 15
[05:59] trillion tokens and it's optimized for
[06:02] efficiency with forbit quantization I
[06:03] mean I'm not going to even pretend I
[06:06] know everything about fire tuning
[06:07] because I don't so if you know if um it
[06:11] seems like I have gaps in my knowledge
[06:13] it because it is I do have those gaps in
[06:15] my knowledge so I try to make it as
[06:17] simple as possible but if this proves
[06:19] something it proves that you don't have
[06:20] to be a machine learning expert to find
[06:22] your models so you know just follow
[06:25] along so here this is the max sequence
[06:27] length uh obviously 3 is up to 8,000 so
[06:32] I mean 2,000 is plenty for this
[06:33] demonstration but you can do anything
[06:35] you can do 4,000 or
[06:38] 8,000 here use 4bit quantization to
[06:41] reduce memory usage but it can be false
[06:44] as well so here are the models we can
[06:45] see like we have mro 7B llama 2 which is
[06:49] the old one Gemma from Google but
[06:50] obviously we're interested in llama 3 8B
[06:54] and by the way we can also use llama
[06:55] 370p if you want which obviously will
[06:58] take longer because uh it's a much
[07:00] bigger model so in that case you might
[07:02] uh want to buy the premium version of of
[07:04] collab or just wait for a while but yeah
[07:07] I mean uh everything is the same just
[07:09] here you would change the model to Lama
[07:11] fre 70p and if you want to use like a
[07:15] gated models from hugging face which
[07:18] gated means that you have to usually
[07:20] agree to some you know license or
[07:21] whatever then here just remove the
[07:25] comand and then put your hugging face
[07:27] token here super simple now by the way
[07:31] you always have to run this so what you
[07:32] do when you go to Google collab you
[07:34] click on run time and click run all that
[07:36] way all of the cells run but you can
[07:38] also do it one by one by clicking this
[07:40] button right here next to each cell and
[07:43] it needs to have this little tick green
[07:44] tick that way it was uh executed here
[07:47] it's not because I you know removed the
[07:51] I changed this so anytime you make any
[07:53] change it disappears but that doesn't
[07:54] matter it was still executed so it's
[07:56] stored in the run time next next up we
[08:00] integrate Laura again you don't have to
[08:02] understand what this is but it's
[08:03] basically um way of fine-tuning into our
[08:06] model which allows us to efficiently
[08:08] update just a fraction of the parameters
[08:10] enhancing training speed and reducing
[08:12] computation load so again we are not
[08:15] training the model from scratch we're
[08:16] just fine-tuning a few parameters for
[08:19] our specific use case and here you can
[08:21] change the r to Any number greater than
[08:23] zero 8 16 32 64 up to
[08:27] you and your goals would want to do with
[08:29] it by the way on SLO the reason I'm
[08:31] using it is because it's uh makes fine
[08:34] tuning much faster and consuming less
[08:36] memory so it's actually a great uh great
[08:40] framework for this data prep we now use
[08:43] the alpaka data set from yma which is
[08:46] this one which has 50,000 rows and I
[08:49] have it loaded in vs code here just that
[08:51] way you see how it looks like in Json
[08:53] formatting so you know it's a lot of
[08:56] lines because for everyone it's
[08:57] basically times five yeah so like 200
[08:59] 50,000 uh lines and it's like every one
[09:03] every one of them has an
[09:05] instruction should probably Zoom it up
[09:07] Zoom it
[09:09] in so yeah every one entry has a
[09:13] instruction give fre tips for staying
[09:15] healie input this is not mandatory
[09:18] because instruction is already enough
[09:19] context and then output this is what the
[09:22] llm should say and you do this enough
[09:25] times and the llm you know learns it
[09:27] basically learns right so we you can see
[09:29] it probably better here uh and if you
[09:31] want to use your own data set you have
[09:33] to format it the same way so you know
[09:35] just having output input and
[09:37] instructions these three um parameters
[09:41] but yeah just look at this not all of
[09:43] them have the input which is fine I mean
[09:45] probably like 20% or 15% have the input
[09:49] and that's just extra context so yeah uh
[09:51] I'm also going to link this data set
[09:52] below but if you want your own data set
[09:55] which you know if you want your own use
[09:56] case just make sure to format it the
[09:58] same way so you know instruction some
[10:01] text input some extra context or empty
[10:04] and output how the model should respond
[10:07] and you know if you if you're getting
[10:09] creative you can definitely use llms to
[10:12] generate these large data sets much
[10:14] faster I mean maybe you create really
[10:17] like 20 high quality examples by hand
[10:20] and then you run a team of Agents um for
[10:23] creating that data set that can just you
[10:24] know use those 20 examples to create
[10:27] 50,000 like in this data set but yeah
[10:30] that's a topic for a whole another video
[10:32] so if you want me to make a video on how
[10:33] to make data sets for fine tuning then
[10:36] let me know but let's go back to our
[10:41] collab so then we Define a system prompt
[10:44] which is you know custom instruction
[10:45] system prompt which you already know
[10:47] hopefully that formats tasks into
[10:49] instruction inputs and responses so this
[10:51] has to fit with our data set and we
[10:54] apply it to our data set for the model
[10:57] and we add the EOS token to Signal
[11:00] completion so this token right here here
[11:02] we Define it and here here we add it
[11:04] because without this the token
[11:05] generation continues forever so we don't
[11:08] want that obviously so let's look at the
[11:11] system prompt it's very simple it says
[11:13] below is a instruction that describes a
[11:16] task paired with an input that provides
[11:18] further context WR the response
[11:22] that appropriately completes the request
[11:25] and that's our system prompt and then we
[11:27] feed it the instruction the input and
[11:30] response and obviously you can change
[11:32] the system prompt if you
[11:35] want now train the model we do a 60 step
[11:40] uh we do only 60 steps here to speed
[11:42] things up um you can like this is
[11:45] obviously very small because it's not
[11:47] even one Epoch training Epoch so uh if
[11:51] you want to like actually use something
[11:53] for production or your business you
[11:56] probably want to train it for longer
[11:57] than 60 steps and I'm going to show you
[11:59] how how in this
[12:01] bit so if you if you do multiple EO you
[12:04] have to turn Max steps none so here okay
[12:08] number number of trained eox is not
[12:11] included in here so what you would do is
[12:13] you would copy this and you would go in
[12:16] here and look at the steps right so we
[12:18] have the steps here you would add this
[12:21] maybe you would do four or
[12:24] whatever however many you want the more
[12:26] the better but at a certain point it
[12:28] starts to not yield better result so max
[12:33] steps you have to change it to none
[12:34] right so this is 60 60 right now so you
[12:37] do none and this is where you would do
[12:39] like proper fine tuning but um you know
[12:42] I just add it that 604 demonstration
[12:45] that way it's faster and it still took
[12:47] like 8 minutes so I'm not going to
[12:48] replicate it I'm just going to show it
[12:50] everything but yeah basically um you
[12:53] know this is what you do you decide how
[12:55] many EO you want and then at this stage
[12:58] we confir configuring our models
[13:00] training setup where we Define things
[13:02] like badge
[13:03] size and learning rate to teach our
[13:06] model effectively with the data we've
[13:07] prep prepared so obviously you can like
[13:10] mess with stuff here um again I'm not
[13:13] going to PR pretend I understand
[13:14] everything but the main things are you
[13:17] know backing like this can make it five
[13:19] times faster for short sequences
[13:21] obviously the steps and the epox but um
[13:26] yeah I mean if you're confused something
[13:28] just take a screenshot boom like this
[13:31] and ask sh
[13:33] GPD now this is the current memory stats
[13:36] right so we're using the Tesla T4 GPU
[13:39] provided from Google for free and the
[13:42] max memory is 14
[13:45] GB and this is where the training begins
[13:48] this is the magical part right so here
[13:51] we do this line of code trainer stats uh
[13:55] trainer. train and this will give us the
[13:57] statistics as the model trains so again
[14:00] this is only 60
[14:01] steps which is um like zero EPO but yeah
[14:07] um you can see the training loss going
[14:09] down so like basically smaller number is
[14:11] better here so you can see like at the
[14:13] start we have 1.8 2 like 1.9 and then it
[14:17] quickly starts dropping to like 0.9 you
[14:19] know around 1 0.8 so it fluctuates a bit
[14:23] but it consistently go down 0.7 but you
[14:26] can see it's reaching like a as symt
[14:28] right obviously it's only 60 steps so
[14:31] really doesn't mean anything um but yeah
[14:34] like we ended up like 0.8 from like two
[14:38] so it shows you like if the model is
[14:40] actually
[14:42] improving and this took like 8 minutes
[14:44] you can see the stats here right so 476
[14:47] seconds almost exactly 8 minutes Peak
[14:50] Reserve memory was 8.9 GB and for
[14:53] training was 3.3 GB so not like this is
[14:56] the power of unso it's like really
[14:58] optimized for for this to use uh to run
[15:01] faster and to use less memory so that
[15:04] way we can find tune gpus for cheaper I
[15:06] mean you know I'm using a free T4 GPU
[15:09] from Google so it's free but it's faster
[15:12] like if you didn't use unso it would be
[15:13] a lot
[15:15] slower so okay so 60% of we used 60% of
[15:19] max memory so that's good because we
[15:21] didn't like hit the limit so we still
[15:23] have like 40% reserved and for training
[15:27] uh it was only 22% which is even better
[15:30] inference which is which means here we
[15:33] actually run our new model that we
[15:34] fine-tuned and okay so this data set is
[15:38] for like instructions and this is
[15:39] basically when you see a model that is
[15:42] like instruct at the end of it this is
[15:44] what they mean it's just trained on a
[15:46] large data set of instructions because
[15:48] usually the models are more for like
[15:50] chatting for text generation you know
[15:52] you give it some input and it's like
[15:54] gives you some output it's you know for
[15:56] more conversational here for
[15:58] instructions for instruct models is to
[16:00] follow instructions you give it a task
[16:02] and it completes it so like we can see
[16:04] it probably here in vs code like rewrite
[16:07] the sentence to change its meaning and
[16:08] then output the Fe
[16:10] escaped compar to dat sub so this is
[16:13] like all tasks it's all in instructions
[16:15] and then it shows how the model should
[16:17] do it
[16:19] so let's look at it right so now we've
[16:22] trained the model this took like 8
[16:24] minutes to do so all of you can do this
[16:26] the beauty of using a Google cloud is
[16:28] that obviously ly it doesn't matter what
[16:30] machine you have even if you have a
[16:31] terrible computer this will take the
[16:33] exact same time because you're using the
[16:35] GPU and
[16:37] Cloud so obviously here you can change
[16:40] your prompt I mean this is you know I
[16:42] changed the prompts here so this is my
[16:44] prompt uh but always make sure to leave
[16:46] the output blank so here the first one
[16:48] is the instruction then this is the
[16:50] input like the extra added context and
[16:52] the output leave it blank because the
[16:54] model will generate it right so list the
[16:57] prime numbers contained within this
[16:58] range and then the range is here in the
[17:01] input 1 to 50 and then the model our new
[17:05] findun the Lama 3 generates the output
[17:07] so let's look at this 2 3 5 7 11 13 17
[17:12] 19 23 29 and just by looking at it uh
[17:15] you can see it's correct I mean none of
[17:17] these numbers are divisible so yeah this
[17:19] is correct all of them are prime
[17:21] numbers and also this is this is even
[17:24] better like I think this is much more
[17:27] visible using text streamer for
[17:29] continuous inference and I'm I'm just
[17:31] going to show it again by the way this
[17:33] is how it looks right so you have the
[17:35] instructions it's separated but that's
[17:37] not the main thing not only is it
[17:38] formatted better it's uh continuous
[17:40] inference so you can see the token
[17:41] generation token by token instead of
[17:43] waiting for the whole time so if I run
[17:46] this as you can see it waits and it
[17:48] generates it all at once right so boom
[17:51] it like appeared all at once so if you
[17:53] want to see it token by token this is
[17:55] much better right look at how fast it is
[17:58] this is the power of llama 3 8 billion a
[18:00] small model but a very capable model so
[18:04] um yeah Tech streamer is great for this
[18:06] and you can see it how it's generating
[18:08] the
[18:09] answer so yeah this is um the next
[18:12] prompt I Ed myself convert these binary
[18:15] numbers to decimal and then here and by
[18:17] the way again you can use these proms
[18:19] like example create like 20 30 by hand
[18:21] maybe and then you know feed this into
[18:24] CH GPD or your team of Agents something
[18:27] automated ideally or there is I think
[18:29] there's a service for like like
[18:31] reflection AI or something like that but
[18:33] yeah either way you can creating large
[18:35] data sets need to needs to be automated
[18:37] right so you cannot do that by hand but
[18:39] either way like you create something
[18:41] like this so uh these examples and then
[18:44] you would feed that into your own data
[18:45] set obviously relevant to your use case
[18:48] to your business and you just go crazy
[18:52] and create as many as as you possibly
[18:53] can
[18:55] so like really at least 1,000 like this
[18:58] is 50,000 and it's still probably could
[19:01] be larger so yeah I mean you have to
[19:04] again that's probably another video to
[19:06] build a team of agents to um generate
[19:09] data sets but yeah okay so here we give
[19:12] it um three different binary numbers and
[19:15] it tasks its tasks is to convert to
[19:18] decimal and as you can see it does it
[19:20] flawlessly I mean this is 10 13 15 that
[19:23] is correct so we have the model we
[19:25] tested it a bit with two prompts now
[19:28] it's time to to save it and because you
[19:30] know we spend all this time all this GPU
[19:32] power training it we don't want to go
[19:34] with to waste because if you restart the
[19:36] run time in Google collab um your model
[19:39] will disappear obviously you can run it
[19:41] again but then you have to wait again
[19:43] and you know maybe run out of the three
[19:45] GPU hours so to save the final model as
[19:48] Lura adapters we can either use hugging
[19:51] face push to hub for an online save if
[19:54] you wanted your model listed on hugging
[19:56] phase so hugging phase lists data assets
[19:59] and models it's probably the main two
[20:00] things it's used for so if you want your
[20:03] model shared then you would do that but
[20:05] if you want it just Sav on your computer
[20:07] do safe pre-train for a local
[20:10] save by the way this only saves the
[20:13] loraa adapters meaning um the like
[20:16] basically the things that were changed
[20:18] it doesn't save the entire model with
[20:19] the change parameters just the changes
[20:21] right so uh it's less memory and yeah
[20:25] just faster to save so but if you want
[20:28] to save the L adapters with the save
[20:30] model uh you can change this if you want
[20:33] to load the L adapters we saved for
[20:36] inference you would change this false to
[20:38] true so simply changing
[20:42] this and yeah this is the model name so
[20:45] obviously you can change this this is
[20:47] your model used for training uh here is
[20:49] just laa model but you can name it lava
[20:51] free I don't know copyrighting or Lama
[20:53] free uh medical diagnosis whatever your
[20:57] um use case is obviously and then
[21:01] um here the alpaka prompt so yeah this
[21:04] is the variable we declared earlier so
[21:07] this is the importance you can just go
[21:08] into the collab and try to running this
[21:10] cell you have to run the cells from
[21:12] above otherwise this will not work so
[21:14] whenever you're using a jupyter notebook
[21:16] such as Google collab always run all
[21:19] cells in order otherwise it will not
[21:21] work so
[21:23] yeah so this is the same uh format right
[21:27] from earlier inst ction input output at
[21:30] this point you should be familiar with
[21:31] this and that's just for this particular
[21:33] data set and for this style of prompting
[21:36] so if you have a different one then
[21:38] follow the different one so obviously
[21:40] here U what is the famous St Tower in
[21:43] Paris obviously it's Eiffel Tower blah
[21:45] blah blah it gives some extra info about
[21:48] it so you can also use hugging face Auto
[21:51] model for perf casual LM but ANS slof
[21:54] does not recommend this because it's a
[21:56] lot slower than ANS slof
[21:59] so yeah uh if possible use unslow for
[22:02] Speed and as the name suggests of you
[22:04] know unlove it's UNS slowing everything
[22:07] it's making everything two to five times
[22:09] faster so why not do that with 80% less
[22:12] memory so
[22:13] yeah okay and then we're preparing to
[22:16] save our trained model in a more compact
[22:18] format and then upload it into a cloud
[22:20] platform which allows for Less storage
[22:23] and comparation power so again like I'm
[22:25] not going to even pretend I understand
[22:26] everything because this is honestly
[22:28] stepping outside of my comfort zone but
[22:31] like just building this and doing this
[22:33] fine tuning taught me a lot so if you
[22:35] want more technical videos like this let
[22:37] me know next we're ready to compress our
[22:40] model using various quantization methods
[22:42] which means just you know making it
[22:44] easier to run or a machine so maybe you
[22:47] cannot like if you have a bad computer
[22:49] maybe you cannot run the full model but
[22:51] you definitely can run a quantized
[22:52] version of it it makes it leaner and
[22:55] then uh we upload it to the cloud for
[22:58] easy sh in this is what this piece of
[23:00] code does
[23:03] and so we use the model un. GF file or
[23:08] the quantise version so the Q4 means
[23:10] quanti in Lama CCP or if you want a UI
[23:14] based system which probably you do which
[23:16] is easier to use you can use like GPT
[23:18] for or or um is the other one is
[23:21] escaping me but yeah these are basically
[23:24] these USB system that you can use
[23:26] to or llm anything yeah I don't know if
[23:29] this supports it but yeah basically
[23:31] these uh these U Frameworks have a UI
[23:34] that's easy to chat with and you can use
[23:37] open source model there so if you do if
[23:40] you find this you can upload this to GPD
[23:42] for all and chat with your own model
[23:44] very easily and yeah that's it you know
[23:47] how to fine tune Lama 3 for your your
[23:49] own specific use case again I'm going to
[23:52] leave these resources below the video
[23:55] and if you have any questions regarding
[23:57] to anso join the their Discord so yeah
[24:00] that's it if you find this useful then
[24:03] please subscribe and again if you want
[24:05] during April which is what like eight
[24:07] days nine days left if you join the
[24:09] community you will get a personalized AI
[24:11] strategy to Future prooof yourself and
[24:13] your business so if that sounds valuable
[24:15] to you then make sure to join it's the
[24:17] first link in the description thank you
[24:19] for watching
⚡ Saved you time reading this? Transcribe any YouTube video for free — no signup needed.