Fine-tune LLM locally with Ollama
45sShows a practical, step-by-step guide to fine-tuning an LLM on your own machine, appealing to developers and AI enthusiasts.
▶ Play ClipThis video demonstrates how to fine-tune a large language model (LLM) using Unsloth and Llama 3.1, then run it locally with Ollama. The project focuses on creating a model that generates SQL from table data, using the synthetic text-to-SQL dataset.
Choosing a relevant dataset allows a small LLM to outperform larger models on specific tasks.
Create a small, fast LLM that generates SQL based on provided table data.
Synthetic text-to-SQL dataset with over 105,000 records, including prompt, SQL content, complexity, and more.
Unsloth for efficient fine-tuning (80% less memory) and Llama 3.1 as the base model.
Install dependencies, create a conda environment, install PyTorch, CUDA, Unsloth, and Jupyter.
Import FastLanguageModel from Unsloth, specify Llama 3.1 8-bit, max sequence length 2048, load in 4-bit.
Load PEFT model with LoRA adapters to update only 1-10% of parameters, reducing training cost.
Format the dataset into Alpaca prompt style for Llama 3.1, focusing on SQL, prompt, and explanation.
Use SFTTrainer from Hugging Face with parameters like max steps, seed, and warmup steps.
Convert the trained model using Unsloth's one-liner, create a Modelfile, and run with Ollama.
By following these steps, you can fine-tune an LLM locally and deploy it with Ollama, enabling use via an OpenAI-compatible API.
"The title accurately reflects the content: a straightforward guide to fine-tuning an LLM and using it with Ollama."
What is the main benefit of using Unsloth for fine-tuning?
It reduces memory usage by about 80%.
1:06
What dataset is used in the video for fine-tuning?
Synthetic text-to-SQL dataset with over 105,000 records.
0:33
What is the max sequence length set for the model?
2048 tokens.
2:02
What does setting 'load_in_4bit' to true do?
It uses fewer bits (4-bit) to represent model information, reducing memory usage and load.
2:15
What are LoRA adapters and why are they used?
LoRA adapters allow updating only 1-10% of model parameters instead of retraining the whole model, saving time and resources.
2:40
What prompt format does Llama 3.1 use?
Alpaca prompt format.
3:13
What command is used to create the Ollama model from the Modelfile?
ollama create (the exact command is not fully shown, but it reads the Modelfile).
4:40
Dataset importance
Highlights that a small model with a relevant dataset can outperform large models.
0:14Unsloth efficiency
Unsloth reduces memory usage by 80%, making fine-tuning accessible on consumer hardware.
1:01LoRA adapters
Explains parameter-efficient fine-tuning, a key technique for reducing training cost.
2:40Alpaca prompt format
Shows the need to format data correctly for the model, a common practical step.
3:13One-liner conversion
Demonstrates Unsloth's simplicity in converting models for Ollama deployment.
4:00[00:00] you want to fine tune your large language
[00:03] using Ollama.
[00:07] Well, in today's video,
[00:09] So let's go.
[00:12] First, for the fun part,
[00:14] The reason why finding the right data
[00:18] a small, large language model
[00:21] that is relevant to the task
[00:23] It can actually outperform large models.
[00:26] What I'm going to be doing
[00:29] that will generate
[00:32] I provide it.
[00:33] One of the biggest data sets to do this
[00:37] which has over 105,000 records split
[00:41] into columns of prompt SQL
[00:45] Im running a Nvidia 4090 GPU,
[00:47] so I'm going to be fine
[00:50] If you don't have a GPU,
[00:53] which allows you to run
[00:56] The great news is that this project
[00:57] does not require a lot of complex
[01:01] We're going to be using Unsloth
[01:02] which allows you to fine tune
[01:06] really efficiently,
[01:09] And we're going to be using Llama 3.1
[01:13] research purposes, especially in English,
[01:18] Make sure that you have Anaconda
[01:19] installed on your machine
[01:22] I will be using Cuda 12.1 and Python 3.10
[01:26] You want to install
[01:27] the dependencies required by Unsloth
[01:31] But for simplicity, here it is.
[01:33] This creates a new environment for us
[01:35] and installed PyTorch Cuda libraries
[01:39] You'll also want to install Jupyter
[01:42] and then run your Jupyter notebook.
[01:44] And now you're done with the setup.
[01:46] So let's go into the Jupyter
[01:48] First, we want to make sure that all the
[01:52] If you're using Google Colab,
[01:55] Next we're going to import the fast
[01:58] Here we're specifying that we want to use
[02:02] We also want to set up a max sequence
[02:06] This means that the model
[02:10] where a token can be a word, subword
[02:15] When processing or generating text,
[02:19] which essentially means we're using less
[02:23] or 32 bits
[02:26] Doing this is going to help you
[02:28] reduce memory usage
[02:31] After running this,
[02:32] you're going to get a cute Ascii image,
[02:35] After this,
[02:36] we're going to load in the PEFT model,
[02:40] If you don't know what these terms
[02:43] Basically, the LORA adapters mean that
[02:45] we only have to update 1
[02:49] Without them, it means that
[02:50] we would have to retrain the whole model,
[02:54] which takes a lot of time, energy,
[02:57] Unsworth provides
[03:00] I trust them,
[03:02] Now this is where things can get
[03:06] set you're using.
[03:06] The each data
[03:09] but they're each formatted in the same way
[03:12] can understand it.
[03:13] Llama three uses alpaca prompts
[03:17] Now, if you remember our data set,
[03:21] and letting it go off to the races.
[03:22] I have to format my response
[03:26] I'm only interested
[03:28] The prompt I will be asking for, as well
[03:32] So I'm going to update my code
[03:34] Now we set up the training module
[03:37] Trainer by hugging face is what I used.
[03:39] There are a lot of parameters to use all
[03:43] So for example, have max steps which tells
[03:47] Seed is a random number generator.
[03:48] We used to be able to reproduce results
[03:53] learning rate over time.
[03:54] So now that we have everything
[03:59] And that's it.
[04:00] Your model has been trained.
[04:01] Now before we move on,
[04:02] we actually need
[04:03] to convert this into the right file type
[04:06] using a llama.
[04:07] Luckily, onslaught has a one liner
[04:10] After this is done, we only need to do one
[04:15] First, open up your terminal.
[04:16] I'm using the warp terminal here.
[04:18] Go to the path of where the file is saved.
[04:21] Then create a file called Model file
[04:25] This is Ollamas Docker
[04:28] where we can create new models
[04:30] In our model file
[04:32] So something like you're an SQL generator
[04:36] and gives them helpful SQL to use.
[04:38] Finally make sure Olan was running.
[04:40] And then we're just going to run
[04:42] This command will then read all the items
[04:46] and start using llama Dhcp under the hood
[04:50] on your machine. And congrats!
[04:51] You can now use your fine tuned
[04:54] all with the OpenAI compatible API
[05:01] If you're
[05:01] curious to know more about Alama,
[05:05] out about everything you need to know
[05:08] Otherwise, thank you for watching
⚡ Saved you time reading this? Transcribe any YouTube video for free — no signup needed.