TubeSum ← Transcribe a video

EASIEST Way to Fine-Tune a LLM and Use It With Ollama

Transcribed Jun 14, 2026 Watch on YouTube ↗
Intermediate 2 min read For: Developers and ML enthusiasts with basic knowledge of Python and LLMs.
809.4K
Views
25.4K
Likes
462
Comments
501
Dislikes
3.2%
📈 Moderate

AI Summary

This video demonstrates how to fine-tune a large language model (LLM) using Unsloth and Llama 3.1, then run it locally with Ollama. The project focuses on creating a model that generates SQL from table data, using the synthetic text-to-SQL dataset.

[0:14]
Importance of dataset selection

Choosing a relevant dataset allows a small LLM to outperform larger models on specific tasks.

[0:26]
Project goal

Create a small, fast LLM that generates SQL based on provided table data.

[0:33]
Dataset used

Synthetic text-to-SQL dataset with over 105,000 records, including prompt, SQL content, complexity, and more.

[1:01]
Tools used

Unsloth for efficient fine-tuning (80% less memory) and Llama 3.1 as the base model.

[1:48]
Setup steps

Install dependencies, create a conda environment, install PyTorch, CUDA, Unsloth, and Jupyter.

[1:55]
Loading the model

Import FastLanguageModel from Unsloth, specify Llama 3.1 8-bit, max sequence length 2048, load in 4-bit.

[2:35]
PEFT and LoRA adapters

Load PEFT model with LoRA adapters to update only 1-10% of parameters, reducing training cost.

[3:02]
Data formatting

Format the dataset into Alpaca prompt style for Llama 3.1, focusing on SQL, prompt, and explanation.

[3:34]
Training setup

Use SFTTrainer from Hugging Face with parameters like max steps, seed, and warmup steps.

[4:00]
Conversion and Ollama deployment

Convert the trained model using Unsloth's one-liner, create a Modelfile, and run with Ollama.

By following these steps, you can fine-tune an LLM locally and deploy it with Ollama, enabling use via an OpenAI-compatible API.

Clickbait Check

90% Legit

"The title accurately reflects the content: a straightforward guide to fine-tuning an LLM and using it with Ollama."

Mentioned in this Video

Tutorial Checklist

1 1:18 Install Anaconda and CUDA libraries (CUDA 12.1, Python 3.10).
2 1:33 Create a new conda environment and install PyTorch, CUDA, Unsloth, and Jupyter.
3 1:48 Launch Jupyter notebook and verify installed packages.
4 1:55 Import FastLanguageModel from Unsloth and load Llama 3.1 8-bit model with max_seq_length=2048 and load_in_4bit=True.
5 2:35 Load PEFT model with LoRA adapters using Unsloth's recommended settings.
6 3:02 Format your dataset into Alpaca prompt style for Llama 3.1.
7 3:34 Set up SFTTrainer with parameters like max_steps, seed, and warmup_steps, then train the model.
8 4:00 Convert the trained model using Unsloth's one-liner.
9 4:15 Create a Modelfile with a system prompt (e.g., 'You are an SQL generator...').
10 4:38 Run 'ollama create' command to create the model, then use it locally.

Study Flashcards (7)

What is the main benefit of using Unsloth for fine-tuning?

easy Click to reveal answer

It reduces memory usage by about 80%.

1:06

What dataset is used in the video for fine-tuning?

easy Click to reveal answer

Synthetic text-to-SQL dataset with over 105,000 records.

0:33

What is the max sequence length set for the model?

easy Click to reveal answer

2048 tokens.

2:02

What does setting 'load_in_4bit' to true do?

medium Click to reveal answer

It uses fewer bits (4-bit) to represent model information, reducing memory usage and load.

2:15

What are LoRA adapters and why are they used?

medium Click to reveal answer

LoRA adapters allow updating only 1-10% of model parameters instead of retraining the whole model, saving time and resources.

2:40

What prompt format does Llama 3.1 use?

medium Click to reveal answer

Alpaca prompt format.

3:13

What command is used to create the Ollama model from the Modelfile?

hard Click to reveal answer

ollama create (the exact command is not fully shown, but it reads the Modelfile).

4:40

💡 Key Takeaways

💡

Dataset importance

Highlights that a small model with a relevant dataset can outperform large models.

0:14
📊

Unsloth efficiency

Unsloth reduces memory usage by 80%, making fine-tuning accessible on consumer hardware.

1:01
🔧

LoRA adapters

Explains parameter-efficient fine-tuning, a key technique for reducing training cost.

2:40
🔧

Alpaca prompt format

Shows the need to format data correctly for the model, a common practical step.

3:13
🔧

One-liner conversion

Demonstrates Unsloth's simplicity in converting models for Ollama deployment.

4:00

✂️ Creator Tools: Viral Hooks

AI-generated clip ideas for Shorts based on the transcript

Fine-tune LLM locally with Ollama

45s

Shows a practical, step-by-step guide to fine-tuning an LLM on your own machine, appealing to developers and AI enthusiasts.

▶ Play Clip

Unsloth cuts memory usage by 80%

60s

Highlights a tool that dramatically reduces memory requirements, making fine-tuning accessible to more people.

▶ Play Clip

Load Llama 3.1 in 4-bit mode

51s

Explains how to reduce model memory usage with 4-bit quantization, a key technique for running LLMs on consumer hardware.

▶ Play Clip

Format dataset for fine-tuning

60s

Demonstrates the crucial step of formatting data for Llama 3.1's Alpaca prompt style, a common pain point for beginners.

▶ Play Clip

Run fine-tuned model with Ollama

57s

Shows the final step to deploy the model locally, giving viewers a complete end-to-end workflow.

▶ Play Clip

[00:00] you want to fine tune your large language

[00:03] using Ollama.

[00:07] Well, in today's video,

[00:09] So let's go.

[00:12] First, for the fun part,

[00:14] The reason why finding the right data

[00:18] a small, large language model

[00:21] that is relevant to the task

[00:23] It can actually outperform large models.

[00:26] What I'm going to be doing

[00:29] that will generate

[00:32] I provide it.

[00:33] One of the biggest data sets to do this

[00:37] which has over 105,000 records split

[00:41] into columns of prompt SQL

[00:45] Im running a Nvidia 4090 GPU,

[00:47] so I'm going to be fine

[00:50] If you don't have a GPU,

[00:53] which allows you to run

[00:56] The great news is that this project

[00:57] does not require a lot of complex

[01:01] We're going to be using Unsloth

[01:02] which allows you to fine tune

[01:06] really efficiently,

[01:09] And we're going to be using Llama 3.1

[01:13] research purposes, especially in English,

[01:18] Make sure that you have Anaconda

[01:19] installed on your machine

[01:22] I will be using Cuda 12.1 and Python 3.10

[01:26] You want to install

[01:27] the dependencies required by Unsloth

[01:31] But for simplicity, here it is.

[01:33] This creates a new environment for us

[01:35] and installed PyTorch Cuda libraries

[01:39] You'll also want to install Jupyter

[01:42] and then run your Jupyter notebook.

[01:44] And now you're done with the setup.

[01:46] So let's go into the Jupyter

[01:48] First, we want to make sure that all the

[01:52] If you're using Google Colab,

[01:55] Next we're going to import the fast

[01:58] Here we're specifying that we want to use

[02:02] We also want to set up a max sequence

[02:06] This means that the model

[02:10] where a token can be a word, subword

[02:15] When processing or generating text,

[02:19] which essentially means we're using less

[02:23] or 32 bits

[02:26] Doing this is going to help you

[02:28] reduce memory usage

[02:31] After running this,

[02:32] you're going to get a cute Ascii image,

[02:35] After this,

[02:36] we're going to load in the PEFT model,

[02:40] If you don't know what these terms

[02:43] Basically, the LORA adapters mean that

[02:45] we only have to update 1

[02:49] Without them, it means that

[02:50] we would have to retrain the whole model,

[02:54] which takes a lot of time, energy,

[02:57] Unsworth provides

[03:00] I trust them,

[03:02] Now this is where things can get

[03:06] set you're using.

[03:06] The each data

[03:09] but they're each formatted in the same way

[03:12] can understand it.

[03:13] Llama three uses alpaca prompts

[03:17] Now, if you remember our data set,

[03:21] and letting it go off to the races.

[03:22] I have to format my response

[03:26] I'm only interested

[03:28] The prompt I will be asking for, as well

[03:32] So I'm going to update my code

[03:34] Now we set up the training module

[03:37] Trainer by hugging face is what I used.

[03:39] There are a lot of parameters to use all

[03:43] So for example, have max steps which tells

[03:47] Seed is a random number generator.

[03:48] We used to be able to reproduce results

[03:53] learning rate over time.

[03:54] So now that we have everything

[03:59] And that's it.

[04:00] Your model has been trained.

[04:01] Now before we move on,

[04:02] we actually need

[04:03] to convert this into the right file type

[04:06] using a llama.

[04:07] Luckily, onslaught has a one liner

[04:10] After this is done, we only need to do one

[04:15] First, open up your terminal.

[04:16] I'm using the warp terminal here.

[04:18] Go to the path of where the file is saved.

[04:21] Then create a file called Model file

[04:25] This is Ollamas Docker

[04:28] where we can create new models

[04:30] In our model file

[04:32] So something like you're an SQL generator

[04:36] and gives them helpful SQL to use.

[04:38] Finally make sure Olan was running.

[04:40] And then we're just going to run

[04:42] This command will then read all the items

[04:46] and start using llama Dhcp under the hood

[04:50] on your machine. And congrats!

[04:51] You can now use your fine tuned

[04:54] all with the OpenAI compatible API

[05:01] If you're

[05:01] curious to know more about Alama,

[05:05] out about everything you need to know

[05:08] Otherwise, thank you for watching

⚡ Saved you time reading this? Transcribe any YouTube video for free — no signup needed.