Fine-Tuning Llama 3 on a Custom Dataset: Training LLM for a RAG Q&A Use Case on a Single GPU

Transcribed Jun 16, 2026 Watch on YouTube ↗

Intermediate 12 min read For: Machine learning engineers and data scientists with basic experience in transformers and fine-tuning, interested in practical LLM customization.

55.2K

Views

1.1K

Likes

40

Comments

46

Dislikes

2.0%

📊 Average

AI Summary

This video demonstrates how to fine-tune a Llama 3 8B instruct model on a custom financial dataset for a RAG Q&A application, all on a single GPU. The process covers dataset preparation, LoRA adapter setup, training, evaluation, and model deployment to Hugging Face Hub.

Chapters

1 Introduction and Overview 0:00 2 Dataset Preparation 1:01 3 Base Model Evaluation 8:19 4 Training Setup and Configuration 14:33 5 Training and Monitoring 23:08 6 Model Saving, Merging, and Deployment 26:32 7 Results and Comparison 28:45

[0:55]

Fine-tuning process overview

Steps include building a dataset from custom prompts, evaluating base model performance, setting up a LoRA adapter, training, evaluating on a test set, merging adapters, and pushing to Hugging Face Hub.

[2:21]

Dataset: Financial Q&A 10K

Uses the Financial Q&A 10K dataset from Hugging Face, containing ~7,000 examples with question, context, and answer columns. The dataset is financial in nature.

[2:58]

Base model: Llama 3 8B Instruct

Uses Meta AI's Llama 3 8B instruct model, quantized to 4-bit for single GPU training. It has an 8K token context length and a built-in chat template.

[3:56]

Hardware and libraries

Fine-tuning performed on a T4 GPU (16GB VRAM). Libraries used: PyTorch, Transformers, Datasets, Accelerate, bitsandbytes, PEFT, TRL (SFTTrainer), and evaluate.

[5:30]

Model loading and tokenizer setup

Model loaded in 4-bit using NF4 quantization. A padding token is added to the tokenizer to prevent generation issues during training.

[8:19]

Custom dataset creation

Converts the Hugging Face dataset to a DataFrame, formats examples using the chat template, counts tokens, filters out examples >512 tokens, and splits into train/validation/test sets.

[14:33]

Base model evaluation before fine-tuning

Tests the base model on 100 test examples. The base model is verbose and often produces bullet points or extra text, not matching the concise answers in the dataset.

[17:36]

Data collator for completion-only loss

Uses a data collator that masks input tokens with -100 so loss is calculated only on the generated completion, not the prompt. This speeds up training and improves results.

[19:05]

LoRA configuration

Targets all linear layers in the Llama architecture (query, key, value, MLP). LoRA rank=32, alpha=16. Only 1.34% of parameters (~84 million) are trained.

[23:08]

Training configuration

Max tokens=512, 1 epoch, batch size=2 with gradient accumulation steps=4 (effective batch size=8). Uses 8-bit AdamW optimizer, evaluates every 20% of training, warmup ratio=10%.

[26:32]

Model saving and merging

After training, the LoRA adapter is merged with the base model on a P100 GPU (T4 memory insufficient). The merged model is pushed to Hugging Face Hub as 'Llama-3-8B-Instruct-Finance-R'.

[28:45]

Comparison: fine-tuned vs base model

The fine-tuned model produces much shorter, more concise answers that match the dataset format. The base model remains verbose and adds unnecessary formatting.

Fine-tuning Llama 3 8B on a custom financial dataset significantly improves response quality, making it more concise and aligned with the desired format. The process is feasible on a single GPU using LoRA and 4-bit quantization.

Mentioned in this Video

PyTorch

tool

Transformers

tool

Datasets

tool

Accelerate

tool

bitsandbytes

tool

PEFT

tool

TRL (SFTTrainer)

tool

evaluate

tool

TensorBoard

tool

Hugging Face Hub

service

Philip Schmidt

person

Blog post: How to fine-tune LLMs in 2024 with Hugging Face

link

Financial Q&A 10K

dataset

Llama-3-8B-Instruct-Finance-R

model

Tutorial Checklist

1 0:55 Build a dataset from custom prompts (e.g., JSON file) and transform into Hugging Face dataset.

2 1:14 Choose and evaluate the initial performance of the base model (Llama 3 8B Instruct).

3 1:23 Set up a LoRA adapter for parameter-efficient fine-tuning.

4 1:46 Train the model and monitor the training process (approx. 2 hours on a single GPU).

5 1:58 Evaluate the fine-tuned model on a previously created test set.

6 2:06 Merge the LoRA adapter with the base model and push the merged model to Hugging Face Hub.

7 3:56 Install required libraries: torch, transformers, datasets, accelerate, bitsandbytes, peft, trl, evaluate.

8 5:30 Load the base model with 4-bit quantization (NF4) and add a padding token to the tokenizer.

9 8:19 Convert the dataset to a DataFrame, format examples using the chat template, count tokens, filter out examples >512 tokens, and split into train/validation/test sets.

10 17:36 Use a data collator that masks input tokens with -100 to calculate loss only on the completion.

11 19:05 Configure LoRA: target all linear layers, rank=32, alpha=16.

12 23:08 Set training arguments: max tokens=512, 1 epoch, batch size=2, gradient accumulation steps=4, 8-bit AdamW optimizer, evaluate every 20% of training, warmup ratio=10%.

13 26:32 Save the LoRA adapter, merge with base model on a P100 GPU, and push to Hugging Face Hub with max shard size of 5GB.

Study Flashcards (15)

What is the name of the dataset used for fine-tuning?

easy Click to reveal answer

Financial Q&A 10K