
RAG vs Fine-Tuning vs Prompt Engineering: Optimizing AI Models

Transcribed Jun 18, 2026
Beginner · 6 min read · For: Anyone interested in understanding how to improve AI model outputs, from beginners to practitioners.
668.8K Views · 12.1K Likes · 217 Comments · 27 Dislikes · 1.8% (📊 Average)

AI Summary

The video explores how to improve the responses of large language models (LLMs) by comparing three key techniques: RAG, fine-tuning, and prompt engineering. It uses the example of asking an LLM 'Who is Martin Keen?' to illustrate how different models give different answers due to varying training data. The video then explains each method, its benefits, and its drawbacks.

[0:18]
Model Responses Vary

Different LLMs give different answers to the same question because they have different training data sets and knowledge cutoff dates.

[1:03]
RAG Definition

RAG stands for Retrieval Augmented Generation. It retrieves external up-to-date information, augments the original prompt with it, and then generates a response based on the enriched context.

[3:07]
Vector Embeddings in RAG

RAG converts both the query and documents into vector embeddings, which capture meaning mathematically, allowing it to find semantically similar information even without exact keyword matches.

[4:40]
RAG Costs

RAG adds latency and requires maintaining a vector database, increasing processing and infrastructure costs.

[5:20]
Fine-Tuning Process

Fine-tuning takes an existing model and gives it additional specialized training on a focused dataset, updating its internal parameters (weights) through supervised learning.

[7:22]
Fine-Tuning Advantages and Disadvantages

Fine-tuning is faster at inference time than RAG and doesn't require a separate vector database, but it requires thousands of high-quality training examples and significant computational resources.

[8:37]
Catastrophic Forgetting

Catastrophic forgetting is a risk where the model loses some of its general capabilities while learning specialized ones during fine-tuning.

[8:48]
Prompt Engineering Basics

Prompt engineering involves crafting prompts to better guide the model's attention by including examples, context, or desired format, without changing the model or adding data.

[10:26]
Prompt Engineering Benefits and Limitations

Prompt engineering offers immediate results and no infrastructure changes, but it cannot teach the model truly new information and requires trial and error.

[11:57]
Combining Methods

The three methods are often used in combination. For example, a legal AI system might use RAG for recent cases, prompt engineering for formatting, and fine-tuning for firm-specific policies.

Clickbait Check

90% Legit

"The title accurately reflects the content, which compares and explains all three techniques in detail."


Study Flashcards (9)

What does RAG stand for?

Difficulty: easy

Retrieval Augmented Generation

1:03

How does RAG find relevant information beyond keyword matching?

Difficulty: medium

It converts both the query and documents into vector embeddings that capture meaning mathematically.

3:07

What are the main costs associated with RAG?

Difficulty: medium

It adds latency to each query and requires maintaining a vector database, increasing processing and infrastructure costs.

4:40

How does fine-tuning modify a pre-trained model?

Difficulty: medium

It updates the model's internal parameters (weights) through additional training on a specialized dataset.

5:58

What are the advantages of fine-tuning over RAG?

Difficulty: medium

It is faster at inference time because it doesn't need to search external data, and it doesn't require a separate vector database.

7:22

What is a major risk of fine-tuning?

Difficulty: hard

Catastrophic forgetting, where the model loses some of its general capabilities while learning specialized ones.

8:37

What is prompt engineering?

Difficulty: easy

It involves crafting prompts to better guide the model's attention by including examples, context, or desired format.

8:48

What is a key limitation of prompt engineering?

Difficulty: medium

It cannot teach the model truly new information; it can only better access existing knowledge.

11:15

How do the three methods complement each other?

Difficulty: hard

Prompt engineering offers flexibility and immediate results, RAG extends knowledge with up-to-date info, and fine-tuning enables deep domain expertise.

12:27

💡 Key Takeaways

💡

Model Responses Vary by Training Data

Highlights the fundamental reason why different LLMs give different answers to the same question.

0:22
🔧

RAG Explained

Provides a clear, step-by-step breakdown of Retrieval Augmented Generation.

1:03
🔧

Fine-Tuning Process

Explains how fine-tuning modifies model weights through supervised learning.

5:20
🔧

Prompt Engineering Basics

Demonstrates how simple clarification can improve model output.

8:48
⚖️

Combining Methods

Shows that these techniques are often used together for optimal results.

11:57

✂️ Creator Tools: Viral Hooks

AI-generated clip ideas for Shorts based on the transcript

Googling yourself with AI?

44s

Relatable hook about modern self-search behavior grabs viewers instantly.


RAG's secret: Vector embeddings explained

47s

Demystifies complex AI concept with simple analogy, perfect for curious tech enthusiasts.


Fine-tuning: Make AI a domain expert

48s

Shows practical application of fine-tuning, appealing to developers and businesses.


Master prompt engineering in minutes

54s

Actionable tip that requires no technical setup, highly shareable.


Combine all three AI optimization methods

51s

Provides strategic insight for optimizing AI systems, valuable for professionals.


[00:00] Remember how back in the day people would

[00:03] Google themselves, you type your name into a search engine and you see what it knows about you?

[00:08] Well, the modern equivalent of that is to do the same thing with a chatbot.

[00:13] So when I ask a large language model, who is Martin Keen?

[00:18] Well, the response varies greatly depending upon which model I'm asking,

[00:22] because different models, they have different training data sets, they have different knowledge cutoff dates.

[00:28] So what a given model knows about me, well, it differs greatly.

[00:32] But how could we improve the model's answer?

[00:36] Well, there's three ways.

[00:38] So let's start with a model here, and we're gonna see how we can improve its responses.

[00:44] Well, the first thing it could do is it could go out and it could perform a search,

[00:51] a search for new data that either wasn't in its training data set,

[00:54] or it was just data that became available after the model finished training,

[00:58] and then it could incorporate those results from the search back into its answer.

[01:03] That is called RAG or Retrieval Augmented Generation.

[01:11] That's one method.

[01:12] Or we could pick a specialized model, a model that's been trained on, let's say, transcripts of these videos.

[01:21] That would be an example of something called fine tuning,

[01:29] or we could ask the model a query that better specifies what we're looking for.

[01:36] So maybe the LLM already knows plenty about the Martin Keens of the world,

[01:41] but let's tell the model that we're referring to the Martin Keen who works at IBM,

[01:45] rather than the Martin Keen that founded Keen Shoes.

[01:50] That is an example of prompt engineering.

[01:55] Three ways to get better outputs out of large language models, each with their pluses and minuses.

[02:03] Let's start with RAG.

[02:05] So let's break it down.

[02:06] First there's retrieval.

[02:08] So retrieval of external up-to-date information.

[02:12] Then there's augmentation.

[02:14] That's augmentation of the original prompt with the retrieved information added in.

[02:19] And then finally there's generation.

[02:22] That's generation of a response based on all of this enriched context.

[02:27] So we can think of it like this.

[02:30] So we start with a query and the query comes in to a large language model.

[02:40] Now, what RAG is gonna do is it's first going to go searching through a corpus of information.

[02:48] So we have this corpus here full of some sort of data.

[02:53] Now, perhaps, that's your organization's documents.

[02:56] So it might be spreadsheets, PDFs, internal wikis, you know, stuff like that,

[03:01] but unlike a typical search engine that just matches keywords,

[03:07] RAG converts both your question, the query, and all of the documents into something called vector embeddings.

[03:18] So these are all converted into vectors,

[03:20] essentially turning words and phrases into long lists of numbers that capture their meaning.

[03:27] So when you ask a query like, what was our company's revenue growth last quarter?

[03:34] Well, RAG will find documents that are mathematically similar in meaning to your question,

[03:38] even if they don't use the exact same words.

[03:41] So it might find documents mentioning fourth quarter performance or quarterly sales.

[03:48] Those don't contain the keyword revenue growth, but they are semantically similar.
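
To make the retrieval step concrete, here is a minimal sketch of similarity search over embeddings. It is an illustration only: the three-number vectors are hand-made stand-ins (a real system would get much longer vectors from an embedding model), and the names `DOCUMENTS`, `retrieve`, and `query_vector` are hypothetical rather than anything shown in the video.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made 3-dimensional "embeddings", for illustration only.
# A real system would compute these with an embedding model.
DOCUMENTS = {
    "Fourth quarter performance summary": [0.9, 0.1, 0.0],
    "Quarterly sales report": [0.8, 0.3, 0.1],
    "Office parking policy": [0.0, 0.2, 0.9],
}

def retrieve(query_vector, documents, top_k=2):
    """Return the top_k document titles ranked by similarity to the query."""
    ranked = sorted(
        documents,
        key=lambda title: cosine_similarity(query_vector, documents[title]),
        reverse=True,
    )
    return ranked[:top_k]

# Assumed embedding of "What was our company's revenue growth last quarter?"
query_vector = [0.85, 0.2, 0.05]
top_documents = retrieve(query_vector, DOCUMENTS)
print(top_documents)
```

Neither of the two retrieved titles contains the words "revenue growth"; they rank highest only because their vectors point in nearly the same direction as the query vector.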

[03:54] Now, once RAG finds the relevant information, it adds this information

[03:59] back into your original query before passing it to the language model.

[04:06] So instead of the model just kind of guessing based on its training data,

[04:09] it can now generate a response that incorporates your actual facts and figures.
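
The augmentation step described here can be sketched as plain string assembly. This is a hedged example, not the video's implementation: `augment_prompt` is a hypothetical helper, and the two passages are invented placeholder text standing in for whatever the retrieval step returned.

```python
def augment_prompt(query, retrieved_passages):
    """Place retrieved passages ahead of the user's question."""
    context = "\n".join(f"- {passage}" for passage in retrieved_passages)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}"
    )

# Placeholder passages; in practice these come from the retrieval step.
passages = [
    "Fourth quarter performance: sales rose compared with the third quarter.",
    "Quarterly sales report: growth was strongest in the services segment.",
]
prompt = augment_prompt(
    "What was our company's revenue growth last quarter?", passages
)
print(prompt)
```

The enriched string, rather than the bare question, is what gets passed to the language model for generation.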

[04:15] So this makes RAG particularly valuable when you are looking for information that is up to date,

[04:24] and it's also very valuable when you need to add in information that is domain specific as well,

[04:34] but there are some costs to this.

[04:38] Let's go with the red pen.

[04:40] So one cost, that would be the cost of performance,

[04:45] of performing all of this, because you have this retrieval step here, and that

[04:50] adds latency to each query compared to a simple prompt to a model.

[04:55] There are also costs related to just kind of the processing of this as well.

[05:01] So if we think about what we're having to do here, we've got documents that need to be converted into vector embeddings,

[05:07] and we need to store these vector embeddings in a database.

[05:11] All of this adds to processing costs, it adds to infrastructure costs

[05:15] to make this solution work.

[05:17] All right, next up, fine tuning.

[05:20] So remember how we discussed getting better answers about me by

[05:24] training a model specifically on, let's say, my video transcripts.

[05:26] Well, that is fine tuning in action.

[05:30] So what we do with fine tuning is we take a model, but specifically an existing model,

[05:40] and that existing model has broad knowledge.

[05:44] And then we're gonna give it additional specialized training on a focused data set.

[05:51] So this is now specialized to what we want to develop particular expertise on.

[05:58] Now, during fine tuning, we're updating the model's internal parameters through additional training.

[06:05] So the model starts out with some weights here,

[06:10] like this, and those weights were optimized during its initial pre-training.

[06:16] And as we fine tune, we're making small adjustments here to the model's weights using this specialized data set.

[06:26] So this is being incorporated.

[06:29] Now this process typically uses supervised learning where we provide input-output

[06:34] pairs that demonstrate the kind of responses we want.

[06:37] So for example, if we're fine-tuning for technical support, we might provide thousands of examples of customer queries,

[06:46] and those would be paired with correct technical responses.

[06:50] The model adjusts its weights through backpropagation

[06:53] to minimize the difference between its predicted outputs and the target responses.
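
As a rough illustration of "small adjustments to the weights", the sketch below starts from a stand-in pre-trained weight pair and nudges it with gradient descent on a handful of input-output pairs. It is a toy under stated assumptions: the model is a single linear unit `y = w * x + b`, the gradients are written out by hand (in a real network, backpropagation computes them through many layers), and all names and numbers are made up for the example.

```python
def predict(weights, x):
    """One linear unit: y = w * x + b."""
    w, b = weights
    return w * x + b

def mean_squared_error(weights, pairs):
    """Average squared gap between predictions and target outputs."""
    return sum((predict(weights, x) - y) ** 2 for x, y in pairs) / len(pairs)

def fine_tune(weights, pairs, learning_rate=0.01, epochs=200):
    """Nudge the starting weights by gradient descent on the given pairs."""
    w, b = weights
    n = len(pairs)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in pairs) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in pairs) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return (w, b)

pretrained = (1.0, 0.0)  # stand-in for weights left by pre-training
specialized_pairs = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.0)]

tuned = fine_tune(pretrained, specialized_pairs)
loss_before = mean_squared_error(pretrained, specialized_pairs)
loss_after = mean_squared_error(tuned, specialized_pairs)
print(f"loss before: {loss_before:.3f}, loss after: {loss_after:.3f}")
```

The loss on the specialized pairs drops because the weights themselves changed, which is the key difference from RAG and prompt engineering, where the weights stay fixed.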

[06:58] So we're not just teaching the model new facts here, we're actually modifying how it processes information.

[07:06] The model is learning to recognize domain-specific patterns.

[07:11] So, fine-tuning shows its strength when you particularly need a model that has very deep domain expertise.

[07:22] That's what we can really add in with fine tuning,

[07:25] and also, it's much faster, specifically at inference time.

[07:31] So when we are putting the queries in, it's faster than RAG because it doesn't need to search through external data,

[07:38] and because the knowledge is kind of baked into the model's weights, you don't need to maintain a separate vector database,

[07:43] but there's some downsides as well.

[07:46] Well, there's certainly issues here with the training complexity of all of this.

[07:54] You're going to need thousands of high quality training examples.

[07:59] There are also issues with computational cost.

[08:05] The computational cost for training this model can be substantial and is going to require a whole bunch of GPUs.

[08:12] And there's also challenges related to maintenance as well

[08:17] because unlike RAG, where you can easily add new documents to your knowledge base at any point,

[08:22] updating a fine-tuned model requires another round of training

[08:27] and then perhaps most importantly of all there is a risk of something called catastrophic forgetting.

[08:37] Now that's when the model loses some of its general capabilities while it's busy learning these specialized ones.
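
A very loose miniature of catastrophic forgetting, assuming a one-weight model `y = w * x`: the weight starts out fitting "general" data perfectly, then is trained only on "specialized" data, and its error on the general data goes up. Real models have billions of weights and the effect is far subtler; the data and names here are hypothetical.

```python
def loss(w, pairs):
    """Mean squared error of the one-weight model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in pairs) / len(pairs)

def fine_tune(w, pairs, learning_rate=0.05, epochs=100):
    """Gradient descent on the given pairs only; other data is never seen."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in pairs) / len(pairs)
        w -= learning_rate * grad
    return w

general_pairs = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]      # follows y = 2x
specialized_pairs = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]  # follows y = 3x

w_pretrained = 2.0  # fits the general data exactly
w_tuned = fine_tune(w_pretrained, specialized_pairs)

print("specialized loss:", loss(w_pretrained, specialized_pairs), "->", loss(w_tuned, specialized_pairs))
print("general loss:    ", loss(w_pretrained, general_pairs), "->", loss(w_tuned, general_pairs))
```

The specialized loss falls while the general loss rises, because the same weight has to serve both tasks and the fine-tuning data only pulls it one way.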

[08:44] So finally let's explore prompt engineering.

[08:48] Now specifying Martin Keen who works at IBM versus

[08:52] Martin Keen who founded Keen Shoes, that's prompt engineering, but at its most basic.

[08:57] Prompt engineering goes far beyond simple clarification.

[09:01] So let's think about when we input a prompt, the model receives this prompt and it processes it through a series of layers,

[09:16] and these layers are essentially attention mechanisms and each one

[09:21] focuses on different aspects of your prompt text that came in.

[09:25] And by including specific elements in your prompt, so examples or context or how you want the format to look,

[09:32] you're directing the model's attention to relevant patterns it learned during training.

[09:38] So for example, telling a model to think about this step-by-step,

[09:42] that activates patterns it learnt from training data where methodical reasoning led to accurate results.

[09:49] So a well-engineered prompt can transform a model's output without any additional training or without data retrieval.

[09:59] So take an example of a prompt.

[10:02] Let's say we say, is this code secure?

[10:06] Not a very good prompt.

[10:08] An engineered prompt, it might read a bit more like this.

[10:12] It's much more detailed.

[10:13] Now,

[10:14] we haven't changed the model, we haven't added new data, we've just better activated its existing capabilities.
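
The engineered prompt shown on screen is not captured in the transcript, so the following is only a guess at the general shape: a hypothetical `build_prompt` helper that adds the elements mentioned above (context, examples, desired format) around the task. The wording of the example prompt is invented for illustration.

```python
def build_prompt(task, context="", examples=(), output_format=""):
    """Assemble a prompt from a task plus optional context, examples, and format."""
    parts = []
    if context:
        parts.append(f"Context: {context}")
    for example_input, example_output in examples:
        parts.append(
            f"Example input: {example_input}\nExample output: {example_output}"
        )
    parts.append(f"Task: {task}")
    if output_format:
        parts.append(f"Format: {output_format}")
    return "\n\n".join(parts)

basic = build_prompt("Is this code secure?")
engineered = build_prompt(
    task="Review this login handler for security issues. Think about this step by step.",
    context="You are reviewing a web application's login handler.",
    output_format="A numbered list of issues, each with a suggested fix.",
)
print(basic)
print("---")
print(engineered)
```

Both prompts go to the same unchanged model; only the text sent to it differs.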

[10:23] Now I think the benefits to this are pretty obvious.

[10:26] One is that we don't need to change any of our back-end infrastructure here

[10:32] because there are no infrastructure changes at all in order to prompt better, it's all on the user.

[10:39] There's also the benefit that by doing this, you get to see immediate responses and immediate results to what you do.

[10:50] We don't have to add in new training data or any kind of data processing,

[10:53] but of course there are some limitations to this as well.

[10:58] Prompt engineering is as much an art as it is a science.

[11:01] So there is certainly a good amount of trial and error in this sort of process to find effective prompts,

[11:10] and you're also limited in what you can do here, you're limited

[11:15] to existing knowledge because you're not able to actually add anything else in here.

[11:23] No additional amount of prompt engineering is going to teach it truly new information.

[11:28] You're not going to update anything that's outdated in the model.

[11:33] So we've talked about now RAG as being one option and we talked about fine tuning as being another one.

[11:44] And now, just now, we've talked about prompt engineering as well

[11:51] and I've really talked about those as three different distinct things here,

[11:57] but they're commonly used actually in combination.

[12:02] We might use all three together.

[12:05] So consider a legal AI system.

[12:07] RAG, that could retrieve specific cases and recent court decisions.

[12:12] The prompt engineering part, that could make sure that we follow proper legal document formats by asking for it.

[12:19] And then fine-tuning, that can help the model master firm-specific policies.

[12:24] I mean, basically, we can think of it like this.

[12:27] We can think that prompt engineering offers flexibility and immediate results, but it can't extend knowledge.

[12:34] RAG, that can extend knowledge, it provides up-to-date information, but with computational overhead.

[12:39] And then fine-tuning,

[12:41] that enables deep domain expertise, but it requires significant resources and maintenance.

[12:47] Basically, it comes down to picking the methods that work for you.
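
One way to read the trade-offs above is as a simple rule of thumb. The helper below is a hypothetical sketch, not a method from the video: it assumes three yes/no requirements and maps them to the techniques, treating prompt engineering as always available because it needs no infrastructure changes.

```python
def suggest_methods(needs_fresh_data, needs_deep_domain_expertise, can_afford_training):
    """Map three yes/no requirements to the techniques discussed in the video."""
    methods = ["prompt engineering"]  # no infrastructure changes required
    if needs_fresh_data:
        methods.append("RAG")  # extends knowledge, at the cost of retrieval overhead
    if needs_deep_domain_expertise and can_afford_training:
        methods.append("fine-tuning")  # needs training examples and compute
    return methods

# The legal AI example: recent cases, firm-specific expertise, budget for training.
legal_system = suggest_methods(
    needs_fresh_data=True,
    needs_deep_domain_expertise=True,
    can_afford_training=True,
)
print(legal_system)
```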

[12:52] You know, we've sure come a long way from vanity searching on Google.

⚡ Saved you time reading this? Transcribe any YouTube video for free — no signup needed.