[0:00] Remember how back in the day people would
[0:03] Google themselves? You'd type your name into a search engine and see what it knew about you.
[0:08] Well, the modern equivalent of that is to do the same thing with a chatbot.
[0:13] So when I ask a large language model, who is Martin Keen?
[0:18] Well, the response varies greatly depending upon which model I'm asking,
[0:22] because different models have different training data sets and different knowledge cutoff dates.
[0:28] So what a given model knows about me, well, it differs greatly.
[0:32] But how could we improve the model's answer?
[0:36] Well, there's three ways.
[0:38] So let's start with a model here, and we're gonna see how we can improve its responses.
[0:44] Well, the first thing it could do is it could go out and it could perform a search,
[0:51] a search for new data that either wasn't in its training data set,
[0:54] or it was just data that became available after the model finished training,
[0:58] and then it could incorporate those results from the search back into its answer.
[1:03] That is called RAG, or Retrieval-Augmented Generation.
[1:11] That's one method.
[1:12] Or we could pick a specialized model, a model that's been trained on, let's say, transcripts of these videos.
[1:21] That would be an example of something called fine tuning,
[1:29] or we could ask the model a query that better specifies what we're looking for.
[1:36] So maybe the LLM already knows plenty about the Martin Keens of the world,
[1:41] but let's tell the model that we're referring to the Martin Keen who works at IBM,
[1:45] rather than the Martin Keen that founded Keen Shoes.
[1:50] That is an example of prompt engineering.
[1:55] Three ways to get better outputs out of large language models, each with their pluses and minuses.
[2:03] Let's start with RAG.
[2:05] So let's break it down.
[2:06] First there's retrieval.
[2:08] So retrieval of external up-to-date information.
[2:12] Then there's augmentation.
[2:14] That's augmentation of the original prompt with the retrieved information added in.
[2:19] And then finally there's generation.
[2:22] That's generation of a response based on all of this enriched context.
[2:27] So we can think of it like this.
[2:30] So we start with a query and the query comes in to a large language model.
[2:40] Now, what RAG is gonna do is it's first going to go searching through a corpus of information.
[2:48] So we have this corpus here full of some sort of data.
[2:53] Now, perhaps, that's your organization's documents.
[2:56] So it might be spreadsheets, PDFs, internal wikis, you know, stuff like that.
[3:01] But unlike a typical search engine that just matches keywords,
[3:07] RAG converts both your question, the query, and all of the documents into something called vector embeddings.
[3:18] So these are all converted into vectors,
[3:20] essentially turning words and phrases into long lists of numbers that capture their meaning.
[3:27] So when you ask a query like, what was our company's revenue growth last quarter?
[3:34] Well, RAG will find documents that are mathematically similar in meaning to your question,
[3:38] even if they don't use the exact same words.
[3:41] So it might find documents mentioning fourth quarter performance or quarterly sales.
[3:48] Those don't contain the keyword revenue growth, but they are semantically similar.
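As a rough illustration of "mathematically similar in meaning", the sketch below ranks documents by cosine similarity, one common way to compare embedding vectors. The three-dimensional vectors are made up by hand for this example; a real system would get them from an embedding model and they would be far longer. The names `cosine_similarity`, `rank_documents`, and `TOY_EMBEDDINGS` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made toy "embeddings" (assumption: not produced by any real model).
TOY_EMBEDDINGS = {
    "fourth quarter performance": [0.9, 0.8, 0.1],
    "quarterly sales": [0.8, 0.9, 0.2],
    "office holiday party schedule": [0.1, 0.2, 0.9],
}


def rank_documents(query_vector, embeddings):
    """Return document names sorted from most to least similar to the query."""
    return sorted(
        embeddings,
        key=lambda name: cosine_similarity(query_vector, embeddings[name]),
        reverse=True,
    )


# Hypothetical vector standing in for the embedded revenue-growth question.
query_vector = [0.85, 0.85, 0.15]
ranking = rank_documents(query_vector, TOY_EMBEDDINGS)
print(ranking)
```

Neither of the two finance-related documents contains the words "revenue growth", yet both score far above the unrelated one because their vectors point in nearly the same direction as the query.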
[3:54] Now, once RAG finds the relevant information, it adds this information
[3:59] back into your original query before passing it to the language model.
[4:06] So instead of the model just kind of guessing based on its training data,
[4:09] it can now generate a response that incorporates your actual facts and figures.
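The augmentation step can be sketched as plain string assembly: the retrieved passages are placed in front of the original question before anything is sent to the model. This is a minimal sketch under assumptions; the function name `augment_prompt`, the instruction wording, and the placeholder passages are all hypothetical, and no model is actually called.

```python
def augment_prompt(query, retrieved_passages):
    """Prepend retrieved passages to the user's query as context."""
    context = "\n".join(f"- {passage}" for passage in retrieved_passages)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


# Placeholder text standing in for documents returned by the retrieval step.
passages = [
    "Q4 sales summary (placeholder text).",
    "Quarterly performance memo (placeholder text).",
]
prompt = augment_prompt(
    "What was our company's revenue growth last quarter?", passages
)
print(prompt)
```

The model then sees the retrieved material and the question together in a single prompt.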
[4:15] So this makes RAG particularly valuable when you are looking for information that is up to date,
[4:24] and it's also very valuable when you need to add in information that is domain specific as well,
[4:34] but there are some costs to this.
[4:38] Let's go with the red pen.
[4:40] So one cost, that would be the performance cost
[4:45] of performing all of this, because you have this retrieval step here, and that
[4:50] adds latency to each query compared to a simple prompt to a model.
[4:55] There are also costs related to just kind of the processing of this as well.
[5:01] So if we think about what we're having to do here, we've got documents that need to be converted into vector embeddings,
[5:07] and we need to store these vector embeddings in a database.
[5:11] All of this adds to processing costs, it adds to infrastructure costs
[5:15] to make this solution work.
[5:17] All right, next up, fine tuning.
[5:20] So remember how we discussed getting better answers about me by
[5:24] training a model specifically on, let's say, my video transcripts.
[5:26] Well, that is fine tuning in action.
[5:30] So what we do with fine tuning is we take a model, but specifically an existing model,
[5:40] and that existing model has broad knowledge.
[5:44] And then we're gonna give it additional specialized training on a focused data set.
[5:51] So this is now specialized to what we want to develop particular expertise on.
[5:58] Now, during fine tuning, we're updating the model's internal parameters through additional training.
[6:05] So the model starts out with some weights here,
[6:10] like this, and those weights were optimized during its initial pre-training.
[6:16] And as we fine tune, we're making small adjustments here to the model's weights using this specialized data set.
[6:26] So this is being incorporated.
[6:29] Now this process typically uses supervised learning where we provide input-output
[6:34] pairs that demonstrate the kind of responses we want.
[6:37] So for example, if we're fine-tuning for technical support, we might provide thousands of examples of customer queries,
[6:46] and those would be paired with correct technical responses.
[6:50] The model adjusts its weights through backpropagation
[6:53] to minimize the difference between its predicted outputs and the targeted responses.
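To make the weight-adjustment idea concrete, here is a toy sketch, not an LLM: a two-parameter linear model that starts from "pre-trained" weights and is nudged by plain gradient descent on a handful of input-output pairs. Real fine-tuning runs the same kind of loop over billions of weights using a training framework. The data, the learning rate, and every name below are assumptions made up for illustration.

```python
def predict(weights, x):
    """Toy model: a straight line with slope w and intercept b."""
    w, b = weights
    return w * x + b


def mean_squared_error(weights, pairs):
    """Average squared gap between predicted and target outputs."""
    return sum((predict(weights, x) - y) ** 2 for x, y in pairs) / len(pairs)


def fine_tune(weights, pairs, learning_rate=0.01, steps=200):
    """Make many small weight updates that reduce the error on the pairs."""
    w, b = weights
    n = len(pairs)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in pairs) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in pairs) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return (w, b)


pretrained = (1.0, 0.0)  # stand-in for weights left over from pre-training
pairs = [(1.0, 2.1), (2.0, 4.0), (3.0, 6.2)]  # made-up specialized examples

loss_before = mean_squared_error(pretrained, pairs)
tuned = fine_tune(pretrained, pairs)
loss_after = mean_squared_error(tuned, pairs)
print(loss_before, loss_after)
```

The tuned weights start from the pre-trained ones rather than from scratch, which is the difference between fine-tuning and training a new model.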
[6:58] So we're not just teaching the model new facts here, we're actually modifying how it processes information.
[7:06] The model is learning to recognize domain-specific patterns.
[7:11] So, fine-tuning shows its strength when you particularly need a model that has very deep domain expertise.
[7:22] That's what we can really add in with fine tuning,
[7:25] and also, it's much faster, specifically at inference time.
[7:31] So when we are putting the queries in, it's faster than RAG because it doesn't need to search through external data,
[7:38] and because the knowledge is kind of baked into the model's weights, you don't need to maintain a separate vector database,
[7:43] but there's some downsides as well.
[7:46] Well, there's certainly issues here with the training complexity of all of this.
[7:54] You're going to need thousands of high quality training examples.
[7:59] There are also issues with computational cost.
[8:05] The computational cost for training this model can be substantial and is going to require a whole bunch of GPUs.
[8:12] And there's also challenges related to maintenance as well,
[8:17] because unlike RAG, where you can easily add new documents to your knowledge base at any point,
[8:22] updating a fine-tuned model requires another round of training.
[8:27] And then, perhaps most importantly of all, there is a risk of something called catastrophic forgetting.
[8:37] Now that's when the model loses some of its general capabilities while it's busy learning these specialized ones.
[8:44] So finally let's explore prompt engineering.
[8:48] Now specifying Martin Keen who works at IBM versus
[8:52] Martin Keen who founded Keen Shoes, that's prompt engineering, but at its most basic.
[8:57] Prompt engineering goes far beyond simple clarification.
[9:01] So let's think about when we input a prompt, the model receives this prompt and it processes it through a series of layers,
[9:16] and these layers are essentially attention mechanisms and each one
[9:21] focuses on different aspects of your prompt text that came in.
[9:25] And by including specific elements in your prompt, so examples or context or how you want the format to look,
[9:32] you're directing the model's attention to relevant patterns it learned during training.
[9:38] So for example, telling a model to think about this step-by-step,
[9:42] that activates patterns it learnt from training data where methodical reasoning led to accurate results.
[9:49] So a well-engineered prompt can transform a model's output without any additional training or without data retrieval.
[9:59] So take an example of a prompt.
[10:02] Let's say we say, is this code secure?
[10:06] Not a very good prompt.
[10:08] An engineered prompt, it might read a bit more like this.
[10:12] It's much more detailed.
[10:13] Now, we haven't changed the model, we haven't added new data, we've just better activated its existing capabilities.
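The on-screen prompt is not captured in this transcript, so the following is only a hypothetical sketch of what a more detailed version of "is this code secure?" could look like. It combines elements mentioned earlier: context, a step-by-step instruction, and a requested output format. The function name, the checklist wording, and the sample snippet are all assumptions.

```python
def build_security_review_prompt(code, language="Python"):
    """Assemble a detailed review prompt from a role, a task, and an output format."""
    return "\n".join([
        f"You are a security reviewer for {language} code.",
        "Think about this step by step.",
        "Check the code below for injection flaws, unsafe input handling, "
        "and hard-coded secrets.",
        "For each issue, report the line, the risk, and a suggested fix.",
        "If you find no issues, say so explicitly.",
        "",
        "Code:",
        code,
    ])


vague_prompt = "Is this code secure?"

# Made-up snippet to review; it builds a SQL string by concatenation.
snippet = 'query = "SELECT * FROM users WHERE name = \'" + name + "\'"'
engineered_prompt = build_security_review_prompt(snippet)
print(engineered_prompt)
```

Both prompts go to the same unchanged model; only the text it receives differs.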
[10:23] Now I think the benefits to this are pretty obvious.
[10:26] One is that we don't need to change any of our back-end infrastructure here
[10:32] because there are no infrastructure changes at all in order to prompt better, it's all on the user.
[10:39] There's also the benefit that by doing this, you get to see immediate responses and immediate results to what you do.
[10:50] We don't have to add in new training data or any kind of data processing,
[10:53] but of course there are some limitations to this as well.
[10:58] Prompt engineering is as much an art as it is a science.
[11:01] So there is certainly a good amount of trial and error in this sort of process to find effective prompts,
[11:10] and you're also limited in what you can do here, you're limited
[11:15] to existing knowledge because you're not able to actually add anything else in here.
[11:23] No additional amount of prompt engineering is going to teach it truly new information.
[11:28] You're not going to update anything that's outdated in the model.
[11:33] So we've talked about now RAG as being one option and we talked about fine tuning as being another one.
[11:44] And now, just now, we've talked about prompt engineering as well
[11:51] and I've really talked about those as three different distinct things here,
[11:57] but they're commonly used actually in combination.
[12:02] We might use all three together.
[12:05] So consider a legal AI system.
[12:07] RAG, that could retrieve specific cases and recent court decisions.
[12:12] The prompt engineering part, that could make sure that we follow proper legal document formats by asking for it.
[12:19] And then fine-tuning, that can help the model master firm-specific policies.
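One way to picture the three techniques working together is a small pipeline: a retrieval function supplies case material (RAG), the prompt asks for a specific document format (prompt engineering), and the text is handed to whatever model is plugged in, which could be a fine-tuned one. This is a sketch under assumptions: `retrieve` and `generate` are caller-supplied stand-ins, the memo headings are invented, and the stubs return placeholders rather than real results.

```python
def answer_legal_query(query, retrieve, generate):
    """Combine retrieval, a format-specifying prompt, and a pluggable model."""
    passages = retrieve(query)  # RAG: fetch relevant material
    context = "\n".join(f"- {passage}" for passage in passages)
    prompt = (  # prompt engineering: ask for the format explicitly
        "Format the answer as a legal memo with headings for "
        "Issue, Analysis, and Conclusion.\n\n"
        f"Relevant material:\n{context}\n\n"
        f"Question: {query}"
    )
    return generate(prompt)  # could be a fine-tuned model


def stub_retrieve(query):
    """Stand-in for a real search over cases and court decisions."""
    return ["(placeholder for a retrieved case summary)"]


def stub_generate(prompt):
    """Stand-in for a call to a fine-tuned model."""
    return f"[model output for a prompt of {len(prompt)} characters]"


result = answer_legal_query("Is the clause enforceable?", stub_retrieve, stub_generate)
print(result)
```

Because the retriever and the model are passed in as arguments, each piece can be swapped out without touching the other two.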
[12:24] I mean, basically, we can think of it like this.
[12:27] We can think that prompt engineering offers flexibility and immediate results, but it can't extend knowledge.
[12:34] RAG, that can extend knowledge, it provides up-to-date information, but with computational overhead,
[12:39] and then fine-tuning,
[12:41] that enables deep domain expertise, but it requires significant resources and maintenance.
[12:47] Basically, it comes down to picking the methods that work for you.
[12:52] You know, we've sure come a long way from vanity searching on Google.