[0:05] So let's get started. I'll be [0:07] talking about building LLMs today. [0:10] I think a lot of you have heard of LLMs [0:12] before, but just as a quick recap, [0:16] LLMs, standing for large language models, [0:18] are basically all the chatbots that [0:20] you've been hearing about recently: [0:24] ChatGPT from OpenAI, Claude from [0:27] Anthropic, Gemini, Llama, and other types [0:30] of models like this. Today we'll be [0:32] talking about how they actually work. [0:34] It's going to be an overview, because [0:35] it's only one lecture and it's hard to [0:37] compress everything, but hopefully I'll [0:39] touch a little bit on all the [0:40] components that are needed to train [0:42] some of these LLMs. Also, if you have [0:44] questions, please interrupt me and ask. [0:47] If you have a question, most likely other [0:49] people in the room or on Zoom [0:52] have the same question, so please ask. [0:56] Great. So what matters when training LLMs? [0:59] There are a few key components that [1:01] matter. One is the architecture. As [1:04] you probably all know, LLMs are neural [1:06] networks, and when you think about neural [1:08] networks you have to think about what [1:10] architecture you're using. Another [1:12] component which is really important [1:13] is the training loss and the training [1:15] algorithm, so how you actually train [1:18] these models. Then it's data, so what [1:21] do you train these models on. Then the [1:24] evaluation, which is how do you know [1:26] whether you're actually making progress [1:28] towards the goal of LLMs. And then [1:32] the system component, which is how [1:34] do you actually make these models run on [1:37] modern hardware, which is really [1:38] important because these models are [1:39] really large, so now more than ever [1:42] systems is actually a really important [1:44] topic for [1:46] LLMs. So, those five components. You [1:50] probably
all know that LLMs, and if you [1:52] don't know, LLMs are all based on [1:54] Transformers, or at least some version of [1:56] Transformers. I'm actually not going [1:58] to talk about the architecture today, [2:01] one, because I gave a lecture on [2:04] Transformers a few weeks ago, and two, [2:06] because you can find so much information [2:08] online on Transformers, but I think [2:10] there's much less [2:12] information about the other four topics, [2:14] so I really want to talk about those. [2:17] Another thing to say is that most of [2:19] academia actually focuses on [2:21] architecture and training algorithms and [2:23] losses. As academics, and I've done [2:26] that for a big part of my career, [2:29] we simply like thinking that [2:31] we make new architectures, new [2:33] models, and it seems like it's very [2:36] important, but in reality, honestly, what [2:38] matters in practice is mostly the three [2:40] other topics, so data, evaluation, and [2:43] systems, which is what most of [2:45] industry actually focuses on. So [2:48] that's also one of the reasons why I [2:49] don't want to talk too much about the [2:51] architecture, because really the rest [2:52] is super [2:53] important. Great, so overview of the [2:56] lecture. I'll be talking about [2:57] pre-training. Pre-training, you've [2:59] probably heard that word; this is the [3:01] general word, this is kind of the [3:02] classical language modeling paradigm, [3:06] where you basically train your [3:07] language model to essentially model all [3:09] of the internet. And then there's post- [3:11] training, which is a more recent paradigm, [3:13] which is taking these large language [3:14] models and making them essentially AI [3:17] assistants. So this is more of a [3:19] recent trend since ChatGPT. So if you've [3:23] ever heard of GPT-3 or GPT-2, that's really [3:25] pre-training land. If you've heard of [3:28] ChatGPT, which you probably have, this is
[3:30] really post-training land. So I'll be [3:32] talking about both, but I'll start with [3:34] pre-training, and specifically I'll [3:36] talk about what the task of [3:38] pre-training LLMs is and what the loss is [3:40] that people actually [3:42] use. So, language modeling, this is a quick [3:45] recap. Language models, at a high level, [3:48] are simply models of a probability [3:50] distribution over sequences of tokens or [3:52] of words. So it's basically some model [3:56] of P of x1 to xL, where x1 is basically [3:59] word one and xL is the last one in [4:01] the sequence or in the sentence. So [4:04] very concretely, if you have a sentence [4:06] like "the mouse ate the cheese", what the [4:08] language model gives you is simply a [4:10] probability of this sentence being [4:13] uttered by a human or being found [4:16] online. So if you have another [4:18] sentence like "the the mouse ate cheese", [4:22] here there are grammatical mistakes, so the [4:23] model should [4:25] have some syntactic knowledge, so it [4:27] should know that this has less [4:29] likelihood of appearing [4:31] online. If you have another sentence [4:34] like "the cheese ate the mouse", then [4:36] the model should hopefully know about [4:38] the fact that usually cheese doesn't eat [4:41] mice, so there's some semantic [4:43] knowledge, and this is less likely than [4:44] the first sentence. So this is basically, [4:46] at a high level, what language models are. [4:49] One term that you have probably been [4:51] hearing a lot in the news is generative [4:53] models. So these are just [4:55] models that can generate [4:57] sentences or can generate some data. [4:59] The reason why we say language models [5:01] are generative models is that once you [5:02] have a model of a distribution, you can [5:04] simply sample from this model, and now you [5:06] can generate data, so you can generate [5:08] sentences using a
language [5:11] model. So the type of models that [5:14] people are all currently using are what [5:16] we call autoregressive language models, [5:19] and the key idea of autoregressive [5:21] language models is that you take this [5:23] distribution over words and you [5:26] basically decompose it into the [5:28] distribution of the first word, multiplied [5:31] by the [5:33] likelihood of the [5:34] second word given the first word, [5:37] multiplied by P of the third word given [5:39] the first two words. So there's no [5:42] approximation here, this is just the [5:43] chain rule of probability, which you [5:44] hopefully all know about. Really no [5:46] approximation, this is just one way of [5:48] modeling a [5:49] distribution. So slightly more [5:51] concisely, you can write it as a product [5:53] of the probabilities of the next word given [5:56] everything which happened in the past, so [5:58] given the context. So this is [6:00] what we call autoregressive language [6:02] models. Again, this is really not the only [6:04] way of modeling a distribution, this is [6:06] just one way. It has some benefits and [6:09] some downsides. One downside of [6:11] autoregressive language models is that [6:13] when you actually sample from this [6:14] autoregressive language model, you [6:16] basically have a for loop which [6:17] generates the next word, then conditions [6:20] on that next word, and then generates [6:22] another word. So basically, if you have a [6:24] longer sentence that you want to [6:25] generate, it takes more time to [6:27] generate it. So there are some [6:28] downsides to this current paradigm, but [6:31] that's what we currently have, so I'm [6:33] going to talk about this [6:34] one. Great, so autoregressive language [6:37] models. At a high level, the task [6:40] of an autoregressive language model is [6:42] simply predicting the next word, as I [6:43] just said. So if you have a
sentence like [6:45] "she likely prefers", one potential next [6:48] word might be "dogs". And the way we do [6:51] it is that we first tokenize, so you take [6:55] these words or subwords, you tokenize [6:57] them, and then you give an ID to [7:00] each token, so here you have 1, 2, 3. [7:03] Then you pass it through this black box. [7:05] As I already said, we're not going to [7:06] talk about the architecture, you just [7:07] pass it through a model, and you [7:10] then get a probability [7:12] distribution over the next word, over the [7:15] next token. And then you sample from [7:19] this distribution, you get a new token, [7:21] and then you detokenize. So you get a [7:23] new ID, you then detokenize, and that's [7:25] how you basically sample from a language [7:27] model. One thing which is important to [7:29] note is that the last two steps [7:31] are actually only needed during [7:33] inference. When you do training, you [7:35] just need to predict the most likely [7:37] token, and you can just compare to the [7:39] real token which happened in practice, and [7:41] then you basically change the weights of [7:43] your model to increase the probability [7:45] of generating that [7:48] token. Great, so autoregressive neural [7:51] language models. To be slightly more [7:53] specific, still without talking about the [7:54] architecture, the first thing we do is [7:57] that we have all of these... oh sorry, yes? [7:59] On the previous slide, when you're [8:01] predicting the probability of the next [8:02] tokens, does this mean that your final [8:04] output vector has to be the same [8:06] dimensionality as the number of tokens [8:08] that you have? Yes. How do you deal with [8:11] it if you're [8:13] adding more tokens to your corpus or something? [8:16] Yeah, so we're going to talk about [8:17] tokenization actually later, so you [8:19] will get some sense of this. You [8:22] basically can't deal with
adding new [8:24] tokens. I'm kind of exaggerating, [8:26] there are methods for doing it, but [8:27] essentially people don't do it. So [8:30] it's really important to think about how [8:32] you tokenize your text, and that's why [8:33] we'll talk about that later. But it's a [8:36] very good point to notice that [8:37] the vocabulary size, so the [8:38] number of tokens that you have, is [8:40] essentially the output dimension of your [8:42] language model, so it's actually pretty [8:44] [8:45] large. OK, so autoregressive neural [8:47] language models. The first thing you do is [8:49] that you take every word or every token [8:52] and you embed them, so you get some [8:55] vector representation for each of these [8:57] tokens. You pass them through some neural [8:59] network, as we said it's a Transformer, [9:01] then you get a representation for [9:03] all the words in the context, [9:06] so it's basically a representation of the [9:08] entire sentence. You pass it through a [9:10] linear layer, as you just said, to [9:13] basically map it so that [9:16] the number of outputs is the [9:17] number of tokens. You then pass it [9:20] through a softmax, and you basically [9:22] get a probability distribution over the [9:25] next word given every word in the [9:27] context. [9:30] And the loss that you use is basically, [9:32] it's essentially a task of classifying [9:34] the next token, so it's a very simple [9:36] kind of machine learning task, so you use [9:37] the cross-entropy loss, where you [9:39] basically look at the actual target [9:43] that happened, which is a target [9:45] distribution, which is a one-hot encoding, [9:46] which here, in this case, says: I [9:49] saw... the real word that happened is [9:51] "cat", so that's a one-hot distribution [9:54] over "cat". And here, this is the actual, [9:57] do you see my mouse? Oh yeah, this is the [9:59] distribution that you generated, and [10:00] basically you do cross
entropy, which [10:02] really just increases the probability of [10:03] generating "cat" and decreases the [10:05] probability of generating all the other [10:07] tokens. One thing to notice is that, as [10:10] you all know, again, this is just [10:12] equivalent to maximizing [10:14] the text log-likelihood, because you [10:16] can just rewrite the max over the [10:20] probability of this autoregressive [10:22] language modeling task as just being this [10:25] minimum, where I just added the log here [10:27] and a minus, which is just the minimum of [10:30] the loss, which is the cross-entropy loss. So [10:31] basically minimizing the loss is the [10:33] same thing as maximizing the likelihood [10:35] of your text. Any [10:42] questions? [10:43] OK. [10:45] Tokenizers. So this is one thing that [10:48] people usually don't talk that much [10:50] about. Tokenizers are extremely important, [10:53] so it's really important that you [10:54] understand at least what they [10:56] do at a high level. So why do we need [10:58] tokenizers in the first place? First, tokens are [11:01] more general than words. One simple [11:04] thing that you might think is, oh, we're [11:05] just going to take every word that we [11:06] have and just say every word is [11:09] a token on its own. But then [11:11] what happens is, if there's a typo in [11:13] your word, then you might not have any [11:16] token associated with this word [11:19] with a typo, and then you don't know how [11:20] to actually pass this word with a typo [11:22] into a large language model, so what do [11:24] you do next? And also, even if you think [11:27] about words, words [11:29] are fine with Latin-based languages, [11:32] but if you think about a language [11:34] like Thai, you won't have a simple way of [11:36] tokenizing by spaces, because there are [11:37] no spaces between words. So really, [11:41] tokens are much more general than words.
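To make the word-level problem just described concrete, here is a minimal sketch. The toy vocabulary, the `encode_words` helper, and the Thai example string are assumptions made up for illustration; they are not from the lecture or from any real tokenizer.

```python
# Hypothetical toy word-level vocabulary (not from the lecture).
word_vocab = {"the": 0, "mouse": 1, "ate": 2, "cheese": 3}


def encode_words(text, vocab):
    """Split on spaces and look up each word; unknown words map to None."""
    return [vocab.get(word) for word in text.split(" ")]


# A clean sentence encodes without trouble.
clean_ids = encode_words("the mouse ate the cheese", word_vocab)

# A single typo ("chesse") has no entry, so there is no ID to feed the model.
typo_ids = encode_words("the mouse ate the chesse", word_vocab)

# Thai is written without spaces between words, so splitting on spaces
# returns the whole sentence as a single piece.
thai_pieces = "ฉันกินข้าว".split(" ")

print(clean_ids)
print(typo_ids)
print(len(thai_pieces))
```

The `None` in `typo_ids` is the gap the speaker is pointing at: a pure word-level vocabulary has nothing to fall back on, which is one motivation for subword tokens.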
[11:43] That's the first thing. The second thing that you might [11:45] think is that you might tokenize every [11:47] sentence character by character. You [11:49] might say "a" is one token, "b" is another [11:51] token. That would actually work, and [11:54] probably very well. The issue is that [11:56] then your sequence becomes super long, [11:58] and as you probably remember from the [12:00] lecture on Transformers, the [12:02] complexity grows quadratically with [12:05] the length of sequences, so you really [12:07] don't want to have a super long sequence. [12:09] So tokenizers basically try to deal [12:13] with those two problems and give common [12:17] subsequences a certain token, and usually [12:20] how you should think about it is that [12:22] on average every token is around [12:24] three to four letters. [12:26] And there are many algorithms for [12:29] tokenization. I'll just talk about one of [12:31] them to give you a high-level idea, which is [12:32] what we call byte pair encoding, which is [12:34] actually pretty common, one of the two [12:35] most common tokenizers. The way that [12:38] you train a tokenizer is that first you [12:40] start with a very large corpus of text, [12:42] and here I'm really not talking about [12:44] training a large language model yet, this [12:45] is purely for the tokenization step. [12:48] So this is my large corpus of text with [12:50] these five words. Then you associate [12:54] every character in this corpus of text with a [12:57] different token, so here I just split [12:59] up every character with a different [13:01] token, and I just color coded all of [13:04] those tokens. And then what you do is [13:06] that you go through your text, and every [13:08] time you see pairs of tokens that are [13:11] very common, the most common pair of [13:13] tokens, you just merge them. So here you [13:15] see three times the tokens "t" and [13:19] "o" next to each other, so you're just [13:21] going to say this is a new token, and [13:22] then you
continue, you repeat that. So now [13:24] you have "tok", which happens three [13:27] times, "tok" with an "e", which happens, sorry, [13:31] two times, and "token", which happens [13:34] twice, and then "ex", which also happens [13:36] twice. So if you were to [13:39] train a tokenizer on this corpus of text, [13:41] which is very small, that's how you would [13:43] finish with a [13:45] trained tokenizer. In reality you do [13:48] it on much larger corpora of text. [13:51] And this is the real tokenizer of, [13:54] actually I think this is GPT-3 or [13:56] ChatGPT, and here you see how it would [13:59] actually separate these words. So [14:00] basically you see the same thing as what [14:01] we gave in the previous example: "token" [14:04] becomes its own token, so "tokenizer" is [14:08] actually split up into two tokens, "token" [14:10] and "izer". So yeah, that's all about [14:14] tokenizers. Any questions on that? Yeah? [14:16] How do you deal with spaces, and how do you [14:18] deal [14:19] with... Yeah, so actually there's a step [14:22] before tokenizers, which is what we call [14:24] pre-tokenizers, which is exactly what [14:26] you just said. So this is mostly... [14:29] in theory there's no reason to deal with [14:31] spaces and punctuation separately. You [14:34] could just say every space gets its own [14:36] token, every punctuation mark gets its [14:39] own token, and you can just do all the [14:41] merging. The problem is that there's [14:43] an efficiency question. Actually, training [14:45] these tokenizers takes a long time, [14:48] because you have to [14:49] consider every pair of tokens. So what you [14:52] end up doing is saying, if there's a [14:53] space, and pre-tokenizers [14:55] are very English-specific, you say if [14:57] there's a space, we're not going to start [14:59] looking at the token that came [15:00] before and the token that came [15:02] afterwards, so you're not merging in [15:04]
between spaces. But this is just a [15:07] computational optimization, [15:10] you could theoretically just deal with [15:11] it the same way as you deal with any [15:13] other character. And yeah? When you merge [15:17] tokens, do you delete the tokens that you [15:19] merged away, or do you keep the [15:21] smaller tokens that were merged? You [15:23] actually keep the smaller tokens. I mean, [15:25] in reality it doesn't matter much, [15:26] because usually on a large corpus of [15:31] text you will actually have everything, [15:32] but you usually keep the small ones, [15:34] and the reason why you want to do that [15:35] is because, as we said [15:37] before, in case you have some grammatical [15:40] mistakes or some typos, you still want to [15:42] be able to represent these words character by [15:44] character. So yeah. Yes? Are the tokens [15:50] unique? So I mean, say in this case, "token", [15:54] is there only one occurrence, or do [15:56] you need to leave multiple occurrences so [15:59] they could take on different [16:00] meanings or something? Oh, I see what [16:02] you're saying. No, every token has its [16:05] own unique ID. This is a [16:10] great question. For example, if you think [16:11] about "bank", which could be bank for [16:14] money or bank like water, it will [16:16] have the same token, but the model, [16:18] the Transformer, will learn that [16:20] based on the words that are around it, it [16:23] should associate that, I'm [16:25] being very hand-wavy here, but associate [16:27] that with a [16:29] representation that is either more like [16:31] the bank money side or the bank water [16:34] side. But that's the Transformer that [16:35] does that, it's not the [16:37] tokenizer. Yes? Yeah, so you mentioned [16:40] during tokenization you keep the smaller [16:41] tokens you started with, right? Like if [16:44] you start with a "t", you keep the "t", and [16:46] then you build your
tokenizer to the [16:48] point that you can now encode "token". So let's say [16:50] maybe you didn't train on "token", but [16:52] in your data you are trying to encode [16:54] "token", so how does the tokenizer know to [16:57] encode it with "token" or... [17:00] Great question. [17:01] When you tokenize, so that's after [17:03] training of the tokenizer, when you [17:04] actually apply the tokenizer, you [17:06] basically always choose the largest [17:09] token that you can apply. So if you [17:11] can do "token", you will never do "t", you [17:13] will always do "token". But there's [17:16] actually, so people don't usually talk [17:18] that much about tokenizers, but [17:20] there are a lot of computational [17:22] benefits, or computational tricks, that [17:24] you can do for making these things [17:26] faster. So I really don't think we... and [17:28] honestly, I think a lot of people think [17:29] that we should just get away from [17:31] tokenizers and just kind of tokenize [17:34] character by character or byte by byte, [17:37] but as I said, right now there's this [17:38] issue of length. But maybe one [17:40] day, in five or 10 years, we will [17:42] have different architectures that don't [17:43] scale quadratically with the length of [17:45] the sequence, and maybe we'll, yeah, [17:49] move away from tokenizers. So can you [17:51] share with us the drawback? Why do people [17:53] want to move away from the tokenizer? Oh, [17:56] yeah, so I think [18:00] one good example is math. If you think [18:03] about math, actually numbers right now [18:06] are not tokenized digit by digit, so for example 327 [18:08] might have its own token, which means [18:10] that when models see numbers, they [18:13] don't see them the same way as we do. And [18:15] this is very annoying, because, I [18:17] mean, the reason why we can kind of [18:18] generalize with math is because we can [18:20] deal with every letter separately [18:22] and we can then do
composition, where you [18:24] know that basically if you add stuff, [18:26] it's just the same thing as adding every [18:28] one separately, plus whatever the [18:29] unit is that you add. So they can't do that. [18:32] So then you have to do special [18:34] tokenization, and one of the big [18:36] changes that GPT-4 made is changing [18:40] the way that they tokenize code. So [18:43] for example, if you have code, you know [18:45] you often have in Python these four [18:46] spaces at the beginning. Those were dealt [18:49] with kind of strangely before, and [18:52] as a result the model couldn't [18:54] really understand how to deal with [18:56] code. So tokenizers actually matter a lot. [19:00] OK, so I'll move on right now, but we [19:03] can come back later to tokenizers. Great, [19:06] so we talked about the task, the loss, the [19:07] tokenizer, let's talk a little bit about [19:10] evaluation. So the way that LLMs are [19:12] usually evaluated is [19:14] using what we call perplexity. At a [19:17] high level it's basically just your [19:18] validation loss. The slight difference [19:20] with perplexity is that we use something [19:22] that is slightly more interpretable, [19:24] which is that we use the average per- [19:26] token loss and then you exponentiate it. [19:29] And the reason why you exponentiate it [19:31] is because, I mean, the loss has [19:33] a log inside, and, one, humans are [19:36] actually pretty bad at thinking in log [19:37] space, but two, logs depend on the base of [19:40] the log, while when you exponentiate, [19:42] you basically have everything in [19:45] kind of the vocabulary-size unit. [19:48] And the average per token is just so that [19:50] your perplexity is independent of [19:52] the length of your sequence. So [19:54] perplexity is just two to the power of the [19:56] average of the loss of the sequence. [19:59] So perplexity is between one and the [20:03] length
of the vocabulary of your [20:04] tokenizer. One is simply, well, if you [20:07] predict perfectly [20:09] every word, then every word will have [20:12] basically a product of ones, so the best [20:15] perplexity you can have is one. If you [20:17] really have no idea, you basically [20:18] predict with one divided by the size of the [20:21] vocabulary, and then you do simple [20:23] math and you basically get a perplexity of the [20:25] size of the vocabulary. So the intuition [20:27] of perplexity is that it's basically the [20:29] number of tokens that your model is kind [20:31] of hesitating between. So if [20:33] your model is perfect, it doesn't [20:34] hesitate, it knows exactly the word. If it [20:37] really has no idea, then it hesitates [20:39] between all of the [20:42] vocabulary. So perplexity really [20:45] improved. That's perplexity on a standard [20:48] dataset between 2017 and 2023. It [20:51] went from kind of 70 tokens to less than [20:54] 10 tokens over these five, six years. So [20:56] that means that the models were [20:58] previously hesitating between 70 words [21:00] every time they were generating a word, and [21:02] now they're hesitating between less [21:04] than 10 words, so that's much better. [21:06] Perplexity is actually not used anymore [21:08] in academic benchmarking, mostly because [21:10] it depends on the tokenizer that you [21:12] use and it depends on the actual data [21:14] that people are evaluating on, but it's [21:16] still very important for development of [21:18] LLMs. So when you actually train [21:20] your own LLM, people will still really [21:22] look at the [21:24] perplexity. One other common way, and [21:28] now more common in academia, of [21:30] evaluating these LLMs is just by taking [21:33] all the classical NLP benchmarks, and [21:35] I'll give you a few examples later, and [21:37] just kind of aggregating everything. [21:39] So collect as many automatically [21:41] evaluatable benchmarks as you can and just
evaluate [21:44] across all of them. So one such, [21:48] or actually two such benchmarks are [21:51] what we call HELM, which is from [21:53] Stanford, and another one is the Hugging [21:55] Face Open LLM Leaderboard, which are [21:57] probably the two most common ones right [21:58] now. So just to give you an idea, in [22:02] HELM there are all of these types of [22:03] tasks, which are mostly things that can [22:06] be easily evaluated, like question [22:09] answering, so think about many different [22:11] question answering tasks. And the [22:14] benefit with question answering is that [22:15] you usually know what the real answer is. [22:18] So the way that you evaluate [22:20] these models, and I'll give you a [22:21] concrete example in one second, is [22:23] that you can just look at how likely the [22:25] language model is to generate the real [22:28] answer compared to some other answers, [22:30] and that's essentially, at a high level, [22:32] how you evaluate these models. So to [22:34] give you a specific example, MMLU is [22:36] probably the most common academic [22:39] benchmark for [22:41] LLMs, and this is just a collection of [22:44] many questions and answers in all of [22:46] those domains, for example college [22:48] medicine, college physics, astronomy, and [22:51] these types of topics. And the questions [22:53] are things like, so this is in astronomy, [22:55] "What is true for a Type Ia supernova?" Then [22:58] you give four different potential [23:01] answers, and you just ask the model which [23:03] one is more likely. So there are many [23:05] different ways of doing it. Either you [23:07] can look at the likelihood of generating [23:09] all these answers, or you can ask the [23:11] model which one is the most likely. So [23:13] there are different ways that you can [23:14] prompt the model, but at a high level you [23:16] know which one is correct, and there are [23:17] three other mistakes. Yes? Kind... [23:22] creating
is, like, unconstrained text as [23:24] the output, yeah. How do you evaluate a [23:26] model if it gives something that's, you [23:29] know, semantically completely identical [23:32] but is not the exact token list that you [23:35] expect? Yeah, so that's a great question, [23:37] I'll talk more about that later. Here, in [23:39] this case, we don't do unconstrained. So [23:41] the way you would evaluate MMLU is [23:43] basically either you ask the first [23:46] question, and then you look at the [23:47] likelihood of the model generating A, the [23:50] likelihood of the model generating B, C, [23:53] and D, and you look at which one is the [23:54] most likely, or you can ask the model, out [23:57] of A, B, C, D, which one is the most likely, [23:59] and you look at whether the most [24:01] likely next token is A, B, C, or D. So [24:04] you constrain the model to say it can [24:06] only answer these four things. When you say [24:09] you constrain the model, do you mean you [24:11] constrain the prompt, or do you mean, of [24:13] its whole probability distribution [24:15] output, you're only comparing the outputs, [24:18] like you're only comparing the [24:20] A... So in the second case I gave you, [24:23] you would do exactly the... actually you [24:24] would do both. You would prompt the model [24:26] saying A, B, C, or D, plus you would constrain [24:28] to only look at these four [24:31] tokens. In the first case you don't even [24:33] need to generate anything. So in the [24:34] first case you literally just look, given [24:36] that it's a language model, it can give a [24:38] distribution over sentences, you just [24:40] look at what the likelihood is of [24:42] generating all of these words, what [24:45] the likelihood is of generating the second [24:47] choice, and you just look at whether the [24:49] most likely sentence is actually the [24:52] real answer. So you don't actually sample [24:55] from it, you really just use P of x1 [24:58] to xL. Does that make sense? That [25:01] being said,
evaluation of open-ended [25:04] questions is something we're going to [25:05] talk about later, and is actually really [25:07] important and really challenging. Yes? [25:10] Earlier you mentioned that [25:13] metrics like perplexity are not [25:15] usually used because it depends on [25:17] how you do your tokenization, some [25:19] design choices. I was wondering if you [25:21] could speak more to that. Oh, yeah, so [25:25] think about perplexity. I told you [25:27] perplexity is between one and the vocabulary [25:29] size. So now imagine that ChatGPT uses a [25:32] tokenizer that has like 10,000 tokens, [25:35] but Gemini from Google uses a tokenizer [25:37] that has 100,000 potential tokens. [25:41] Then actually for the Gemini one, [25:44] the upper bound of the [25:46] perplexity that you can get is actually [25:47] worse for Gemini than for ChatGPT. Does [25:51] that make sense? So that's just an idea, [25:54] it's actually a little bit more [25:55] complicated than that, but that's just [25:56] one first-order bit of how you can [25:58] see that the tokenizer actually [26:01] matters. [26:04] Great. OK, so evaluation challenges, [26:07] there are many, I'll just talk about two [26:09] really briefly. One, as I told you, [26:11] there are two ways of doing evaluation [26:13] for MMLU, actually there are many [26:15] more than two, but I gave you two [26:16] examples. And it happens that for a [26:19] long time, even though that was a very [26:21] classical benchmark that everyone used, [26:23] actually different [26:26] companies and different [26:30] organizations were actually [26:32] using different ways of evaluating MMLU, [26:35] and as a result you get [26:36] completely different results. For example, [26:38] LLaMA [26:39] 65B, which was the first model of Meta [26:42] in the LLaMA series, had on HELM 63.7 [26:47] accuracy, but on this other benchmark [26:50]
had like [26:51] 48.8. So really the way that you [26:54] evaluate, and this is not even talking [26:56] about prompting, this is really just [26:58] the way that you evaluate the [27:00] models, prompting is another issue, so [27:02] really there are a lot of [27:03] inconsistencies, it's not as easy as it [27:06] looks. First thing. Yeah, sorry? How can [27:09] we make sure that all these models aren't [27:10] trained on the benchmark? OK, second [27:13] thing, this is a great question: train- [27:15] test contamination. This is something [27:18] which I would say is really important in [27:22] academia. Given that the talk is [27:25] mostly about training large language [27:26] models, for companies it's maybe not [27:29] that important, because they know what they [27:31] trained on. For us, we have no idea, so [27:35] for us it's a real problem. So there [27:37] are many different ways of trying to [27:39] test [27:43] whether the test set was actually in the [27:44] training set. One kind of cute trick [27:48] that people in the lab, in Tatsu's lab, [27:52] have found is that what you can do is, [27:54] given that most of the datasets [27:56] online are not randomized, [27:58] and given that [28:00] what language models do is just [28:02] predict the next word, you can just [28:04] look at the entire test set: what if [28:07] you generate all the examples in order [28:10] versus all the examples in a different [28:13] order? And if it's more likely to [28:15] generate a thing in order, given that [28:17] there's no real order there, then it [28:19] means that it probably was in the training [28:20] set. Does that make sense? So [28:23] that's one of them, there are [28:24] many other ways of doing it. Train-test [28:27] contamination, again, not that important [28:28] for development, really important for [28:30] academic [28:32] benchmarking. Great, so there are many [28:34]
other challenges but uh I'll move on for [28:36] now great data um so data is another [28:41] really big topic um at a high level [28:44] people just say oh you basically train [28:46] large language models on all of internet [28:48] what does that even mean um so or people [28:51] sometimes say all of clean internet [28:53] which is even less defined um so [28:56] internet is very dirty and really not [28:58] representative of what we want in [29:00] practice if I download a random website [29:03] right now you would be shocked at what [29:05] is in there it's definitely not your [29:07] Wikipedia um so I'll go really briefly [29:12] on like what people do um I can answer [29:14] some questions but I mean data on its [29:17] own is a huge topic uh basically first [29:20] what you do is download all of internet [29:22] what that means is that you use uh web [29:24] crawlers that will go on every web page [29:27] on internet or every web page that is um [29:30] on Google uh and that is around 250 [29:34] billion pages right now um and that's [29:36] around one petabyte of of data so this [29:39] is actually Common Crawl is one web [29:42] crawler so people will usually not write [29:43] their own web crawlers what they do is [29:45] that they use standard web crawlers and [29:47] Common Crawl is one of them uh that [29:50] basically every month adds all the new [29:52] websites that were added on uh internet [29:55] that are found by by Google and they put [29:57] it in a big uh basically a big dataset [30:00] um so that's on Common Crawl you have [30:02] around 250 billion pages right now so [30:05] 1e6 gigabytes of data once you have this [30:09] uh so this is a random web page like [30:11] literally random uh from this Common [30:13] Crawl and what you see is that one it [30:15] really doesn't look like the type of things [30:17] that you would usually see but actually [30:19] so this is an HTML page uh it's hard to [30:22] see but if you look through you will
see [30:25] some content for example here here uh [30:29] tesing world is your ultimate source for [30:32] the system X high performance server and [30:34] then you have three dots so you don't [30:35] even the sentence is not even finished [30:37] that's what random internet looks like [30:40] uh so of course it's not that useful if [30:42] you just train a like large language [30:44] model to generate things like this so [30:46] what are some of the steps that are [30:47] needed first one you extract the text [30:50] from the HTML so that's what I just tried [30:52] to do by looking at uh basically the [30:54] correct text uh there are a lot of [30:56] challenges going through this for example [30:58] extracting math is actually very [31:00] complicated but pretty important for [31:01] training large language models um or for [31:04] example boilerplate a lot of your [31:06] forums will have the same type of [31:07] headers the same type of footers uh you [31:10] don't want to repeat all of this in your [31:12] data um then you will filter undesirable [31:15] content uh so not safe for work harmful [31:19] content PII uh so usually every company [31:21] has basically a a blacklist of websites [31:25] that they don't want to train the models [31:27] on that blacklist is very long and you [31:29] basically say if it comes from there we [31:31] don't train on this there are other ways [31:32] of doing these things is that you can [31:34] train a small model for classifying what [31:36] is PII removing these things um it's [31:40] hard every point here that I'm going to [31:42] show you is like a huge amount of work [31:46] uh but I'm going to go go quickly [31:47] through it so filter undesirable content [31:50] second or fourth is the [31:53] deduplication as I said um you might have [31:56] things like headers and footers in [31:58] forums that are always the same you want [32:00] to remove that another thing that you [32:01] might have is a lot of URLs that
are [32:04] different but actually show the same [32:06] website um and you might also have a lot [32:10] of like uh um paragraphs that come from [32:13] like common books that are basically [32:14] duplicated a thousand times or 10,000 [32:17] times on internet so you have to [32:19] deduplicate also very challenging uh [32:22] because you have to do that at scale [32:24] once you do deduplication you will do some [32:26] heuristic filtering you will try to [32:28] remove low quality documents uh the way [32:31] you do that are things like rules-based [32:33] um filtering for example if you see that [32:36] there are some outlier tokens if the [32:38] distribution of tokens in the website is [32:39] very different than the usual [32:40] distribution of tokens then it's [32:42] probably some outlier if you see that [32:44] the length of the words in this website [32:46] is super long there's something strange [32:48] going on on that website if you see that [32:50] the the website has only three words [32:52] maybe is it worth training on it maybe [32:54] not if it has like 10 million words [32:56] maybe there's something also [32:58] wrong going on that page um so a lot of [33:00] rules like this yes why do we filter out [33:03] undesirable content from our dataset [33:05] instead of kind [33:06] of putting it in as like a supervised [33:09] loss right like can we not just say like [33:12] you know here's this like hate speech [33:13] website let's actively try to let's [33:17] actively penalize the model for generating it [33:19] we'll do exactly that but not at this [33:22] step that's where the post-training will [33:24] come in uh pre-training um the idea is [33:28] just to say I want to model kind of how [33:32] humans speak essentially um and I want [33:35] to remove all these like headers footers [33:36] and and menus and things like this but [33:38] it's a very good uh like idea that you [33:41] just had and that's exactly what we'll [33:42] do [33:44] later next step
model-based filtering so [33:47] once you filtered a lot of data what you [33:49] will do uh that's actually a very cute [33:51] trick uh you will take all of Wikipedia [33:53] and you will look at all the links that [33:56] are linked through Wikipedia pages [33:58] because probably if something is [33:59] referenced by Wikipedia it's probably [34:01] some high quality website and you will [34:03] train a classifier to predict whether [34:06] something comes from whether a document [34:09] comes from one of these references uh [34:12] from Wikipedia or whether it's from the [34:14] random web and you will try to basically [34:16] say I want more of the things that come [34:19] from Wikipedia references does that make [34:22] sense so yeah so you will train a a [34:25] machine learning uh model usually also [34:27] very simple models because you need [34:28] to do that really at scale I mean just [34:30] think about the 250 billion [34:32] pages uh next one you will try to [34:36] classify your data into different [34:38] different um domains you will say okay [34:41] this is entertainment this is books this [34:43] is code this is like these type of [34:45] domains and then you will try to either [34:49] um up or down weight some of the domains [34:52] uh for example you might say uh you [34:54] might see that actually if you train [34:56] more on code then actually your model [34:58] becomes better at reasoning so that's [34:59] something that people usually say in a [35:01] very handwavy way if you train your [35:03] model more on code actually it helps [35:04] reasoning so you want to upweight the [35:07] coding uh distribution because that [35:09] helps for general language modeling [35:10] skills uh books is usually also another [35:13] one that people usually um upweight [35:16] entertainment they usually downweight uh [35:18] so things like this of course you want [35:20] to do it so people used to do it maybe [35:23] uh kind of heuristically now there's
[35:25] entire pipelines that we'll talk about [35:28] of how to do these things uh slightly [35:30] more um [35:32] automatically and then at the end of [35:34] training uh usually train um after [35:38] training on all of this data that we saw [35:40] usually train on very high quality data [35:42] at the end of of training your large [35:45] language model where you decrease your [35:46] learning rate uh and that basically [35:48] means that you're kind of overfitting [35:50] your model on very high quality data [35:52] so usually what you do there is like [35:54] Wikipedia you basically overfit on [35:57] Wikipedia yeah and you overfit on like [36:00] human uh data that was collected um the [36:04] other things like continual pre-training [36:06] for getting longer context I'm I'm going [36:08] to skip over all of these things uh but [36:10] I just to give you a sense of how hard [36:12] it is when people just say oh I'm going [36:13] to train on internet that's a lot of [36:16] work um and really we haven't figured it [36:18] out yet so collecting data well is a [36:22] huge part of practical large language [36:24] models uh some might say it's actually [36:26] the key yes [36:28] about data so basic question so usually [36:30] when you start with like the terabyte of [36:33] data after you go through all the steps [36:35] what's the typical amount of data you have left [36:37] and then like how how large a team does [36:40] it typically take to go through all the [36:42] steps you talk about so how is the [36:44] question how large is the data after you [36:46] filter yeah after you filter and then to [36:48] go through all the steps how large a team [36:50] do you need to go through like the the [36:52] filtration steps uh how slow is it or [36:56] how like how how many people would you [36:58] need to be able to do this uh okay [37:02] that's a great question I'm going to [37:04] somewhat answer about the data uh how [37:06] large is the dataset uh at the end
of [37:08] this slide uh for number of people that [37:12] work on [37:13] it um that's a good question I'm [37:15] actually not quite sure but I would [37:18] say yeah I actually don't quite know but I [37:22] would say it's probably even bigger than [37:24] the number of people that work on kind [37:26] of the tuning of the pre-training of [37:28] the model uh so the data is bigger than [37:31] kind of the modeling aspect um yeah I I [37:35] don't think I have a good sense I would [37:38] say probably in Llama's team which has [37:40] like 70-ish people I would say maybe [37:42] 15 work on data uh I yeah all these [37:47] things you don't need that many people [37:48] you need a lot of compute also because [37:49] for data you need a lot of CPUs um so [37:53] yeah and I'll answer the second question [37:54] at the end of this slide so as I just [37:57] kind of alluded to really we haven't [38:00] solved data at all for pre-training so [38:02] there's a lot of research that that has [38:03] to be done first how do you process [38:05] these things super efficiently uh second [38:07] how do you balance kind of like all of [38:09] these different domains uh can you do [38:11] synthetic data generation that's [38:12] actually a big one right now uh and [38:15] because we don't have uh we'll talk [38:16] about that later we don't have enough [38:18] data on the internet um can you use [38:21] multimodal data instead of just text [38:23] data and how does that improve even your [38:25] text performance um [38:28] there's a lot of secrecy because really [38:30] this is the key of most of the [38:32] pre-trained large language models so for [38:34] competitive dynamics uh usually these [38:37] these um these companies don't talk [38:40] about how they do the data collection [38:41] and also there's a copyright liability [38:43] issue they definitely don't want to tell [38:44] you that they've trained on books even [38:46] though they did um because if not you
[38:48] can uh sue them uh common academic [38:51] benchmarks uh so that will kind of [38:53] answer what you asked um it started so [38:55] those are the smaller ones it's the [38:57] names are not that important but it [38:59] started from around 150 billion tokens [39:02] which is around uh 800 GB of data now it's [39:05] around 15 trillion uh 15 trillion [39:07] tokens which is also uh the size of the [39:10] models that are right now the best [39:12] models are probably trained on that [39:13] amount of data so 15 trillion tokens uh [39:16] which is probably I guess two orders of [39:19] magnitude bigger than that so 80e3 gigabytes [39:23] so that would be [39:25] around 100 to a thousand times uh [39:28] filtering of the Common Crawl if I'm not [39:31] mistaken um so yeah one very one very uh [39:35] famous one is The Pile so this is an [39:37] academic benchmark The Pile and we [39:39] can just look at what distribution of [39:41] data they have it's things like um [39:44] arXiv PubMed Central uh which is all the [39:47] the biology stuff uh here it's Wikipedia [39:52] you see Stack Exchange um some GitHub [39:56] and some books and things like this um [39:58] again this is on the smaller side so [40:00] this is if we look at here this is on [40:01] 280B so in reality it's like 100 times [40:04] bigger so you cannot have that much of [40:05] GitHub and and of [40:07] Wikipedia um in terms of closed-source [40:10] models just to give you an idea uh Llama [40:13] 2 um it was trained on two trillion [40:16] tokens Llama 3 15 trillion tokens which [40:19] is currently the best model that we know [40:21] on how much it was trained on which is [40:22] the same thing as this the the the best [40:25] academic or the biggest academic [40:27] benchmark which is 15 trillion tokens [40:29] GPT-4 we don't really know but it's [40:30] probably in the same order of magnitude [40:32] or it's probably around that actually [40:33] it's probably around 13 um from leaks if [40:36] the
leaks are true [40:39] um great so scaling laws um any other [40:43] questions on data before we go to [40:45] scaling [40:48] laws sorry I know I'm giving you a lot [40:50] of information but uh there's a lot that goes into [40:52] training large language models great [40:55] scaling laws so so the idea is that what [40:58] people saw um around 2020 or at least [41:01] from a long time but they've been able [41:03] to kind of theoretically show it or [41:06] empirically show it since 2020 is that the [41:08] more data you train your models on and [41:10] the larger the models the better the [41:11] performance this is actually pretty [41:13] different than what you've seen in this [41:15] class in this class we teach you about [41:16] overfitting overfitting doesn't happen [41:18] with large language models uh larger [41:21] models better performance um it's [41:24] something that really took a long time [41:25] for the community who took this type of [41:27] class to realize um but for the exam [41:31] overfitting [41:32] exists so okay the idea of scaling laws [41:36] is that given that you know that more [41:38] data and larger models will always give [41:41] you better performance can we predict [41:44] how much better your performance will be [41:46] if you increase the amount of data and [41:48] the size of your model and surprisingly [41:51] it works uh so here you see three plots [41:53] from a very famous paper called scaling [41:55] laws from OpenAI um here you see on the [41:58] x-axis compute so how much did you train [42:01] like how much compute did you did you [42:02] spend for training and here you see test [42:04] loss so this is essentially I mean it's [42:07] not perplexity but it's your validation [42:08] loss um so it's a log of the perplexity [42:11] and if you put these two on uh log scale [42:15] uh then you see that uh the the [42:17] performance or like the this the sorry [42:20] the the scaling law is linear uh that [42:22] means that if you
increase your compute [42:25] by a certain amount you can you can say [42:26] by how much your test loss will actually [42:29] decrease same thing with data and same [42:32] thing for parameters if you increase the [42:34] dataset size your loss will will [42:36] decrease by an amount that is somewhat [42:39] predictable if you increase the number [42:40] of parameters the loss [42:43] will decrease by an amount which is [42:44] somewhat predictable this is really [42:46] amazing um very surprising I mean it [42:49] looks innocuous when you look at these [42:51] type of plots but that's crazy because [42:53] it means that you can predict uh how [42:55] well we're going to perform in 2 3 years [42:58] depending on how much compute we will [42:59] add assuming that these things will hold [43:01] there's nothing theoretical about it um [43:04] yes two things one what is the loss that [43:07] they're using here is this perplexity or [43:09] so it's it's you know I said perplexity [43:11] was like two to the power of the loss so [43:13] this is the the the log of the [43:16] perplexity and then the second thing is [43:18] when you like increase the number of [43:20] parameters or you increase the total [43:21] dataset size going through the data multiple times doesn't [43:25] that just inherently increase your [43:27] compute like do all this work to [43:31] just specific no this is a great [43:33] question so the compute here is actually [43:35] a factor of two things the data and the [43:37] parameters what I'm showing here is that [43:39] you can um well actually we're going to [43:40] talk about that in details but basically [43:42] if you increase the number of parameters [43:44] you should increase the amount of data [43:46] that you have um so you actually don't [43:49] go multiple times through the same data [43:50] set no one does multiple epochs at least [43:55] not yet uh because we have still kind of [43:58] enough data um so yeah this is all the [44:01] same trend
which is increase compute [44:03] decrease [44:04] loss yes have we seen the numbers for [44:07] the last two years or is it still [44:10] holding it is still holding I I don't [44:14] have like good numbers to show you uh [44:16] but it is still holding [44:20] surprisingly yes is there no evidence [44:22] like empirical evidence that you [44:25] plateau or are expected to plateau [44:28] no empirical evidence of plateauing [44:30] anytime soon um why we don't know um [44:36] will it happen probably I mean it [44:38] doesn't need to because it's actually in [44:39] log scale so it's not like as if it had [44:43] to go it had to plateau like [44:45] mathematically it could continue [44:46] decreasing like this I mean most people [44:48] think that it will probably plateau at [44:49] some point we don't know [44:51] when um okay so that's I'll talk more [44:55] about scaling laws now [44:57] so why are scaling laws really cool [45:00] imagine that I give you um you're very [45:02] fortunate I gave you 10,000 GPUs for [45:04] this month what model will you train how [45:07] do you even go about answering that [45:08] question and I mean this is a a [45:11] hypothetical but that's exactly what [45:13] these companies are faced with uh the [45:16] old pipeline um which was basically you [45:19] tune hyperparameters on the big models [45:21] so let's say I have 30 days I will train [45:24] 30 models for one day each I will pick [45:27] the best one uh and that will be the [45:29] final model that I will use in [45:31] production um that means that the model [45:33] that I actually used was only trained [45:34] for one day the new pipeline is that you [45:38] first find a scaling recipe so you find [45:40] something that tells you for example oh [45:43] like one common thing is that if you [45:44] increase the size of your model you [45:46] should decrease your learning rate so [45:47] you find a scaling recipe such that you [45:49] know if I increase the the the the size [45:52] of my
model here's what I should do with [45:53] some hyperparameters then you tune your [45:56] hyperparameters [45:57] on smaller models of different sizes [46:00] let's say I will say for 3 days of my 30 [46:03] days I will train many different models [46:05] and I would do hyperparameter tuning [46:07] on these small models each of different [46:08] sizes then I will fit a scaling law and [46:11] try to extrapolate from these smaller [46:14] models which one will be the best if I [46:17] if I train it for much longer or sorry [46:20] if I train it for a larger model and [46:23] then I will train the final huge model [46:24] for 27 days instead of just one day [46:27] um so the new pipeline is not train [46:31] things or do hyperparameter tuning on the [46:33] real scale of the model that you're [46:34] going to use in practice but do things [46:36] on smaller ones at different scales try [46:39] to predict how well they will perform [46:41] once you make them bigger I will give I [46:43] will give you a very concrete example [46:45] right now uh let's say Transformers [46:48] versus LSTMs let's say you you have [46:50] these 10,000 GPUs you're not sure [46:52] which one you should be using should I [46:53] be using Transformer-based model or LSTM- [46:55] based model what I will do is I will [46:57] train Transformers at different scales [47:00] so here you see different parameters on [47:01] the x-axis y-axis is my test loss I will [47:04] then train different different LSTMs at [47:07] different scales once I have these [47:09] points I will see oh it kind of fits a [47:11] scaling law I will fit my scaling law [47:13] and then I will be able to predict oh if [47:16] I had 10 times more compute here's how [47:19] well I would perform for the LSTM it's [47:21] actually slightly less linear for the [47:22] LSTM but like you could probably try to [47:25] predict where you would end up and [47:26] clearly from this plot you would see [47:28] that Transformers are better um
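A minimal sketch of the fit-and-extrapolate step just described, assuming the scaling law is a straight line in log-log space. The model sizes and losses below are synthetic numbers made up for illustration (they are not read off the lecture's plot), and the helper names `fit_power_law` and `predict_loss` are hypothetical.

```python
import math


def fit_power_law(sizes, losses):
    """Least-squares fit of log(loss) = intercept + slope * log(size)."""
    log_x = [math.log(s) for s in sizes]
    log_y = [math.log(l) for l in losses]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((x - mean_x) ** 2 for x in log_x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_x, log_y))
    slope = sxy / sxx                     # the "scaling rate"
    intercept = mean_y - slope * mean_x   # the offset of the line
    return slope, intercept


def predict_loss(slope, intercept, size):
    """Extrapolate the fitted line to a model size that was never trained."""
    return math.exp(intercept + slope * math.log(size))


# Synthetic small-scale runs (ASSUMPTION: invented values, exact power laws).
sizes = [1e6, 1e7, 1e8, 1e9]
transformer_losses = [12.0 * s ** -0.08 for s in sizes]
lstm_losses = [11.0 * s ** -0.06 for s in sizes]

t_slope, t_intercept = fit_power_law(sizes, transformer_losses)
l_slope, l_intercept = fit_power_law(sizes, lstm_losses)

# Compare both families at a scale 10x larger than anything trained above.
target_size = 1e10
t_pred = predict_loss(t_slope, t_intercept, target_size)
l_pred = predict_loss(l_slope, l_intercept, target_size)
print(f"transformer: slope={t_slope:.3f}, predicted loss={t_pred:.3f}")
print(f"lstm:        slope={l_slope:.3f}, predicted loss={l_pred:.3f}")
```

Real runs are noisy, so in practice the points only roughly follow the line and the extrapolation carries uncertainty that this sketch does not model.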
one [47:31] thing to notice when you read these type [47:32] of scaling laws is that there are two things [47:34] that are important uh one is really your [47:38] scaling rate uh which is kind of the uh [47:42] the slope of the the slope of the [47:45] scaling law the other thing is your um [47:48] your intercept like you could start [47:51] worse but actually become better over [47:53] time it just happens that LSTMs are [47:55] worse for both uh but I could show you [47:57] another one where things you can predict [47:59] that actually after a certain scale [48:01] you're better off using that type of [48:03] model than others uh so that's why [48:05] scaling laws are actually really [48:07] useful any questions on [48:11] that yeah so these are all kind of very [48:15] how how sensitive are these to like [48:17] small differences in the architecture [48:18] like one one like Transformer [48:21] architecture versus another Transformer [48:23] architecture you basically have to like [48:25] fit your own curve and make basically [48:27] say like oh scaling laws tell me [48:28] there should be some like logarithmic [48:30] function let me extrapolate that for my [48:34] own yeah so uh usually for example if [48:37] you're an academic and you want to now [48:39] at least that's like pretty recent and [48:41] you want to propose a new like [48:42] activation uh that's exactly what you [48:44] will do you will fit a scaling law show [48:46] another scaling law with the standard [48:48] like I don't know GELU and you will say [48:50] that it's better in reality once you [48:51] start thinking about it in scaling law [48:53] terms you really realize that actually [48:55] all the architecture differences that we [48:57] can make like the small minor ones all [48:59] they do is maybe change a little bit the [49:01] the [49:02] intercept but really that doesn't matter [49:05] uh cuz just train it for 10 hours longer [49:07] or like wait for the next uh for the [49:09] next
GPUs and these things are [49:11] really secondary which is exactly why I [49:12] was telling you originally people spend [49:14] too much time on the architecture and [49:15] losses um in reality these things don't [49:18] matter as much data though if you use [49:20] good data you will have much better [49:22] scaling laws than if you use bad data so [49:24] that really matters [49:27] uh another really cool thing you can do [49:28] with scaling laws is that you can ask [49:30] yourself uh how to optimally allocate [49:33] training resources should I train larger [49:36] models because we saw that it's better [49:38] when you train larger models but we saw [49:40] that it's also better when you use more [49:41] data so which one should I do should I [49:44] just train on more data a smaller model [49:45] or should I train a larger model on less [49:47] data um so Chinchilla is a very famous [49:52] paper that first showed this uh the way [49:54] they did it I want to give you a little [49:55] bit of a sense of what these plots are [49:58] uh here you see training loss again on [50:00] the x-axis you see parameter parameter [50:02] differences uh sorry parameter size uh [50:04] number of parameters so the size of the [50:05] model and here all these curves are what [50:07] we call isoFLOPs which is that all the [50:11] models on this curve have been trained [50:14] with the same amount of [50:16] compute um the way that you do that is [50:18] that you train you change sorry you vary [50:20] the number of tokens that were trained on [50:22] and the size of the models but you vary [50:24] in such a way that the total compute is [50:26] constant [50:27] okay so all these curves that you see [50:28] with different colors have different [50:30] amounts of compute that they were trained on [50:32] then you take the best one for each of [50:34] those curves once you have the best one [50:36] for each of those curves um you can ask [50:40] you can plot um how many flops it was
[50:44] and which curve were you on and how many [50:46] parameters did you actually use for [50:49] training that specific point you put [50:51] that on the on the log log uh scale [50:54] again and now you fit a scaling law [50:56] again so now I have something which [50:58] tells me if I want to train a model of [51:01] 10^23 flops here's exactly the number [51:04] of parameters that I should be using 100 [51:07] 100B and you can do the same thing with [51:09] flops and [51:10] tokens so now you can predict if if I [51:14] tell you exactly I have one month of [51:16] compute what size of model should I be [51:18] training fit your scaling law and I tell [51:20] you um of course that all looks [51:23] beautiful in reality like there's like [51:25] there's a lot of like small things of [51:26] like should you be counting like [51:27] embedding parameters like there's [51:29] there's a lot of complexities but if you [51:31] do things well these things actually do [51:34] hold um so the optimal number of [51:37] parameters that the Chinchilla paper has [51:39] found is to use 20 tokens for every [51:42] parameter that you train uh so if you [51:44] add one more parameter you should add [51:45] you should train your thing on your [51:47] model on 20 more tokens so one caveat [51:50] here is that this is optimal training [51:52] resources so that is telling me if you [51:54] have 10^23 flops [51:57] or if you have like 100 I don't know how [51:58] much that is $100 million or 10 no that's [52:02] much less actually let's say I have $5 [52:03] million to to train my best model that [52:06] gets the lowest loss how how what would [52:08] I train on in reality these companies [52:11] need to think about inference also if [52:13] you have a smaller model they will spend [52:16] less over time um so actually if you [52:18] consider the inference cost you have [52:20] other papers that tried to show that um [52:22] it's around [52:24] 150 uh parameters per sorry tokens per
[52:27] parameter because you prefer having a [52:29] smaller model cuz over time you're going [52:32] to you're going to actually um spend [52:35] less money on inference of these models [52:37] so 150 to one that's around what the [52:40] best models are trained on right now at [52:43] least the ones that are that are used um [52:46] in practice for in [52:49] production [52:51] great any question on [52:55] Chinchilla great oh sorry in practice how [52:58] expensive is inference for these models [53:01] relative to [53:02] training actually very expensive uh I will [53:05] not talk about inference because that [53:06] would be another entire lecture but just [53:09] think about ChatGPT where they have I [53:12] don't know how much it is now like 600 [53:14] million people that used it um like [53:19] that's a lot [53:21] um yeah so it's actually very expensive [53:24] there's a lot of optimization you can do [53:26] for inference though um and that's an entire [53:28] other lecture so I'm going to skip that [53:30] uh this time but it's very [53:32] interesting okay tuning um as I said [53:35] there are many things that you can uh [53:37] answer with scaling laws I just tried to [53:39] give you two examples uh but really [53:41] there are many things what data do you [53:43] use what mixture what data mixing [53:45] weighting you use data mixtures that's [53:47] what we talked about before uh what [53:49] architecture you use whether you should [53:51] make your models uh wider or deeper um [53:54] should you be paying for more GPUs or [53:56] actually collecting more data um all [53:59] these things are things you can try to [54:00] answer with scaling [54:02] laws one thing I want to say is the bitter [54:05] lesson if you ever heard of Richard [54:07] Sutton a very famous blog post in 2019 [54:10] um what he realized uh which I think not [54:15] enough people realize I [54:17] definitely did not realize at that time [54:19] um is that once you see these type of [54:21]
scaling laws you know that the more [54:23] compute you have the better models you [54:25] will get so with scale you will get [54:27] better models and you also know by Moore's law [54:30] or these type of variants of Moore's law that [54:32] you will always have better compute then [54:34] the only thing that matters is just to [54:37] have architectures that can leverage [54:39] computation so what matters is basically [54:41] systems data and less so the [54:44] architecture like the small architecture [54:46] differences like your your your [54:47] activation and things like this uh so I [54:50] think that's like one of the reasons why [54:51] most of research focuses on um some [54:54] things that for industry matter less [54:56] and I was one of those researchers for a [54:59] large part of my my career um so don't [55:02] spend time over complicating do the [55:05] simple things do it well scale them [55:08] that's really what OpenAI taught us with [55:10] um with ChatGPT and with all the GPTs [55:14] before okay I want to give you some [55:16] back of the envelope computation so I [55:19] might be off by a few factors here but I [55:20] just want to give you a sense of how [55:22] costly it is to train some of these [55:24] models I'll give as an example [55:26] Llama 3 400B which is currently the best [55:29] open source model that you can get uh it [55:32] was trained on 15.6 trillion tokens it has 405 [55:36] billion parameters so just now that you [55:38] know what is like this uh optimal tokens [55:41] per parameter that's around 40 so that's [55:43] a little bit more than Chinchilla but [55:45] less than this like inference uh optimal [55:49] um model so they went for training [55:52] optimality uh flops for this model so [55:55] one simple uh way to compute flops is [55:57] six uh times the number of parameters [56:00] times the number of data you train on uh [56:03] so if you do the simple calculation here [56:04] it's 3.8e25 flops the reason why this [56:08] is
important is that if you follow a [56:10] little bit the news there's an executive [56:11] order from Biden that basically says [56:13] that once you have uh 1e26 parameters [56:17] uh sorry flops uh then you have special [56:20] scrutiny on your models so they went 2x [56:22] less than that so they really went right [56:24] below this to not have special scrutiny [56:27] so 3.8 uh I might be off by a little bit [56:29] but it's definitely under the [56:34] 1e26 oh um so uh P is parameters N is [56:40] data number of tokens this is a uh this [56:43] is just an [56:44] approximation we [56:47] yeah okay uh compute and we know that [56:51] they trained on 16,000 [56:53] H100s um and we know the throughput [56:56] they they said it too uh so if you do [56:59] the computation it takes around 70 days [57:02] um or 26 million GPU hours at least [57:05] that's with my uh back of the envelope [57:07] computation they actually said that they [57:09] used 30 million instead of 26 million GPU [57:12] hours um so maybe they had like some uh [57:16] some challenges I don't really know but [57:18] if you follow the simple computation [57:20] it's around 70 days um cost uh I mean [57:24] this it's hard to to approximate but I'm [57:27] just going to say it's kind of the rent [57:29] like what if I were to rent H100s that [57:32] many H100s for that many days how much [57:35] will I pay uh H100 a lower bound on the [57:38] on the renting uh cost of H100 is around [57:41] 2 hours uh $2 per hour so if you [57:44] multiply this by 26 million uh hours uh [57:48] you get 52 million uh dollars so they [57:51] probably pay less than that but not [57:53] actually much less because all these um [57:57] all these services that actually rent [57:58] GPUs they don't make that much money so [58:00] it's it's probably slightly less but not [58:02] that much less um now salary I said 50 [58:06] employees 500k per [58:09] year say yeah it's probably the right [58:11] ballpark 25 million
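The arithmetic above can be sketched as a few lines of code. The parameter count, token count, GPU count, rental price and salary figures are the ones quoted in the lecture; the per-GPU throughput of 400 TFLOP/s is an assumption (the lecture only says the throughput is known), picked because it reproduces the "around 70 days" figure.

```python
# Back-of-the-envelope training cost, using the figures quoted in the lecture.
PARAMS = 405e9                  # P: number of model parameters
TOKENS = 15.6e12                # N: number of training tokens
NUM_GPUS = 16_000               # H100s used for training
# ASSUMPTION: sustained throughput per GPU. Not stated in the lecture; chosen
# so that the result is consistent with the quoted "around 70 days".
FLOPS_PER_GPU_PER_SEC = 400e12
DOLLARS_PER_GPU_HOUR = 2.0      # quoted lower bound on H100 rental price
NUM_EMPLOYEES = 50
SALARY_PER_YEAR = 500_000


def estimate_training_cost():
    """Return the main quantities of the estimate as a dict."""
    total_flops = 6 * PARAMS * TOKENS               # FLOPs ~= 6 * P * N
    seconds = total_flops / (NUM_GPUS * FLOPS_PER_GPU_PER_SEC)
    gpu_hours = NUM_GPUS * seconds / 3_600
    compute_dollars = gpu_hours * DOLLARS_PER_GPU_HOUR
    salary_dollars = NUM_EMPLOYEES * SALARY_PER_YEAR
    return {
        "tokens_per_param": TOKENS / PARAMS,
        "total_flops": total_flops,
        "days": seconds / 86_400,
        "gpu_hours": gpu_hours,
        "compute_dollars": compute_dollars,
        "salary_dollars": salary_dollars,
        "total_dollars": compute_dollars + salary_dollars,
    }


estimate = estimate_training_cost()
for name, value in estimate.items():
    print(f"{name}: {value:.3g}")
```

Changing the assumed throughput moves the days, GPU hours and rental cost proportionally, which is one reason the estimate is only good to within a few tens of percent.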
uh so if you put it all [58:14] together around 75 million um dollars [58:17] for [58:18] training uh this Llama model I'm [58:21] probably off by like 10 million but but [58:23] that's kind of the right uh ballpark [58:27] carbon emitted um a lot of people might [58:30] ask like also the cost is not the only [58:32] thing that is important so I did the [58:34] computation um it's around 4 uh 4,000 um [58:40] tons of CO2 equivalent that is actually [58:43] only 2,000 return tickets from JFK to uh [58:46] London so right now uh carbon emitted is [58:50] actually not uh I mean it's huge but [58:52] it's not like um meaningful yet I [58:56] think in maybe GPT-6 GPT-7 once you [59:01] multiply this by 100 that might become a [59:03] real issue right now it's still not uh I [59:06] think um an issue in the grand scheme of [59:08] things next model the way you should be [59:11] thinking about these models is that [59:12] every new generation the number of FLOPs [59:15] essentially uh multiplies 10x or at [59:17] least that's what they try uh if they [59:19] have enough energy and if they can buy [59:21] enough [59:22] GPUs uh great any question on this back-of-the-envelope [59:25] math [59:29] no [59:31] okay so now we talked about pre-training [59:34] I wanted to also chat about systems [59:36] because now we know compute is really [59:38] important so there's a question of how [59:39] do you optimize the how do you optimize [59:42] your compute I will leave that for the [59:44] end because I'm not sure how much time [59:45] we will have I think it's important but [59:47] hopefully I I'll be able to to talk [59:49] about it later it's slightly different [59:51] than what we've been talking about right [59:53] now so I'll move on to post-training for [59:55] now [59:56] so the task of post-training uh the [59:59] reason why we need to do post-training [60:01] is as I told you before um it's to make [60:04] AI assistants so language modeling is [60:07] not uh really the
thing that you want [60:10] when you have an AI assistant uh for [60:12] example if you ask GPT-3 which is [60:15] purely a language model a pure language [60:17] model not a um not an aligned one if you [60:20] ask a question like explain the moon [60:22] landing to a [60:23] six-year-old the completion that you [60:25] would get is something like explain the [60:27] theory of gravity to a six-year-old [60:29] because what it learned is that on the [60:31] internet if you have one question you [60:33] usually have maybe another bullet point [60:35] of other similar questions you don't [60:37] usually have a question and then an answer [60:38] later uh this is not what you want from [60:41] an AI assistant so how do we uh do this [60:45] alignment which is this post-training [60:46] and making these models [60:48] assistants um so the goal of this [60:51] alignment is to basically get LLMs to follow [60:54] the instructions that are given um by [60:56] users and and maybe some designers' [61:00] desires um so think about moderation [61:03] you don't want the model like OpenAI [61:05] definitely doesn't want the model to say [61:07] stuff that is very [61:08] toxic um so here you see on the left [61:11] hand side uh that when you ask a [61:13] question it actually provides a a real [61:15] answer so it's not like uh before with the [61:17] LLM and on the right hand side you see [61:19] that if you ask it to write a [61:21] tweet describing how a certain part of [61:25] the population is evil it will say that [61:27] it cannot do that um so that's kind of [61:31] this [61:31] alignment uh the background here is that [61:35] uh basically the data that you want for [61:38] training some of these models um is like [61:41] we know what we want which is just [61:43] asking humans this is a question this is [61:44] the answer that you want uh but the [61:46] thing is that it's very expensive to [61:48] collect that data and it's hard to find [61:49] it online uh in
contrast pre-training [61:52] data is not what you want but there's a [61:55] lot of it um so what what we will do [61:57] the main idea is simply take a pre-trained [62:00] large language model pre-trained on all of the [62:02] internet and then you just fine-tune so [62:03] you just change the weights a little bit [62:05] on the type of data that you actually [62:06] want and hopefully given that you already [62:08] pre-trained it on all of the internet it [62:10] basically learns or knows how to speak [62:12] in English and and knows standard um [62:16] language syntax uh then you can really [62:19] fine-tune it with very little [62:22] data okay SFT so supervised fine-tuning is [62:26] really exactly what I just said which is [62:27] the idea of fine-tuning the large [62:29] language model on uh basically the [62:32] desired answers that are collected from [62:34] humans um so why is it called supervised [62:37] fine-tuning because you basically want [62:38] to do language modeling on the real [62:41] answers so language modeling is this like [62:42] next word prediction and and that's the [62:44] fine-tuning part and then you want to do [62:46] it on desired answers given by humans so [62:48] that's why we call it [62:50] supervised so how do we collect this data [62:52] well I just said it you just ask [62:54] humans uh to to tell you this is the [62:56] this is a question this is the answer [62:58] that you uh you would want from some of [63:00] these models so this is an example um [63:03] sorry I can't read very well on my [63:04] computer but uh my kid uh needs to do a [63:07] science um no let's read this one can [63:09] you write a short introduction about the [63:12] relevance of the term monopsony and then [63:14] it says monopsony refers to a market [63:15] structure blah blah blah and that's a [63:16] human that wrote that um so actually [63:19] this is OpenAssistant which was a a way [63:22] to collect um uh data online by [63:27] humans so this type of supervised
fine [63:30] tuning or alignment is really the key to [63:32] ChatGPT this is what made uh the big [63:35] jump from GPT-3 which was mostly [63:37] something that was known by AI [63:39] researchers to ChatGPT which became [63:42] known by basically [63:44] everyone [63:46] um so the problem with uh human data is [63:51] that it's uh very slow to collect and [63:54] very expensive um so [63:57] one possible simple idea is to use LLMs [64:01] to scale data collection uh so that's [64:03] exactly what we did with Alpaca uh one [64:06] year ago what we did is that we asked uh [64:08] humans or we used a dataset of human uh [64:11] question answers so there were 175 uh [64:14] question answers here and we asked the [64:15] best model at the time so text-davinci-003 to [64:18] basically generate many more of these [64:21] questions and answers so all we did is [64:23] like this is what humans would write now [64:25] write similar answers and similar [64:26] questions and we collected 52,000 LLM-generated [64:30] question answers and then what [64:32] we did is simply we took LLaMA 7B which [64:34] was the best pre-trained model at the time [64:36] and we just fine-tuned this with [64:38] supervised fine-tuning as I told you and [64:40] that's how we got um the Alpaca 7B [64:43] model uh and this is the type of data [64:46] that we collected so things like what [64:48] does algorithm mean an algorithm is a [64:50] step-by-step uh set of [64:52] instructions used to solve a problem or [64:54] achieve a goal blah blah blah blah so [64:56] the data is not actually it's actually [64:58] pretty good given it was generated by [65:00] LLMs from essentially two generations ago [65:04] um so that really started at least for [65:07] us kind of as an academic replication of [65:09] ChatGPT uh now it really there's a big [65:12] field of like synthetic data generation [65:14] of how to use LLMs to basically make [65:18] development of LLMs faster um and by [65:21] basically by decreasing
the amount of [65:23] human hours that you need [65:27] quantity of data so we talked about what [65:29] type of data and how we collect it um [65:31] one thing which is surprising with SFT [65:33] is that you don't need that much data uh [65:36] so what this paper showed this is called [65:38] LIMA is that if you have if you scale [65:41] the amount of data that you use for uh [65:43] supervised fine-tuning from 2,000 to [65:45] 32,000 it really doesn't help much so [65:48] here scaling laws definitely don't help [65:50] um so the the intuition here is that all [65:53] you learn um is is you learn how to [65:56] format your desired answers another way [65:59] of saying it is that your pre-trained [66:01] models they essentially model the [66:03] distribution of every user on the internet [66:05] one that might write bullet points [66:07] another one that might answer a [66:09] question with an answer so all you tell [66:11] your model is like wait you should [66:13] actually be optimizing more for this [66:15] type of user than another one so you're [66:17] not actually teaching it you're not [66:19] teaching anything through this um SFT uh [66:23] so supervised fine-tuning all you do is [66:25] you tell the model to kind of optimize [66:27] for one type of user that it saw already [66:29] in the pre-training dataset so the knowledge [66:31] is already in the pre-trained LLM uh and [66:33] you basically just specialize to one [66:35] type of [66:36] user great any question on [66:40] SFT yes so I know it's a big issue with [66:44] synthetic data where uh if you keep [66:47] generating data from the same [66:49] distribution eventually you're not [66:50] learning a new distribution you're [66:51] essentially playing with it it's just [66:52] bootstrapping that yeah surely [66:56] you can't scale that forever right you [66:57] can't keep going on and generating from [66:59] the same distribution and hope to learn [67:01] something new yeah uh so are there it's
[67:03] an active area of research but any [67:05] thoughts that you have around how people [67:07] are maybe thinking around this and uh [67:10] better ways to bootstrap or to give up [67:12] on this idea and and realize that the [67:14] chart shows you don't need that many so [67:16] just get humans to generate 2,000 really [67:18] good uh yeah so that's a very good [67:20] question uh so for the data stuff so I'm [67:23] saying it's not that important for SFT [67:24] but there will be another thing we'll [67:25] talk about right after where actually [67:28] data does [67:29] matter my intuition based on not that [67:32] many empirical results is that you can [67:34] still get um even though you use your [67:37] LLMs if you use purely LLM-generated text [67:40] and you do that for like three four [67:42] generations of LLMs I agree with you [67:43] that probably you won't improve much but [67:46] for me what is important is how do you [67:47] use like human in the loop with LLMs not [67:50] purely LLMs not purely uh humans but [67:53] maybe what you can do is just have the [67:54] model generate some new text and just have uh [67:57] humans write a few edits edits are much [67:59] faster than writing the entire text and [68:02] I think that if you have that type of [68:03] collaboration then from like kind of an [68:05] information theoretical point of view [68:07] you still get additional information but [68:09] you're still much faster than if you use [68:11] humans and I think that as a field we'll [68:13] probably move towards these type of [68:14] things uh which is um really just [68:17] finding the examples that are important [68:19] and and asking humans it's kind of [68:21] active learning just asking humans [68:22] exactly when uh you need to to get [68:27] inputs yes do we train with like the [68:29] same loss function the same like general [68:32] training algorithm for the supervised fine-tuning [68:33] bit as we do for the for the [68:35] pre-training right because
like the [68:37] examples you showed I think the the [68:39] important thing of the good examples is [68:43] they're like super accurate there's [68:45] these more complex still just like chain [68:48] same so that's why here I yeah I didn't [68:51] maybe didn't emphasize enough this is [68:52] just language modeling fine-tune the LLM [68:54] with language modeling on the desired [68:56] answers so this is literally the same [68:57] loss um it will be different in two [69:01] seconds but the first step of SFT is [69:03] literally the same loss where you just [69:05] say okay I want to actually specialize [69:07] on that type of data so there's even a [69:09] question of like what is pre-training [69:10] what is post-training because in reality [69:12] it's just like different data that you [69:13] use the reason why we usually call it [69:15] post-training is that the way we collect [69:16] that data is very [69:18] different great great questions uh yes [69:22] maybe it's the same question but why [69:23] would these 2,000 examples have such an [69:26] overweighted [69:28] influence you tune so that's why we uh [69:31] also that's another reason why we call [69:33] it post-training is that we use [69:34] different types of hyperparameters so [69:35] you know I told you basically at the end [69:37] of pre-training you essentially end up [69:38] with a learning rate of zero and here [69:40] you're going to increase your learning [69:41] rate so like 1e-5 1e yeah and and [69:44] so um the weight that you give to them [69:47] is actually [69:49] different [69:51] um okay uh second step or second part of [69:56] this post-training um is what we call [69:59] reinforcement learning from human [70:00] feedback or RLHF uh some of you might [70:03] have heard of that um the idea is that [70:06] SFT has a problem namely that uh you do [70:09] behavioral cloning which means that you [70:11] just try to clone what the humans would [70:14] say and that has many
issues [70:16] one of them is that you're bound by [70:18] human abilities so if um like humans [70:23] actually humans won't generate the [70:26] things that they think are actually the [70:27] best thing to generate so if you ask me [70:29] to write a book I mean I can definitely [70:31] enjoy a book I can probably say one book [70:33] is better than another but I'm [70:34] definitely not going to be as good at [70:36] writing the book that I want to read uh [70:38] so you're going to be bound by the human [70:39] ability to generate things even though [70:41] the humans might be better at [70:42] distinguishing between things that's one [70:44] issue issue number two uh I find that [70:46] actually pretty interesting is that it [70:48] might if you ever heard of the word [70:50] hallucination so this is LLMs generating [70:53] like false information [70:56] hallucination people have um [70:58] hypothesized that that can come from the [71:00] supervised fine-tuning even if you do [71:02] supervised fine-tuning on data that is [71:05] correct and the reason why that is is [71:08] that uh given I told you that [71:10] basically SFT is with very little data [71:13] and it's with data that the [71:15] model doesn't learn anything new from so what [71:17] if the human gives an answer that the [71:20] model didn't know was true from the [71:23] model's perspective the human [71:25] basically is telling the the model uh [71:28] generate this thing that seems plausible [71:31] but you actually have no idea if it's true [71:32] or not um so just to give you a very [71:35] concrete example if we go back to this [71:37] uh monopsony example can you write blah [71:39] blah blah about monopsony uh imagine [71:42] that a human uh wrote a reference to [71:44] this type of book um and that book might [71:47] exist that might be a correct reference [71:49] but what if the LLM never saw this [71:51] reference during pre-training then it [71:53] doesn't know that
it's a correct [71:54] reference so really what you tell the [71:55] model is to generate or make up some [71:58] plausible-sounding reference um rather [72:01] than actually tell the real reference [72:03] that it saw during pre-training uh so [72:06] hallucination might be um uh like [72:10] might be caused by this SFT that's [72:12] problem number two does that all make [72:14] sense great problem number three price [72:18] generating the ideal answers is very [72:21] pricey and that comes back to your [72:22] question um of like humans writing [72:25] answers is actually pretty [72:27] expensive um so that's where RLHF comes [72:29] in the idea is that instead of cloning [72:32] the behaviors of humans we're going to [72:34] maximize human preference um and the way [72:37] we're going to do that so the pipeline [72:39] is that for every [72:41] instruction you're going to ask a model [72:43] to generate two answers um and usually [72:47] you use a pretty good model so you usually [72:48] don't use a base LLM here you use an SFT uh [72:52] fine-tuned you use a fine-tuned LLM [72:54] already to give like pretty good answers [72:57] and then you ask labelers which of these [72:59] two answers was better so select the [73:01] preferred one and then with different [73:04] types of algorithms we're going to talk [73:05] about the algorithms um you just [73:07] fine-tune the model to generate more of [73:09] the green thing than the red thing so [73:10] more of the good stuff uh so now the [73:13] question is how and we're going to talk [73:14] about that right [73:16] now so there are two ways that we're [73:19] going to talk about and two that are [73:20] mainly used in the community um the [73:23] first one is simply the idea of using [73:25] reinforcement learning so hopefully you [73:26] all know what reinforcement learning is [73:28] now um so when you think about using [73:32] reinforcement learning one important [73:33] question is like what is the
reward that [73:35] we're optimizing uh so in this case [73:37] there are really two options that I [73:38] could think about the first one you [73:40] could just say I'm going to compare the [73:42] output generated by some baseline and the [73:44] output generated by my model uh and I'm [73:46] just going to ask the human to say which [73:48] one is better and I'm going to use this [73:51] as a reward so if I'm better than the [73:52] baseline this is a plus one if not it's [73:54] a minus one uh so now it's a binary [73:56] reward the problem with binary reward is [73:58] that it's very sparse and you don't get [74:00] much information out of it uh like maybe [74:02] your answer was slightly better maybe it [74:04] was like way better and you don't really [74:07] know from this um how much better it was [74:11] so option two is that you can train what [74:13] we call a reward model which is simply a [74:15] classifier uh so you use machine [74:17] learning to to classify how much better [74:21] uh one of two outputs is from [74:24] the perspective of the human um so [74:27] this is a little bit meta but what you [74:29] basically do is that you train uh you [74:31] take um a reward model R which is a uh [74:34] just a large also a large um a large [74:38] classifier and you basically ask this [74:40] reward model you give it the input and [74:42] the actual output that you have one of [74:44] the two outputs uh and you just um [74:47] exponentiate that so that's the softmax [74:49] loss that you all know about and now you [74:51] divide by um the the exponentiated [74:55] reward uh on the first example sorry on [74:59] the first output and this is on the [75:00] second output and you basically train so [75:02] the reason why you do that is that you [75:04] train your your model you train this [75:06] reward model to be able to classify um [75:10] how much better one output is than another [75:13] one so another uh slightly less [75:15] convoluted way of
saying it is that your [75:16] reward model will output some reward [75:19] that will be used as the logits of your [75:21] softmax so now if you have high logits [75:25] in your softmax it means that it's highly [75:27] likely this um output is [75:31] better uh so that's what we call the Bradley-Terry [75:33] model yes is this reward model going [75:36] over the entire output or is it [75:39] going um so this takes the [75:43] entire uh yeah this takes the entire [75:46] output at once so it takes all the input [75:47] and all the output and it gives one [75:49] number [75:52] yes would human be sorry with the reward [75:56] model where would a human be like oh I [75:59] see okay sorry maybe I wasn't clear um [76:03] you train this reward model to fit this [76:06] green and and red preference from humans [76:09] so basically you train a classifier to [76:12] say whether the humans prefer red or [76:14] green uh but instead of using the binary [76:17] reward which is what the human would [76:19] tell you you basically use the logits of [76:22] the softmax and the thing with the [76:23] logits is that logits are [76:25] continuous so now you know that if your [76:27] reward model said it has high logits [76:30] then in some ways the humans highly [76:32] prefer this answer to some other [76:36] answer great um so as I just said [76:39] continuous information so it's better so [76:41] that's what people uh use in practice or [76:43] at least used to use in practice I'll [76:45] tell you about uh the other algorithm [76:47] later uh so what you do at the end is [76:49] that you basically try to just use [76:51] reinforcement learning that you know [76:53] about now we know we have a reward what [76:55] you sample through is the generation [76:58] from your large language model um and [77:00] then you just use some regularization [77:01] term so the reason why you do this [77:03] regularization term is for avoiding what [77:05] we call over-optimization so this reward
[77:07] model might not really represent like [77:09] might not perfectly model human [77:11] preferences so you don't want to [77:12] maximize this thing to essentially [77:15] infinity um and you do it using uh PPO [77:19] which is a common uh reinforcement [77:22] learning algorithm um one thing to note [77:25] here because it will be important for [77:26] later is that when we use maximum [77:29] likelihood [77:31] um sorry now the large language models [77:35] are actually a policy for your [77:37] reinforcement learning it's not [77:39] maximizing maximum likelihood anymore [77:41] which means that you're not modeling any [77:42] distribution anymore and the reason why [77:44] this is important is that models that [77:46] went through this type of PPO actually [77:49] don't give you likelihoods of text that [77:51] are meaningful because what you optimize [77:53] them to do is basically just optimize [77:55] for generating the most likely thing not [77:58] optimize for modeling like all the [78:00] answers that humans might say another [78:02] way of saying that is that there's [78:04] nothing here that incentivizes the model [78:06] to not give a like a um a single [78:10] possible generation nothing here says [78:13] it's good if you have some distribution [78:15] with some [78:16] entropy um okay if you haven't followed [78:18] it's not that important but just good to [78:21] know great so PPO is exactly what ChatGPT [78:26] did originally so here on their blog [78:28] post what they have is step one do [78:32] supervised fine-tuning which now you [78:33] all know about step two train a reward [78:36] model on human preferences step three do [78:39] PPO multiple steps which is where you see [78:41] this this blue arrow so you continue you [78:43] train the model once with PPO you collect [78:45] new data you continue uh [78:48] and that's exactly what ChatGPT did uh [78:50] that was the big breakthrough between GPT-3 [78:53] and ChatGPT
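To make the reward-model step a bit more concrete, here is a minimal sketch in plain Python of the Bradley-Terry preference probability and the loss used to fit a reward model to human choices. All function names and the value of `beta` are hypothetical, a real reward model is a large neural network rather than two hand-written numbers, and PPO itself is not shown. The last function assumes the regularization term is the common log-ratio penalty against a reference (SFT) model; the talk only says "some regularization term".

```python
import math


def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def preference_probability(reward_a, reward_b):
    """Bradley-Terry: exp(r_a) / (exp(r_a) + exp(r_b)).

    The rewards act as the logits of a two-way softmax, which equals a
    sigmoid of the reward difference.
    """
    return math.exp(log_sigmoid(reward_a - reward_b))


def reward_model_loss(reward_preferred, reward_rejected):
    """Negative log-likelihood of the human's choice (preferred vs rejected)."""
    return -log_sigmoid(reward_preferred - reward_rejected)


def regularized_reward(reward, logp_policy, logp_reference, beta=0.1):
    """Reward minus a penalty for drifting away from the reference model.

    ASSUMPTION: the penalty is beta * log(p_policy / p_reference) for the
    sampled answer; beta = 0.1 is an arbitrary illustrative value.
    """
    return reward - beta * (logp_policy - logp_reference)


if __name__ == "__main__":
    # Equal rewards mean the model has no preference between the answers.
    print(preference_probability(0.0, 0.0))
    # A larger reward gap means a more confident preference and a lower loss.
    print(preference_probability(2.0, 0.0), reward_model_loss(2.0, 0.0))
```

The continuous reward difference is what carries the "how much better" signal that a plus one or minus one reward lacks.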
[78:55] one thing to note is that uh PPO has many [78:58] challenges reinforcement learning is [79:00] something that's super nice [79:01] theoretically in practice anyone who [79:03] ever worked with reinforcement learning [79:04] knows it's such a mess uh there's a lot [79:07] of things like rollouts outer loops [79:08] clipping so many complications um so [79:12] it's messy this is the idealized PPO used [79:14] for LLM settings so that's already much [79:16] more complicated than this expectation [79:18] we saw before and in practice it's [79:19] actually much more complicated so we [79:21] have one implementation of it that we [79:22] had to do and I'm not going to go [79:24] through it but basically you have like [79:26] so much stuff that you have to think [79:27] about when you implement that type of [79:30] uh PPO algorithm so you have clipping [79:32] everywhere you have a lot of [79:34] complexities and things are not well [79:36] documented all this to say um that [79:40] there was a new method that was [79:41] proposed uh also from Stanford one year [79:44] ago called DPO which is essentially a [79:46] simplification of PPO um and the way uh [79:51] what they did or the idea that they have [79:53] is that instead of using reinforcement [79:55] learning you can just maximize the [79:57] probability of generating the stuff that [79:58] you like and minimize the probability [80:00] of the stuff that you don't like uh so [80:02] if you think about the human preference [80:04] the red and green maximize uh green [80:07] minimize red um so the loss is actually [80:10] this one uh where what you see this is [80:12] simply um some log of the model so this [80:17] is the likelihood of a model generating [80:18] the things that the human preferred [80:20] given the the inputs um and what you try [80:24] to do is basically [80:26] maximize uh the likelihood of generating [80:29] the things that you like minimize the [80:31] likelihood of the things
that you don't [80:32] like um all the rest of the terms here [80:35] it's not too important it's actually [80:37] really not that complicated to [80:39] understand but at a high level it's [80:41] really just maximizing the things you [80:42] like minimizing the the rest um and one [80:46] thing to note uh which I was going to [80:48] say just here is that actually all the [80:50] rest is chosen such that um the global [80:53] minima of PPO and the global minima of [80:57] like this DPO under some assumptions are [80:59] essentially equivalent so this is the [81:02] right thing to do mathematically I'm not [81:04] going to go through the derivations but [81:06] that's the right thing to do uh it's [81:08] pretty different from PPO in the sense [81:09] that with PPO what you had to do [81:11] is collect the human preferences then [81:13] train a uh reward model with maximum [81:15] likelihood then use reinforcement [81:17] learning now all you do is basically [81:18] maximum likelihood much simpler yes I [81:21] mean yeah so it seems like this is a [81:23] much simpler and b like what you'd just [81:25] intuitively do if this why did they [81:28] start with this reward model like what [81:30] what led them to doing that I think it's a [81:32] great question uh I don't really know [81:34] what I can tell you is that at OpenAI [81:37] the people who did the um uh who did [81:40] basically this PPO uh sorry who did ChatGPT [81:43] initially are the ones who actually [81:46] wrote PPO and I think they were just like [81:48] there are a lot of reinforcement [81:49] learning people and I think that for [81:51] them it was very intuitive um so there's [81:55] also some additional like potential [81:57] benefits for example I don't want to [82:01] yeah for example if you use the reward [82:02] model uh the cool thing here with [82:04] reinforcement learning is that you can [82:05] use unlabeled data with the reward model [82:08] so here you can only use the labeled data
[82:10] for doing DPO um for PPO you first [82:14] train your reward model and then you can [82:16] use unlabeled data uh where the reward [82:18] model will basically label this [82:20] unlabeled data so there there's [82:21] additional kind of potential uh [82:25] there could be potential improvements in [82:27] practice it happens that there are none and I [82:29] think it's just that a lot of people in this [82:32] team were reinforcement learning experts [82:34] including uh the main author of PPO John [82:37] Schulman um so DPO is much simpler than PPO and [82:41] basically performs as well uh so now [82:43] this is the standard uh thing that [82:45] people use at least in the open source [82:47] community I believe it's actually the [82:48] standard also in in industry so that's [82:52] called DPO gains [82:55] um so those are all the papers on the [82:57] left here this is on a summarization [82:59] task you see all I want to show you is [83:01] that basically the pre-trained models uh [83:03] were okay and they improve with scale if [83:05] you do supervised fine-tuning you [83:07] improve them a little bit more if you do [83:09] PPO or something with RLHF with human [83:11] feedback you get performance that is [83:14] oftentimes depending on the benchmark [83:16] even better than uh humans so this is [83:19] the human uh reference summaries same [83:21] thing this is on a uh on a paper that we [83:23] have AlpacaFarm [83:25] where we see uh the evaluation here is [83:27] not too important but basically you see [83:28] pre-trained model you jump to SFT and then [83:31] you jump to PPO or DPO and PPO and DPO have the exact [83:34] same [83:35] performance so basically RLHF helps [83:38] that's kind of the conclusion and DPO is [83:41] simple uh data uh the way that you [83:44] collect that type of data um first idea [83:47] is just use humans as we already talked [83:50] about uh guidelines are very complicated [83:52] for what humans should be labeling and [83:54] and it's really
not that easy and [83:55] actually if you ever do some of the [83:57] labeling you will see that it's [84:00] extremely complicated like if I zoom in [84:02] to this uh here I have a question tell [84:05] me about self-driving cars and you [84:07] read both self-driving cars are vehicles [84:09] that are capable of detecting their [84:10] surroundings blah blah blah self-driving [84:12] cars are cars that are equipped with [84:13] sensors blah blah blah to navigate [84:15] without the need for a driver I mean [84:17] both seem okay like which one is better [84:19] it's actually hard to say at a glance um [84:22] and as a result uh the problem with [84:23] humans is that you will start optimizing [84:27] a lot of like high-level features for [84:28] example the second one is longer I can [84:30] guarantee you that most humans will [84:31] choose the second one even though I mean [84:34] maybe the first one is better I don't [84:35] know I haven't read it carefully so [84:38] challenges with humans first slow and [84:41] expensive uh second as I just mentioned [84:44] it's hard to focus on things that matter [84:46] like correctness and people uh usually [84:49] look at things that don't matter as much [84:51] like the form like length uh and as a [84:54] result so what I show here is that uh [84:55] when you do RLHF the more you do of RLHF [84:58] the longer the outputs of the [85:00] models become so if you've ever been [85:02] annoyed at ChatGPT answering you with super [85:04] long sentences this is because of [85:07] RLHF um annotator distribution shift uh [85:11] like the distribution of annotators that [85:13] you use matters a lot and you have to [85:15] think like what is what is even the [85:17] humans that we want to represent in [85:18] these models uh another question is like [85:20] crowdsourcing ethics uh like usually [85:23] basically a lot of the the [85:25] labeling that is done um like the people [85:28] who do them are not paid well and
they [85:30] have to go through a lot of toxic data [85:32] uh because you basically want the model [85:34] to avoid saying the toxic data um so [85:37] crowdsourcing ethics [85:39] too so many challenges with human data [85:42] um so what we did also last year is [85:45] again the same thing as Alpaca just the [85:47] idea of like oh well there are challenges [85:49] with humans maybe we can just replace [85:50] them with LLMs uh so what we did is [85:53] simply replace [85:54] um oh I see that I'm just realizing that [85:57] the slides are not centered anyways uh you [86:00] replace the human preferences with LLM [86:01] preferences uh so here on this uh figure [86:04] you see on the x-axis the price that we [86:06] paid uh for collecting human data it's [86:09] around [86:10] $300 for 1,000 examples and this is on [86:13] Mechanical Turkers which are usually [86:15] like cheaper than than maybe some of the [86:17] other um companies that you could go [86:20] through and on the y-axis it's basically [86:22] the agreement with uh other humans with [86:25] the mode of other humans and what you [86:27] see is that actually as I told you [86:28] before labeling is really complicated [86:30] humans agree with themselves only around [86:33] 66% of the time on a binary task and it's [86:36] not that the humans are not good here [86:38] because uh we were five main authors on [86:40] this paper we tried to label this data [86:43] ourselves and we only had like say 67 or [86:46] 68% accuracy even though we talked like we [86:48] talked for like 3 hours about how we should [86:50] be doing labeling really it's [86:52] complicated it's not an easy task um and [86:54] here I just showed many different models [86:56] and um basically you see that models are [86:58] much cheaper and they can actually get [87:00] higher agreement with the mode of humans [87:02] than humans themselves and the [87:04] reason why is because humans have a lot [87:06] of variance models have no variance so they
[87:08] might be a little bit more biased but [87:09] have less variance uh so it works [87:12] surprisingly well and now it's kind of [87:14] the standard in the open-source community [87:16] I think even in industry a lot of people [87:18] use both humans and LLMs for improving [87:21] uh the collection of RLHF data [87:24] um and this is like this is the paper [87:26] from last year but honestly now it's [87:28] more like that LLMs would be around this [87:30] agreement and this cost so around I [87:32] would say 50x cheaper than humans and [87:34] better agreement with humans than humans [87:38] themselves okay so that gets us to [87:41] evaluation of post [87:43] training um that goes back to your [87:46] initial question at the beginning of the [87:47] lecture how do you evaluate something [87:49] like ChatGPT uh the answers that ChatGPT could [87:51] give are basically unbounded and it's [87:54] not that there's one right answer there [87:56] are many answers that are just as good [87:58] um so there are many challenges one you [88:01] can't use validation loss because one [88:05] method might use PPO the other one might [88:06] use DPO validation loss is not [88:08] comparable second you can't use uh [88:10] sorry perplexity that's the thing I told [88:12] you before these models uh are not [88:15] calibrated they don't give distributions [88:17] they just optimize for one thing so [88:19] you can't use perplexity for actually [88:21] evaluating uh these type of models once [88:23] they're aligned third [88:27] uh there's a large diversity of [88:28] questions that humans might ask to these [88:30] models generation open QA like some [88:33] question answering some summarization [88:35] and all of these things so there's so [88:36] many things you have to cover um then [88:39] the tasks are really open-ended so it's [88:41] very hard to automate so that's what you [88:43] were alluding to before so the idea uh [88:46] is that instead of 
trying to come up [88:48] with really easily automated uh [88:50] benchmarks uh it's just we're going to [88:52] ask questions that users actually [88:54] ask to these models in practice and [88:56] we're just going to ask annotators to [88:58] say between these two models which one [89:01] is better like what's the [89:02] better output so basically do the exact same [89:04] thing as um basically the data from RLHF [89:08] but you use it now for evaluation yes [89:10] I'm not sure I understand what you mean [89:12] by like can't use perplexity and not [89:13] calibrated right like the LLM is still doing [89:16] like next token [89:18] prediction so I can't so think about um [89:23] the optimal solution after doing PPO is [89:26] basically one model that gives you uh [89:28] essentially a delta um like basically [89:31] says that there's only one sentence that [89:33] could be generated for that [89:36] question so now if you use it on [89:38] something that is slightly semantically [89:39] different it would actually [89:41] give a likelihood of zero for that [89:43] answer so in reality it's not that [89:45] extreme because as you say it's still a [89:47] distribution but it just shows you that [89:48] there's a fundamental issue [89:50] with perplexity once these models are [89:53] not LLMs anymore they were not trained [89:56] at least with PPO they were not trained [89:57] to do maximum likelihood anymore they [89:59] were trained to be [90:02] policies okay um so probably the most [90:05] common or like the most um yeah the most [90:08] common benchmark or the most trusted one [90:11] is what we call uh Chatbot [90:12] Arena uh which is basically go on the [90:15] internet have random users on the [90:17] internet blindly talk with two chatbots [90:19] just ask many questions see the two [90:21] answers and rate which one is better [90:24] and you do that over hundreds of [90:25] thousands of users and then you get 
uh [90:27] the actual preferences and you get [90:29] rankings of models uh so you can go [90:31] right now on Chatbot Arena and actually [90:34] interact with these models um one [90:36] potential issue just to highlight is [90:38] that well people who want to do these [90:40] type of things are usually more like [90:41] tech driven um or like tech-savvy uh so a [90:44] lot of the questions that they will ask [90:46] are more like tech stuff discussing [90:47] software errors inquiries about AI tools [90:50] and all these things um so another issue [90:53] is cost and speed if you really want to [90:55] use something like this for your development [90:57] process um it will be too costly because [90:59] you would need to basically pay a lot of [91:01] humans to do that so one simple idea is [91:06] again as we said many times just use LLMs [91:08] instead of humans uh you probably know [91:11] the drill at this point uh steps for [91:13] every instruction generate outputs by [91:15] some baseline and the model that you [91:17] want to evaluate um so here you imagine [91:20] that I'm comparing an answer from ChatGPT [91:22] and from [91:24] I'm just asking a model uh another model [91:28] uh which one is better and I just [91:30] basically average that out uh yeah I [91:32] asked GPT-4 which one is better I average [91:35] that out over my entire distribution [91:37] over my entire benchmark or data set and [91:39] that gives me a win rate so a win [91:41] probability for one model compared to [91:43] another one and now you can rank models [91:46] uh and this is the AlpacaEval uh [91:49] leaderboard so the benefits of this is [91:52] that actually we show we get 98% [91:54] correlation with Chatbot Arena so very [91:56] high correlation with humans um so this [91:59] is yeah comparison with correlation with [92:01] other benchmarks and it takes less than [92:03] three minutes and less than $10 to run [92:05] so it's pretty cheap um there are [92:07] downsides though uh one of 
them is spurious [92:10] correlation um so as we already saw [92:13] before LLMs prefer this is one spurious [92:15] correlation not many I'll just talk [92:17] about one LLMs prefer longer outputs [92:18] actually humans also prefer longer [92:20] outputs but the problem or the issue [92:22] once you use LLMs is that once there's a [92:24] bias you will continue optimizing that [92:26] humans at some point I can guarantee you [92:28] if I ask a simple question and you give [92:29] me five pages of answers I'll be like no [92:31] I don't like that answer but LLMs if they [92:33] have this bias and they were trained for [92:34] that they will continue preferring [92:36] longer outputs so uh here we see um [92:41] the preference just showing that like [92:43] humans and models prefer longer outputs [92:45] um and here is another view of the [92:48] initial AlpacaEval uh benchmark [92:51] where when we [92:54] look at the win rate of GPT-4 [92:56] versus actually uh GPT-4 itself [93:00] if we use the standard GPT-4 it gets 50% [93:02] kind of by definition because we're [93:03] comparing GPT-4 versus GPT-4 but if we ask [93:07] GPT-4 to be slightly more verbose so we [93:09] just say in the prompt be verbose in your [93:11] answers then it gets a win rate of [93:13] 64.4% so really there's a huge variance [93:16] and if we ask it to be concise it gets [93:18] 20% so there's a huge variance depending [93:20] on um whether you ask it to be concise [93:23] or verbose [93:24] that's very annoying um so one possible [93:27] solution which is what we did is uh just [93:29] use some regression analysis I'm not [93:31] going to go into details but basically [93:33] use causal inference tools to control for [93:35] length and right now uh actually length [93:37] matters much less so if you ask it to be [93:39] verbose we still get some gains but much [93:43] less great so that's all about post [93:46] training and now for the next eight [93:48] minutes I might 
talk about systems or [93:50] just answer questions yes can you um go [93:54] back to your post training in terms of [93:56] post training how did we tune those [93:58] parameters using the small body of [94:01] fine-tuning data and have such big [94:04] effect on the model you mentioned [94:05] earlier that there's a different set of [94:07] hyperparameters are we changing just [94:10] some of the weights the later weights or [94:11] all the weights what's actually [94:13] happening yeah uh yeah I kind of [94:15] skimmed through all of this you change [94:17] all the weights actually um industry [94:19] would change all the weights in open [94:21] source land you might have heard of [94:23] LoRA which is going to change basically [94:26] only some of the weights or actually [94:28] to be more specific it's going to add [94:30] some differences to the output of [94:32] every layer but in industry [94:34] you're going to just fine-tune all the [94:36] weights um and also to say something [94:39] else about the data actually for the last step [94:41] RLHF you're usually going to collect uh a [94:44] lot more data than with SFT so if SFT is [94:46] like 5,000 10,000 maybe 50,000 with RLHF [94:51] I think you're going to be more around [94:52] like the 1 million [94:54] uh order of magnitude it's still much [94:55] less than pre-training though yeah [94:57] because pre-training is 15 trillion [94:59] tokens I mean this is like that's not [95:01] even a drop and yet you influence the [95:04] weights a lot so because you do it I mean [95:06] you have to think that how you do it is [95:08] you use um I mean as I said the learning [95:12] rate that you're going to use is going [95:13] to be different but also you only do [95:15] that so just imagine if I train even if [95:18] I train on one sentence but over and [95:20] over again at some point my model [95:23] will only output that sentence even if uh it [95:26] was just one sentence instead of the 15 [95:28] 
trillion tokens so if you use a large [95:30] enough learning rate and for enough time [95:32] you will basically overfit that sentence [95:35] so the key thing to remember [95:37] is that um the data is not it's not as [95:40] if you mix some post-training data and [95:42] some pre-training data you do [95:44] pre-training and then you just start [95:46] fine-tuning only on the post-training data so [95:48] another way maybe another perspective is [95:50] that the pre-training is just [95:52] the initialization of your model [95:54] and once you view it that way that this [95:55] is just initialization of weights then [95:58] there's nothing special like you don't [96:00] need to remember that you trained on a lot of [96:01] data before the only thing that matters [96:03] is that you had an initialization and [96:05] now I actually train a model so maybe [96:06] think about it that way like there's [96:08] a Markov property in some way [96:10] just like you had your weights this is [96:11] my initialization now I'm training that [96:13] one does that kind of answer your [96:15] question kind of but you said something [96:19] just now about it's almost the [96:21] equivalent of just rerunning the fine-tuning [96:23] data many times is it actually is [96:26] that what actually happens in order to [96:29] give so much more preference [96:32] um you might I actually don't know right [96:36] now how they do it in industry when we [96:38] did Alpaca we had to do three epochs so you [96:40] did run it three times through it [96:43] um but I mean even the number of times [96:46] that you run it through it's actually [96:47] not important the only thing like the [96:49] only thing is kind of the [96:51] effective learning rate that's what [96:53] matters [96:54] um so [96:56] yeah [96:57] great so I think I have five minutes [97:05] right okay I might try to give a high [97:10] level overview at least of one of the [97:12] systems tricks 
systems as we said uh for [97:17] everyone the bottleneck is uh sorry compute is [97:20] the huge bottleneck uh one question you [97:22] might ask is why not buy more GPUs uh [97:25] GPUs are expensive but also are scarce [97:26] even if you have $10 million right now [97:28] you cannot buy the best GPUs um [97:32] there's oh yeah there's also some [97:34] physical limitations [97:36] when you have multiple GPUs you have to [97:37] communicate between them that takes time [97:40] um so just buying more GPUs is not that [97:42] easy um so it's really important to [97:44] think about how do you allocate [97:46] resources and how do you optimize your [97:47] pipeline so systems 101 on GPUs I'm sorry [97:51] I'm going slightly faster I hope [97:53] that some of you at least can follow uh [97:55] GPUs are basically optimized for [97:57] throughput CPUs are optimized uh for [98:00] latency so GPUs the way you have to [98:03] think about it is that [98:04] there's one command that is run on many [98:07] many cores at the same time on [98:08] different data um so this is how [98:12] you see a GPU you see there are many [98:14] different cores we call them streaming [98:16] multiprocessors which is very different [98:18] than the usual CPU architecture so just [98:20] think high throughput parallelization for [98:23] GPUs uh GPUs are optimized for fast [98:26] matrix multiplication so every time [98:28] you will do something on GPU [98:30] if you can do it with a matrix [98:32] multiplication it's going to be 10 times [98:34] faster than with anything else uh that [98:37] is a little bit annoying because it [98:38] means that we're kind of uh bottlenecked [98:40] to doing anything with matrix [98:43] multiplications um another thing to note [98:45] with GPUs is that compute has been [98:47] improving faster than memory and [98:49] communication so right now GPUs usually [98:53] are hard to keep uh like the data that [98:56] 
you send to GPUs actually has a hard time [98:59] keeping up with the processors so [99:00] most of your GPUs are actually going to [99:02] be idle if you just run normal code if [99:05] you don't optimize your code so [99:06] communication is the bottleneck and this will continue [99:09] over time another thing to know about [99:11] GPUs is that there's a memory hierarchy [99:13] this is the same thing actually with [99:14] CPUs but basically the closer you are to [99:17] your cores the less memory there is but [99:19] the faster things run if you're further [99:21] more memory slower [99:24] um okay I'm going to skip that okay [99:26] actually I'm going to say it I told you [99:28] about this uh the fact of communication [99:31] uh the metric that people usually look [99:32] at is model FLOP utilization so what is [99:34] the theoretical maximum that the GPU could [99:37] run at the number of FLOPs that you could use [99:38] per second divide sorry the [99:41] observed throughput divided by this [99:43] theoretical um maximum and in general if [99:47] you reach 50% you're very happy like [99:49] Facebook I looked at Llama was at 45 or [99:51] something like this so that means [99:54] that data doesn't come fast enough even [99:56] for these big [99:58] companies so one simple trick and that [100:00] might be the only one I'm going to tell [100:02] you about is low precision one simple [100:05] idea is that well if I'm going to put my [100:07] floats in lower precision then there's [100:09] going to be fewer bits that I have to [100:11] send to my GPUs if there's fewer bits [100:13] it's faster communication lower memory [100:15] consumption things are going to go [100:16] faster uh and for deep learning it just [100:19] happens that the decimal precision is not that [100:21] important uh so when you do matrix [100:24] multiplication when you do like for [100:26] example SGD there's already so much [100:27] noise that if you update something by [100:29] 0.01 or [100:31] 0.015 who 
cares uh so basically instead [100:34] of using uh 32 bits per float which is [100:38] um what people used to use or 64 for [100:41] example which is what you would use in [100:42] other domains you use 16 bits uh for [100:45] matrix multiplication so for every float [100:47] you use 16 bits um and for training you [100:50] have this type of like uh what we call [100:53] automatic mixed precision which is that uh [100:55] some of the things are in 32 bits others [100:57] are in 16 bits um generally [101:00] the way you should be thinking about it [101:02] is that the weights of [101:04] your model are stored in 32 bits um but just [101:07] before the computation you put [101:08] everything in 16 bits like this you [101:10] do computation super fast and at the end [101:13] you update your weights in 32 bits and [101:16] the reason why you do all the updates in [101:17] 32 bits it's just think that if your [101:19] learning rate for example is very small [101:21] you still want to be able to like make a [101:23] difference in your weights uh so all the [101:25] computation is done in 16 bits but the [101:28] weights are actually stored in 32 bits [101:30] so that's like the standard way that [101:32] people are doing it um okay I'll [101:35] actually talk just about this and then [101:37] I'll skip all the rest operator fusion [101:38] because I think this is actually pretty [101:39] cool as I just said communication is [101:41] very slow and actually every time you [101:44] use a PyTorch line it basically moves [101:46] the variable to the global memory of your GPU so [101:49] when you have something like this x.cos() [101:52] uh equal x1 and then you do x1.cos() [101:55] what is happening behind the [101:57] scenes is that you take the x which is [101:59] data you ship it to your um to your [102:02] actual processors of your GPUs you apply [102:04] the cosine you ship it back to the main [102:06] memory of your GPU and then you see the 
[102:08] next line you ship it back to the [102:10] compute to the GPU processor you apply [102:12] another cosine and you ship it back [102:14] again um so another way to see that is [102:17] that you go from your DRAM which is your [102:18] global memory in your GPU and you ship [102:21] it to compute you ship it back for every [102:23] line this is a naive way of doing it [102:25] this seems very wasteful um so the [102:29] simple idea of operator fusion is just [102:32] communicate do all the computation ship [102:34] it back once and this is exactly what [102:37] fused kernels are um so if you ever want [102:40] to make your computations in [102:44] PyTorch much faster just apply torch.compile [102:47] on your model this is going to [102:49] make your model around two times faster [102:52] and what it does is simply that it [102:53] rewrites your code uh like your [102:57] PyTorch code basically in C++ and CUDA [103:01] uh to do the communication only once [103:04] then do all the operations then uh ship [103:06] it back okay I'm not going to have time [103:09] to talk about tiling tiling is important [103:11] parallelization parallelization is important um and [103:16] mixture of experts mixture of experts is [103:18] important outlook there are many things [103:20] we haven't talked about we haven't [103:23] talked about architectures we definitely [103:25] haven't talked about inference um there [103:27] are many other things that are important [103:29] with LLMs what is the UI that you use I [103:31] mean arguably ChatGPT the big novelty [103:33] was just have a simple UI to use it [103:35] multimodality what are all the misuses [103:37] you could have uh the fact that there [103:39] might not be enough data on the internet [103:41] to train all these models legality of [103:43] data collection so many other things if [103:45] you are interested in all these topics [103:47] uh I would suggest three classes CS 224N [103:50] is probably the one 
that touches the [103:52] least on uh LLMs uh but it gives some [103:55] background and historical context um of [103:58] all the LLMs and gives kind of some [104:00] adjacent material CS 324 I think it's [104:03] called uh I think it's just called large [104:06] language models uh more in-depth reading [104:08] and lectures on everything I talked [104:09] about CS 336 which is large language [104:12] models from scratch you actually build [104:14] your own LLM uh it's an amazing class [104:18] also given by my two supervisors very [104:20] heavy workload so be careful and um [104:23] great
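
The mixed-precision point from around [101:00] (do the computation in 16 bits but keep and update the weights in 32 bits, so that a small learning rate can still change them) can be illustrated with a rough sketch. This is only an illustration and not how any training framework implements it: it uses Python's standard `struct` module to round numbers to 16-bit and 32-bit floats, and the weight `1.0`, the update `1e-4`, and the helper names are hypothetical choices that do not come from the lecture.

```python
import struct


def round_to_fp16(x: float) -> float:
    """Round a Python float to the nearest 16-bit (half precision) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def round_to_fp32(x: float) -> float:
    """Round a Python float to the nearest 32-bit (single precision) value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def apply_updates(weight, update, steps, rounder):
    """Add `update` to `weight` `steps` times, rounding after every step."""
    w = rounder(weight)
    u = rounder(update)
    for _ in range(steps):
        w = rounder(w + u)
    return w


# Hypothetical numbers: a weight of 1.0 and a small update
# (think learning_rate * gradient) of 1e-4, applied 100 times.
w16 = apply_updates(1.0, 1e-4, 100, round_to_fp16)
w32 = apply_updates(1.0, 1e-4, 100, round_to_fp32)

print("weight kept in 16 bits:", w16)
print("weight kept in 32 bits:", w32)
```

Near 1.0 the gap between two neighbouring 16-bit floats is about 0.001, so each update of 1e-4 is rounded away and the 16-bit weight never moves, while the 32-bit weight ends up close to 1.01. That is the reason given in the lecture for storing the weights in 32 bits.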
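
The model FLOP utilization metric mentioned around [99:31] is just a ratio, observed throughput divided by the theoretical maximum of the hardware. A minimal sketch, assuming both numbers are already measured in FLOPs per second; the function name and the example figures (a 1000 TFLOP/s peak and 450 TFLOP/s observed) are hypothetical and only chosen to land on the roughly 45% figure quoted in the lecture.

```python
def model_flop_utilization(observed_flops_per_second: float,
                           peak_flops_per_second: float) -> float:
    """Return observed throughput as a fraction of the hardware's peak."""
    if peak_flops_per_second <= 0:
        raise ValueError("peak_flops_per_second must be positive")
    return observed_flops_per_second / peak_flops_per_second


# Hypothetical numbers, not measurements from any real system.
peak = 1000e12      # theoretical maximum of the hardware, in FLOPs per second
observed = 450e12   # what the training run actually achieves

mfu = model_flop_utilization(observed, peak)
print(f"MFU: {mfu:.0%}")
```

By the rule of thumb given in the lecture, a value around 0.5 would already count as very good.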