[0:05] so let's get started uh so I'll be
[0:07] talking about building llms today um so
[0:10] I think a lot of you have heard of llms
[0:12] before uh but just as a quick recap uh
[0:16] llms standing for large language models
[0:18] are basically all the chat Bots uh that
[0:20] you've been hearing about recently so uh
[0:24] ChatGPT from OpenAI, Claude from
[0:27] Anthropic, Gemini, and Llama and other types
[0:30] of models like this and today we'll be
[0:32] talking about how do they actually work
[0:34] so it's going to be an overview because
[0:35] it's only one lecture and it's hard to
[0:37] compress everything but hopefully I'll
[0:39] touch a little bit about all the
[0:40] components that are needed to train uh
[0:42] some of these llms uh also if you have
[0:44] questions please interrupt me and ask uh
[0:47] if you have a question most likely other
[0:49] people in the room or on Zoom
[0:52] have the same question so please ask um
[0:56] great so what matters when training llms
[0:59] um so there a few key components that
[1:01] matter uh one is the architecture so as
[1:04] you probably all know LLMs are neural
[1:06] networks and when you think about neural
[1:08] networks you have to think about what
[1:10] architecture you're using and another
[1:12] component which is really important uh
[1:13] is the training loss and the training
[1:15] algorithm um so how you actually train
[1:18] these models then it's data so uh what
[1:21] do you train these models on um the
[1:24] evaluation which is how do you know
[1:26] whether you're actually making progress
[1:28] towards the goal of of uh llms and then
[1:32] the system component so that is like how
[1:34] do you actually make these models run on
[1:37] uh Modern Hardware which is really
[1:38] important because these models are
[1:39] really large um so now more than ever
[1:42] system is actually really an important
[1:44] topic um for
[1:46] llms so those five components um You
[1:50] probably all know that llms and if you
[1:52] don't know LLMs are all based on
[1:54] Transformers or at least some version of
[1:56] Transformers uh I'm actually not going
[1:58] to talk about the architecture today uh
[2:01] one because I gave a lecture on um
[2:04] Transformers a few weeks ago and two
[2:06] because you can find so much information
[2:08] online on uh Transformers but I think
[2:10] there's much less
[2:12] information about the other four topics
[2:14] so I really want to talk about those um
[2:17] another thing to say is that most of
[2:19] Academia actually focuses on
[2:21] architecture and training algorithm and
[2:23] losses um as academics and I've done
[2:26] that for a big part of my career is
[2:29] simply we like thinking that this is uh
[2:31] like we make new architectures new
[2:33] models and it it seems like it's very
[2:36] important but in reality honestly what
[2:38] matters in practice is mostly the three
[2:40] other topics so data evaluation and
[2:43] systems uh which is what most of
[2:45] Industry actually focuses on um so
[2:48] that's also one of the reasons why I
[2:49] don't want to talk too much about the
[2:51] architecture uh because really the rest
[2:52] is super
[2:53] important um great so overview of the
[2:56] lecture I'll be talking about
[2:57] pre-training so pre-training uh you
[2:59] probably heard that word this is the
[3:01] general word this is kind of the
[3:02] classical language modeling uh Paradigm
[3:06] uh where you basically train your
[3:07] language model to essentially model all
[3:09] of internet and then there's a post
[3:11] training which is a more recent Paradigm
[3:13] which is taking these large language
[3:14] models and making them essentially AI
[3:17] assistants um so this is more of a
[3:19] recent Trend since ChatGPT uh so if you
[3:23] ever heard of GPT-3 or GPT-2 that's really
[3:25] pre-training land uh if you heard of
[3:28] ChatGPT which you probably have this is
[3:30] really post-training land uh so I'll be
[3:32] talking about both but I'll start with
[3:34] pre-training and uh specifically I'll
[3:36] talk about what is the task of
[3:38] pre-training LLMs and what is the loss
[3:40] that people actually
[3:42] use so language modeling this is a quick
[3:45] recap uh language models at a high level
[3:48] are simply models of probability
[3:50] distribution over sequences of tokens or
[3:52] of words so it's basically some uh model
[3:56] of P of x1 to xL where x1 is basically
[3:59] word one and xL is the last one in
[4:01] the sequence or in the sentence um so
[4:04] very concretely if you have a sentence
[4:06] like the mouse ate the cheese what the
[4:08] language model gives you is simply a
[4:10] probability of this sentence being
[4:13] uttered by a human or being found
[4:16] online uh so if you have another
[4:18] sentence like the the mouse ate cheese uh
[4:22] here there's grammatical mistakes so the
[4:23] model should know that this uh should
[4:25] have some syntactic knowledge so it
[4:27] should know that this has less
[4:29] likelihood of appearing
[4:31] online uh if you have another sentence
[4:34] like the cheese ate the mouse uh then
[4:36] the model should hopefully know about
[4:38] the fact that usually cheese doesn't eat
[4:41] mice um so there's some semantic
[4:43] knowledge and this is less likely than
[4:44] the first sentence so this is basically
[4:46] at a high level what language models are
[4:49] um one word that you probably have been
[4:51] hearing a lot in the news are generative
[4:53] models uh so these are just
[4:55] models that can generate
[4:57] sentences or can generate some data uh
[4:59] the reason why we say language models
[5:01] are generative models is that once you
[5:02] have a model of a distribution you can
[5:04] simply sample from this model and now we
[5:06] can generate data uh so you can generate
[5:08] sentences uh using a language
[5:11] model so the type of models that uh
[5:14] people are all currently using are what
[5:16] we call autoregressive language models
[5:19] and the key idea of autoregressive
[5:21] language models is that you take this
[5:23] distribution over words and you
[5:26] basically decompose it into the
[5:28] distribution of the first word multiplied
[5:31] by the distribution of, or the
[5:33] likelihood of, the
[5:34] second word given the first word uh
[5:37] multiply by P of the third word given
[5:39] the first two words um so there's no
[5:42] approximation here this is just the
[5:43] chain rule of probability which you
[5:44] hopefully all know about uh really no
[5:46] approximation this is just one way of
[5:48] modeling a
[5:49] distribution uh so slightly more
[5:51] concisely you can write it as a product
[5:53] of the probabilities of the next word given
[5:56] everything which happened in the past so
[5:58] of the context and uh so this is
[6:00] what we call autoregressive language
[6:02] models again this is really not the only
[6:04] way of modeling distribution this is
[6:06] just one way uh it has some benefits and
[6:09] some downsides one downside of
[6:11] autoregressive language models is that
[6:13] when you actually sample from this
[6:14] autoregressive language model you
[6:16] basically have a for Loop which
[6:17] generates the next word then conditions
[6:20] on that next word and then generates
[6:22] another word so basically if you have a
[6:24] longer sentence that you want to
[6:25] generate it takes more time to
[6:27] generate it uh so there are some
[6:28] downsides of this current Paradigm but
[6:31] that's what we currently have so I'm
[6:33] going to talk about this
[6:34] one uh great so autoregressive language
[6:37] models at a high level um what the task
[6:40] of an autoregressive language model is is
[6:42] simply predicting the next word as I
[6:43] just said so if you have a sentence like
[6:45] she likely prefers uh one potential next
[6:48] word might be dogs and the the way we do
[6:51] it is that we first tokenize so you take
[6:55] these words or subwords you tokenize
[6:57] them um and then you give an ID for
[7:00] each token so here you have 1 2 three uh
[7:03] then you pass it through this black box
[7:05] as I already said we're not going to
[7:06] talk about the architecture you just
[7:07] pass it pass it through a model and you
[7:10] then get a distribution a probability
[7:12] distribution over the next word over the
[7:15] next token and then you sample uh from
[7:19] this distribution you get a new token
[7:21] and then you detokenize so you get a
[7:23] new ID you then detokenize and that's
[7:25] how you basically sample from a language
[7:27] model uh one thing which is important to
[7:29] note is that the last two steps
[7:31] are actually only needed during
[7:33] inference uh when you do training you
[7:35] just need to predict uh the most likely
[7:37] token and you can just compare to the
[7:39] real token which happened in practice and
[7:41] then you basically change the weights of
[7:43] your model to increase the probability
[7:45] of generating that
[7:48] token um great so autoregressive neural
[7:51] language models so to be slightly more
[7:53] specific still without talking about the
[7:54] architecture uh the first thing we do is
[7:57] that we have all of these oh sorry yes
[7:59] on the previous slide when you're
[8:01] predicting the probability of the next
[8:02] tokens does this mean that your final
[8:04] like output vector has to be the same
[8:06] dimensionality as the number of tokens
[8:08] that you have yes how do you deal with
[8:11] like if you have more like if you're
[8:13] adding more tokens to your corpus or something
[8:16] yeah so we're going to talk about
[8:17] tokenization actually later uh so you
[8:19] will get some sense of this you
[8:22] basically can't deal with adding new
[8:24] tokens I'm kind of exaggerating
[8:26] there are methods for doing it but
[8:27] essentially people don't do it um so
[8:30] it's really important to think about how
[8:32] you tokenize your text and that's why
[8:33] we'll talk about that later but it's a
[8:36] very good point to notice that
[8:37] basically the vocabulary size so the
[8:38] number of tokens that you have is
[8:40] essentially the output size of your uh
[8:42] language model so it's actually pretty
[8:44] pretty
[8:45] large okay so autoregressive neural
[8:47] language models first thing you do is
[8:49] that you take every word or every token
[8:52] you embed them so you get
[8:55] some Vector representation for each of these
[8:57] tokens um you pass them through some neural
[8:59] Network as we said it's a Transformer
[9:01] then you get a representation for all
[9:03] the words in the context
[9:06] so it's basically representation of the
[9:08] entire sentence uh you pass it through a
[9:10] linear layer as you just said to
[9:13] basically map it to the number so that
[9:16] the output the number of outputs is the
[9:17] number of tokens uh you then pass it
[9:20] through some softmax and you basically
[9:22] get a probability distribution over the
[9:25] next words given every word in the
[9:27] context
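The pipeline just described (embed, network, linear layer, softmax, then sample) can be sketched roughly as below. This is only an illustrative toy, not the lecture's code: the five-word vocabulary, the random weights, and the names `EMBED`, `W_OUT`, `next_token_distribution`, and `generate` are all assumptions, and a plain average of the embeddings stands in for the Transformer, which the lecture treats as a black box.

```python
import math
import random

# Toy vocabulary (assumed); a real LLM gets this from its tokenizer.
VOCAB = ["she", "likely", "prefers", "dogs", "cats"]
DIM = 4  # embedding size, arbitrary for illustration

_init = random.Random(0)
# One embedding vector per token (random stand-in for learned weights).
EMBED = [[_init.uniform(-1, 1) for _ in range(DIM)] for _ in VOCAB]
# Linear layer: maps a DIM-sized representation to one logit per token.
W_OUT = [[_init.uniform(-1, 1) for _ in range(DIM)] for _ in VOCAB]


def softmax(logits):
    """Turn arbitrary scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_distribution(token_ids):
    # 1. Embed every token in the context.
    vectors = [EMBED[i] for i in token_ids]
    # 2. Stand-in for the Transformer: average the embeddings.
    context = [sum(col) / len(vectors) for col in zip(*vectors)]
    # 3. Linear layer: one logit per vocabulary entry.
    logits = [sum(w * c for w, c in zip(row, context)) for row in W_OUT]
    # 4. Softmax gives a distribution over the next token.
    return softmax(logits)


def generate(token_ids, steps, rng):
    """Autoregressive sampling: one model call per generated token."""
    ids = list(token_ids)
    for _ in range(steps):
        probs = next_token_distribution(ids)
        ids.append(rng.choices(range(len(VOCAB)), weights=probs)[0])
    return ids


probs = next_token_distribution([0, 1, 2])  # context: "she likely prefers"
sampled = generate([0, 1, 2], steps=3, rng=random.Random(1))
print([round(p, 3) for p in probs], [VOCAB[i] for i in sampled])
```

Note that the output list has exactly one entry per vocabulary item, which is the point raised in the audience question, and that generating a longer continuation costs one pass through the loop per token.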
[9:30] and the loss that you use is basically
[9:32] it's essentially a task of classifying
[9:34] the next token so it's a very simple
[9:36] kind of machine learning task so you use
[9:37] the cross-entropy loss where you
[9:39] basically you look at the actual Target
[9:43] that happened which is a target
[9:45] distribution which is a one hot encoding
[9:46] which here in this case says I
[9:49] saw uh the real word that happened is
[9:51] cat so that's a one hot um distribution
[9:54] over cat and here this is the actual uh
[9:57] do you see my mouse oh yeah this is the
[9:59] distribution that you generated and
[10:00] basically you do cross entropy which
[10:02] really just increases the probability of
[10:03] generating cat and decreases the
[10:05] probability of generating all the other
[10:07] tokens one thing to notice is that as
[10:10] you all know again uh this is just
[10:12] equivalent to maximizing the text log
[10:14] likelihood because you
[10:16] can just rewrite the max over the
[10:20] probability of um this autoregressive
[10:22] language modeling task as just being this
[10:25] minimum over I just added the log here
[10:27] and minus which is just the minimum of
[10:30] the loss which is the cross-entropy loss so
[10:31] basically minimizing the loss is the
[10:33] same thing as maximizing the likelihood
[10:35] of your text any question
[10:42] questions
[10:43] okay
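As a small numeric check of the two claims above, the sketch below computes the cross-entropy against a one-hot target and compares the summed per-token loss with the log-likelihood of a sequence. All probabilities are made-up numbers for illustration, and `cross_entropy` is a helper name introduced here, not something from the lecture.

```python
import math


def cross_entropy(target, predicted):
    """H(target, predicted) = -sum_i target_i * log(predicted_i)."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)


# Next-token distribution over a tiny vocabulary ["cat", "dog", "mat"] (made up).
predicted = [0.7, 0.2, 0.1]
target = [1.0, 0.0, 0.0]  # one-hot: the word that actually came next was "cat"

# With a one-hot target, only the true token's probability survives,
# so the loss is just -log p(cat).
loss = cross_entropy(target, predicted)

# Probabilities the model gave to the true token at each position (made up).
true_token_probs = [0.5, 0.25, 0.8]
total_loss = sum(-math.log(p) for p in true_token_probs)
sequence_likelihood = math.prod(true_token_probs)  # chain rule product

# Summed loss equals minus the log-likelihood of the whole sequence.
print(loss, total_loss, -math.log(sequence_likelihood))
```

Because the summed loss is exactly the negative log of the sequence probability, pushing the loss down and pushing the likelihood up are the same optimization.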
[10:45] tokenizer um so this is one thing that
[10:48] people usually don't talk that much
[10:50] about tokenizers are extremely important
[10:53] uh so it's really important that you
[10:54] kind of understand at least uh what they
[10:56] do at a high level so why do we need
[10:58] tokenizers in the first place uh first it's
[11:01] more General than words so one simple
[11:04] thing that you might think is oh we're
[11:05] just going to take every word that we
[11:06] will have you just say every word is
[11:09] a token on its own um but then
[11:11] what happens is if there's a typo in
[11:13] your word then you might not have any
[11:16] token associated with this this word
[11:19] with a typo and then you don't know how
[11:20] to actually pass this word with a typo
[11:22] into a large language model so what do
[11:24] you do next and also even if you think
[11:27] about words words
[11:29] are fine with like Latin-based languages
[11:32] uh but if you think about a language
[11:34] like Thai you won't have a simple way of
[11:36] tokenizing by spaces because there are
[11:37] no spaces between words um so really uh
[11:41] tokens are much more General Than Words
[11:43] first thing second thing that you might
[11:45] think is that you might tokenize every
[11:47] sentence character by character you
[11:49] might say a is one token b is another
[11:51] token uh that would actually work and
[11:54] probably very well the issue is that
[11:56] then your sequence becomes super long
[11:58] and as you probably remember from the
[12:00] lecture on on Transformers uh the
[12:02] complexity uh grows quadratically with
[12:05] the length of sequences so you really
[12:07] don't want to have a super long sequence
[12:09] um so tokenizers basically try to deal
[12:13] with those two problems and give common
[12:17] subsequences a certain token and usually
[12:20] how you should think about it is that
[12:22] uh on average every token is around
[12:24] three to four letters
[12:26] um and there are many algorithms for
[12:29] tokenization I'll just talk about one of
[12:31] them to give you a high level which is
[12:32] what we call byte pair encoding which is
[12:34] actually pretty common one of the two
[12:35] most common tokenizers and the way that
[12:38] you train a tokenizer is that first you
[12:40] start with a very large Corpus of text
[12:42] and here I'm really not talking about
[12:44] training a large language model yet this
[12:45] is purely for the tokenization step uh
[12:48] so this is my large Corpus of text with
[12:50] these five words um then you associate
[12:54] every character in this Corpus of text with a
[12:57] different token uh so here I just split
[12:59] up every character with a different
[13:01] token uh and I just color coded all of
[13:04] those tokens and then what you do is
[13:06] that you go through your text and every
[13:08] time you see pairs of tokens that are
[13:11] very common the most common pair of
[13:13] tokens you just merge them so here you
[13:15] see three times the tokens t and
[13:19] o next to each other so you're just
[13:21] going to say this is a new token and
[13:22] then you continue you repeat that so now
[13:24] you have tok which happens three
[13:27] times, tok with an e that happens sorry
[13:31] two times and then token which happens
[13:34] twice and then ex which also happens
[13:36] twice so if you were to
[13:39] train a tokenizer on this Corpus of text
[13:41] which is very small that's how you would
[13:43] uh finish with a
[13:45] trained tokenizer uh in reality you do
[13:48] it on on much larger corpuses of text um
[13:51] and this is the real tokenizer of uh
[13:54] actually I think this is GPT-3 or
[13:56] ChatGPT uh and here you see how it would
[13:59] actually separate these words so
[14:00] basically you see the same thing as what
[14:01] we gave in the previous example token
[14:04] becomes its own token so tokenizer is
[14:08] actually split up into two tokens token
[14:10] and izer um so yeah that's all about
[14:14] tokenizers any questions on that yeah
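The merge procedure walked through above can be sketched in a few lines. This is a minimal illustration, not the tokenizer used by any real model: the five-word corpus is an assumption (the slide itself is not in the transcript), `train_bpe` and `merge_pair` are names introduced here, words are split apart up front so merges never cross word boundaries, and ties between equally frequent pairs are broken arbitrarily by whichever pair was counted first.

```python
from collections import Counter


def merge_pair(tokens, a, b):
    """Replace every adjacent occurrence of (a, b) with the merged token a+b."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def train_bpe(words, num_merges):
    # Start with every word split into single-character tokens.
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of tokens across the corpus.
        pairs = Counter()
        for tokens in corpus:
            pairs.update(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats any more, so stop merging
        merges.append((a, b))
        corpus = [merge_pair(tokens, a, b) for tokens in corpus]
    return merges, corpus


# Assumed toy corpus, chosen so that "t"+"o" is the most frequent pair.
words = ["tokenizer", "text", "to", "token", "index"]
merges, corpus = train_bpe(words, num_merges=3)
print(merges)
print(corpus)
```

The single characters are never removed from the vocabulary here; merged tokens are only added on top, so any string can still be represented character by character.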
[14:16] how do you deal with spaces and how do you
[14:18] deal
[14:19] with yeah so actually there's a a step
[14:22] before tokenizers which is what we call
[14:24] pre-tokenizers which is exactly what
[14:26] you just said uh so this is mostly
[14:29] in theory there's no reason to deal with
[14:31] spaces and punctuation separately you
[14:34] could just say every space gets its own
[14:36] token every um uh punctuation gets its
[14:39] own token and you can just do all the
[14:41] merging the problem is that so there's
[14:43] an efficiency question actually training
[14:45] these tokenizers takes a long time uh so
[14:48] you're better off because you have to
[14:49] consider every pair of token so what you
[14:52] end up doing is saying if there's a
[14:53] space this is very like pre-tokenizers
[14:55] are very English specific you say if
[14:57] there's a space we're not going to start
[14:59] looking at the the token that came
[15:00] before and the token that came
[15:02] afterwards so you're not merging in
[15:04] between spaces but this is just like a
[15:07] computational optimization
[15:10] you could theoretically just deal with
[15:11] it um the same way as you deal with any
[15:13] other character and yeah when you merge
[15:17] tokens do you delete the tokens that you
[15:19] merged away or do you keep the the
[15:21] smaller tokens that merge um you
[15:23] actually keep the smaller tokens I mean
[15:25] in reality it doesn't matter much
[15:26] because um usually on a large Corpus of
[15:31] text you will have actually everything
[15:32] uh but you usually keep the small ones
[15:34] and the reason why you want to do that
[15:35] is because in case as we said
[15:37] before you have some um some grammatical
[15:40] mistakes so some typos you still want to
[15:42] be able to represent these words by
[15:44] character um so yeah yes are the tokens
[15:50] unique so I mean say in this case token
[15:54] is there only one occurrence or do
[15:56] you need to leave multiple occurrences so
[15:59] they could take on different
[16:00] meanings or something oh oh I see what
[16:02] you say no no it's every token has its
[16:05] own uh unique ID um so this is a
[16:10] great question for example if you think
[16:11] about a bank which could be bank for
[16:14] like money or bank like water um it will
[16:16] have the same token but the model will
[16:18] learn the Transformer will learn that
[16:20] based on the words that are around it it
[16:23] should associate that I'm saying I'm
[16:25] being very hand-wavy here but associate
[16:27] that with a
[16:29] representation that is either more like
[16:31] the bank money side or the Bank water
[16:34] side um but that's a Transformer that
[16:35] does that it's not a
[16:37] tokenizer yes yeah so you mentioned
[16:40] during tokenization keep the smaller
[16:41] tokens you started with right like if
[16:44] you start with a t you keep the T and
[16:46] then you build your tokenizer to the point
[16:48] that you can now encode token so let's say
[16:50] maybe you didn't train on token but like
[16:52] in your data you are trying to encode
[16:54] token so how does the tokenizer know to
[16:57] encode it with token or
[17:00] with t great question you basically when you
[17:01] so when you tokenize so that's after
[17:03] training of the tokenizer when you
[17:04] actually apply the tokenizer you
[17:06] basically always choose the largest uh
[17:09] token that you can apply uh so if you
[17:11] can do token you will never do T you
[17:13] will always do token um but there's
[17:16] actually so people don't usually talk
[17:18] that much about tokenizers but uh
[17:20] there's a lot of of computational
[17:22] benefits uh or computational tricks that
[17:24] you can do for making these things
[17:26] faster uh so
[17:28] honestly I think a lot of people think
[17:29] that we should just get away from
[17:31] tokenizers um and just kind of tokenize
[17:34] character by character or byte by byte
[17:37] uh but as I said right now there's this
[17:38] issue of like length uh but maybe one
[17:40] day like in five or 10 years we will
[17:42] have different architectures that don't
[17:43] scale quadratically with the length of
[17:45] the sequence and uh maybe we'll um yeah
[17:49] move away from tokenizers so can you
[17:51] share with us the drawback why do people
[17:53] want to move away from the tokenizer oh
[17:56] um yeah so think
[18:00] one good example is uh math if you think
[18:03] about math actually numbers right now
[18:06] are not tokenized digit by digit so for example 327
[18:08] might have its own token which means
[18:10] that models when they see numbers they
[18:13] don't see them the same way as we do and
[18:15] this is very annoying because what I
[18:17] mean the reason why we can kind of
[18:18] generalize with math is because we can
[18:20] deal with every digit separately
[18:22] and we can then do composition where you
[18:24] know that basically if you add stuff
[18:26] it's just the same thing as adding every
[18:28] one separately plus like whatever the
[18:29] unit that you add so they can't do that um
[18:32] so then you have to do like special
[18:34] tokenization and like one of the big
[18:36] changes that GPT-4 did uh is changing
[18:40] the way that they tokenize uh code so
[18:43] for example uh if you have code you know
[18:45] you have like often in Python these four
[18:46] spaces at the beginning those were dealt
[18:49] with uh kind of strangely before um and
[18:52] as a result like the model couldn't
[18:54] really understand uh how to deal with
[18:56] code uh so tokenizers actually matter a lot um
[19:00] okay so I'll move on right now but we
[19:03] can come back later on tokenizers great
[19:06] so we talked about the task the loss the
[19:07] tokenizer let's talk a little bit about
[19:10] evaluation uh so the way that LLMs are
[19:12] usually evaluated is
[19:14] using what we call perplexity um at a
[19:17] high level it's basically just your
[19:18] validation loss uh the slight difference
[19:20] with perplexity is that we use something
[19:22] that is slightly more interpretable
[19:24] which is that we use the average per
[19:26] token loss and then you exponentiate it
[19:29] and the reason why you exponentiate it
[19:31] is because you want I mean the loss has
[19:33] a log inside and you like one humans are
[19:36] actually pretty bad at thinking in log
[19:37] space but two logs depend on the base of
[19:40] the log uh while when you exponentiate
[19:42] you basically have everything in the uh
[19:45] kind of the vocabulary size uh unit um
[19:48] and the average per token is just so that
[19:50] your perplexity is independent of
[19:52] the length of your sequence um so
[19:54] perplexity is just two to the power uh
[19:56] average of the loss of the sequence
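A minimal sketch of that definition follows. It assumes the loss is measured with a base-2 logarithm, which is what makes "two to the power" the right inverse; with a natural-log loss you would exponentiate with e instead and get the same number. The function name `perplexity`, the vocabulary size, and the input probabilities are all illustrative assumptions.

```python
import math


def perplexity(true_token_probs):
    """2 ** (average per-token loss), with the loss measured in bits."""
    avg_loss = -sum(math.log2(p) for p in true_token_probs) / len(true_token_probs)
    return 2 ** avg_loss


vocab_size = 50_000  # hypothetical vocabulary size

# A model that puts probability 1 on every true token.
perfect = perplexity([1.0] * 10)

# A model that spreads its probability uniformly over the vocabulary.
uniform = perplexity([1.0 / vocab_size] * 10)

# A model that is always torn between 4 equally likely tokens.
four_way = perplexity([0.25] * 7)

# Same quantity computed with natural logs and exp: the base cancels out.
natural = math.exp(-sum(math.log(p) for p in [0.25] * 7) / 7)

print(perfect, uniform, four_way, natural)
```

Averaging over tokens before exponentiating is what keeps the number comparable between a 10-token and a 7-token sequence.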
[19:59] um so perplexity is between one and the
[20:03] size of the vocabulary of your
[20:04] tokenizer uh one is simply well if you
[20:07] predict perfectly
[20:09] every word then every word will have
[20:12] basically a product of ones uh so the best
[20:15] perplexity you can have is one if you
[20:17] really have no idea you basically
[20:18] predict with one divided by uh size of
[20:21] vocabulary um and then you do simple
[20:23] math and you basically get perplexity of
[20:25] size of vocabulary uh so the intuition
[20:27] of perplexity is that basically the
[20:29] number of tokens that your model is kind
[20:31] of hesitating between uh so if
[20:33] your model is perfect it doesn't
[20:34] hesitate it knows exactly the word if it
[20:37] really has no idea then it hesitates
[20:39] between uh all of the
[20:42] vocabulary uh so perplexity really
[20:45] improved that's perplexity on a standard
[20:48] data set between 2017 and 2023 it
[20:51] went from kind of 70 tokens to less than
[20:54] 10 tokens over these five six years so
[20:56] that means that the models were
[20:58] previously hesitating between 70 words
[21:00] every time it was generating a word and
[21:02] now it's hesitating between like less
[21:04] than 10 words so that's much better
[21:06] perplexity is actually not used anymore
[21:08] in academic benchmarking mostly because
[21:10] it depends on the tokenizers that you
[21:12] use uh it depends on the actual data
[21:14] that people are evaluating on but it's
[21:16] still very important for development of
[21:18] LLMs so when you actually train
[21:20] your own llm people will still really
[21:22] look at the
[21:24] perplexity uh one common other way and
[21:28] now more common in Academia of
[21:30] evaluating these llms is just by taking
[21:33] all the classical NLP benchmarks and
[21:35] I'll give you a few examples later and
[21:37] just kind of aggregating everything um
[21:39] so collect as many automatically
[21:41] evaluatable benchmarks and just evaluate
[21:44] across all of them um so
[21:48] actually two such uh benchmarks are
[21:51] what we call uh HELM which is from
[21:53] Stanford and another one is the Hugging
[21:55] Face Open LLM Leaderboard which are
[21:57] probably the two most common ones right
[21:58] now um so just to give you an idea in
[22:02] HELM there are all of these types of
[22:03] tasks which are mostly things that can
[22:06] be easily evaluated uh like question
[22:09] answering so think about many different
[22:11] question answering uh tasks um and the
[22:14] benefit with question answering is that
[22:15] you usually know what is the real answer
[22:18] um so you can the way that you evaluate
[22:20] these models and I'll give you a
[22:21] concrete example in one second um is
[22:23] that you can just look at How likely the
[22:25] language model is to generate the real
[22:28] answer compared to some other answers
[22:30] and that's essentially at a high level
[22:32] how you evaluate these models um so to
[22:34] give you a specific example MMLU is
[22:36] probably the most common um academic
[22:39] Benchmark for
[22:41] llms uh and this is just a collection of
[22:44] many question and answers in all of
[22:46] those domains for example College
[22:48] medicine College physics astronomy and
[22:51] these type of topics and the questions
[22:53] are things like so this in astronomy
[22:55] what is true for a Type Ia supernova then
[22:58] you give uh four different potential
[23:01] answers and you just ask the model which
[23:03] one is more likely so there are many
[23:05] different ways of doing it either you
[23:07] can look at the likelihood of generating
[23:09] all these answers uh or you can ask the
[23:11] model which one is the most likely uh so
[23:13] there are different ways that you can
[23:14] prompt the model but at a high level you
[23:16] know which one is correct and there are
[23:17] three other mistakes um yes if what you're
[23:22] creating is like unconstrained text as
[23:24] the output yeah how do you evaluate a
[23:26] model if it gives something that's you
[23:29] know semantically completely identical
[23:32] but is not the exact token list that you
[23:35] expect yeah so that's a great question
[23:37] I'll talk more about that later here in
[23:39] this case we don't do unconstrained so
[23:41] the way you would evaluate MMLU is
[23:43] basically either you ask the first
[23:46] question and then you look at the
[23:47] likelihood of the model generating a the
[23:50] likelihood of the model generating b c
[23:53] and d and you look at which one is the
[23:54] most likely or you can ask the model out
[23:57] of A B C D which one is the most likely and
[23:59] you look at whether the most
[24:01] likely next token is A B C or D so uh
[24:04] you constrain the model to say it can
[24:06] only answer these four things when you say
[24:09] you constrain the model do you mean you
[24:11] constrain The Prompt or do you mean out of
[24:13] its whole probability distribution
[24:15] outputs you're only comparing the outputs
[24:18] like you're only comparing the
[24:20] a so uh in the second case I gave you
[24:23] you would do exactly the actually you
[24:24] would do both you would prompt the model
[24:26] saying A B C or D plus you would constrain
[24:28] to only uh look at these four
[24:31] tokens in the first case you don't even
[24:33] need to generate anything so in the
[24:34] first case you literally just look given
[24:36] that it's a language model it can give a
[24:38] distribution over sentences you just
[24:40] look at what is the likelihood of
[24:42] generating all of these words what is
[24:45] the likelihood of generating the second
[24:47] choice and you just look at whether the
[24:49] most likely sentence is actually the
[24:52] real answer so you don't actually sample
[24:55] from it you really just use P of x1
[24:58] to xL does that make sense uh that
[25:01] being said evaluation of open-ended
[25:04] questions is something we're going to
[25:05] talk about later and is actually really
[25:07] important and really challenging yes
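The two multiple-choice scoring approaches described above can be sketched as follows. Nothing here calls a real model: the log-probabilities are made-up numbers standing in for what a language model would return, and the names `ANSWER_TOKEN_LOGPROBS`, `NEXT_TOKEN_LOGPROB`, `pick_by_answer_likelihood`, and `pick_by_constrained_next_token` are hypothetical.

```python
CHOICES = ["A", "B", "C", "D"]
CORRECT = "B"  # assumed gold label for this toy question

# Approach 1 input: per-token log-probs of each full answer text given the
# question (made-up values; a real model would supply these).
ANSWER_TOKEN_LOGPROBS = {
    "A": [-2.0, -3.5, -1.0],
    "B": [-1.5, -2.0, -0.5],
    "C": [-2.5, -2.5, -2.0],
    "D": [-4.0, -3.0],
}

# Approach 2 input: log-probs of the next token after a prompt that lists the
# four options and asks for a letter (made-up values, includes other tokens).
NEXT_TOKEN_LOGPROB = {"A": -2.3, "B": -1.6, "C": -1.9, "D": -3.0, "The": -0.9}


def pick_by_answer_likelihood(answer_token_logprobs):
    """Score each answer by its sequence log-likelihood; no sampling needed."""
    totals = {c: sum(lps) for c, lps in answer_token_logprobs.items()}
    return max(CHOICES, key=lambda c: totals[c])


def pick_by_constrained_next_token(next_token_logprob):
    """Look only at the four letter tokens and ignore everything else."""
    return max(CHOICES, key=lambda c: next_token_logprob[c])


pred_likelihood = pick_by_answer_likelihood(ANSWER_TOKEN_LOGPROBS)
pred_constrained = pick_by_constrained_next_token(NEXT_TOKEN_LOGPROB)
print(pred_likelihood == CORRECT, pred_constrained == CORRECT)
```

In the second approach the unconstrained most likely token would have been "The", which is why the comparison is restricted to the four letters. The first approach as written sums log-probs without length normalization, which is one of several design choices an evaluation harness has to make.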
[25:10] earlier you mentioned that um like um
[25:13] metrics like perplexity are not
[25:15] like usually used because it depends on
[25:17] like how you do your tokenization some
[25:19] design choices I was wondering if you
[25:21] could speak more to that oh um yeah so
[25:25] think about perplexity I told you
[25:27] perplexity is between one and vocabulary
[25:29] size so now imagine that ChatGPT uses a
[25:32] tokenizer that has like 10,000 tokens
[25:35] but Gemini from Google uses a tokenizer
[25:37] that has 100,000 uh potential tokens
[25:41] then actually the Gemini one will
[25:44] have like the upper bound of the
[25:46] perplexity that you can get is actually
[25:47] worse for Gemini than for ChatGPT does
[25:51] that make sense so that's just an idea
[25:54] it's actually a little bit more
[25:55] complicated than that but that's just
[25:56] like one uh first-order bit of how you can
[25:58] see that the tokenizer actually
[26:01] matters um
[26:04] great okay so evaluation challenges
[26:07] there are many I'll just talk about two
[26:09] really briefly uh one as I told you
[26:11] there are two ways of doing evaluation
[26:13] for MMLU actually there are many
[26:15] more than two but I give you two
[26:16] examples um and it happens that for a
[26:19] long time even though that was a very
[26:21] classical Benchmark that everyone used
[26:23] uh actually different
[26:26] companies and different
[26:30] organizations were actually
[26:32] using different ways of evaluating MMLU
[26:35] and as a result you get
[26:36] completely different results for example
[26:38] LLaMA
[26:39] 65B uh which was the first model of Meta
[26:42] in the LLaMA series uh had on HELM 63.7
[26:47] accuracy but on this other um Benchmark
[26:50] had like
[26:51] 48.8 um so really the way that you
[26:54] evaluate and this is not even talking
[26:56] about prompting this is really just kind
[26:58] of the the way that you evaluate the uh
[27:00] the models prompting is another issue so
[27:02] really there are a lot of
[27:03] inconsistencies it's not as easy as it
[27:06] looks uh first thing yeah sorry how can
[27:09] we make sure that all these models aren't
[27:10] trained on the benchmark okay second
[27:13] thing this is a great question uh train
[27:15] test contamination uh this is something
[27:18] which I would say is really important in
[27:22] Academia in uh given that the talk is
[27:25] mostly about training large language
[27:26] models uh for companies it's maybe not
[27:29] that important CU they know what they
[27:31] trained on uh for us we have no idea so
[27:35] for us it's a real problem uh so there
[27:37] are many different ways of trying to
[27:39] test whether uh the test set sorry
[27:43] whether the test set was actually in the
[27:44] training Set uh one kind of cute trick
[27:48] um that people uh in Tatsu's lab
[27:52] have found is that what you can do is
[27:54] that given that most of the datasets
[27:56] online are not randomized
[27:58] and given that
[28:00] language models what they do is just
[28:02] predict the next word um you can just
[28:04] look at the entire test set uh what if
[28:07] you generate all the examples in order
[28:10] versus all the examples in a different
[28:13] order and if it's more likely to
[28:15] generate a thing in order given that
[28:17] there's no real order there then it
[28:19] means that it probably was in the training
[28:20] set does that make sense um so there are
[28:23] many that's like one of them there are
[28:24] many other ways of doing it train test
[28:27] contamination again not that important
[28:28] for development really important for
[28:30] academic
[28:32] benchmarking great so there are many
[28:34] other challenges but uh I'll move on for
[28:36] now great data um so data is another
[28:41] really big topic um at a high level
[28:44] people just say oh you basically train
[28:46] large language models on all of Internet
[28:48] what does that even mean um so or people
[28:51] sometimes say all of clean internet
[28:53] which is even less defined um so
[28:56] internet is very dirty and really not
[28:58] representative of what we want in
[29:00] practice if I download a random website
[29:03] right now you would be shocked at what
[29:05] is in there it's definitely not your
[29:07] Wikipedia um so I'll go really briefly
[29:12] on like what people do um I can answer
[29:14] some questions but I mean data is on its
[29:17] own is a huge topic uh basically first
[29:20] what you do is download all of Internet
[29:22] what that means is that you use uh web
[29:24] crawlers that will go on every web page
[29:27] on the internet or every web page that is um
[29:30] on Google uh and that is around 250
[29:34] billion pages right now um and that's
[29:36] around one petabyte of data so this
[29:39] is actually Common Crawl is one web
[29:42] crawler so people don't usually write
[29:43] their own web crawlers what they do is
[29:45] that they use standard web crawlers and
[29:47] Common Crawl is one of them uh that
[29:50] basically every month adds all the new
[29:52] websites that were added on uh the internet
[29:55] that are found by Google and they put
[29:57] it in a big uh basically a big data set
[30:00] um so on Common Crawl you have
[30:02] around 250 billion pages right now so
[30:05] 1e6 gigabytes of data once you have this
[30:09] uh so this is a random web page like
[30:11] literally random uh from this Common
[30:13] Crawl and what you see is that one it
[30:15] really doesn't look like the type of things
[30:17] that you would usually see but actually
[30:19] so this is an HTML page uh it's hard to
[30:22] see but if you look through you will see
[30:25] some content for example here here uh
[30:29] tesing world is your ultimate source for
[30:32] the system X high performance server and
[30:34] then you have three dots so you don't
[30:35] even the sentence is not even finished
[30:37] that's how a random internet looks like
[30:40] uh so of course it's not that useful if
[30:42] you just train a like large language
[30:44] model to generate things like this so
[30:46] what are some of the steps that are
[30:47] needed first one you extract the text
[30:50] from the HTML so that's what I just tried
[30:52] to do by looking at uh basically the
[30:54] correct text uh there are a lot of
[30:56] challenges in doing this for example
[30:58] extracting math is actually very
[31:00] complicated but pretty important for
[31:01] training large language models um or for
[31:04] example boilerplate a lot of your
[31:06] forums will have the same type of
[31:07] headers the same type of footers uh you
[31:10] don't want to repeat all of this in your
[31:12] data um then you will filter undesirable
[31:15] content uh so not safe for work harmful
[31:19] content PII uh so usually every company
[31:21] has basically a blacklist of websites
[31:25] that they don't want to train the models
[31:27] on that blacklist is very long and you
[31:29] basically say if it comes from there we
[31:31] don't train on this there are other ways
[31:32] of doing these things which is that you can
[31:34] train a small model for classifying what
[31:36] is PII and removing these things um it's
[31:40] hard every point here that I'm going to
[31:42] show you is like a huge amount of work
[31:46] uh but I'm going to go go quickly
[31:47] through it so filter undesirable content
[31:50] second or fourth is the
[31:53] deduplication as I said um you might have
[31:56] things like headers and footers in
[31:58] forums that are always the same you want
[32:00] to remove that another thing that you
[32:01] might have is a lot of URLs that are
[32:04] different but actually show the same
[32:06] website um and you might also have a lot
[32:10] of like um paragraphs that come from
[32:13] like common books that are basically
[32:14] duplicated a thousand times or 10,000
[32:17] times on the internet so you have to
[32:19] deduplicate also very challenging uh
[32:22] because you have to do that at scale
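The simplest version of the deduplication step described above is exact matching on normalized text. This is only a sketch under that assumption: the function names and sample paragraphs are hypothetical, and production pipelines typically need fuzzy matching and distributed processing that are not shown here.

```python
import hashlib


def normalize(paragraph: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(paragraph.lower().split())


def deduplicate(paragraphs):
    """Keep the first occurrence of each paragraph, comparing by content hash."""
    seen = set()
    kept = []
    for paragraph in paragraphs:
        # Store a fixed-size digest instead of the full text to bound memory.
        digest = hashlib.sha256(normalize(paragraph).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(paragraph)
    return kept


# Hypothetical input: the same book passage scraped twice with different spacing.
docs = ["Call me Ishmael.", "call me   ishmael.", "Some other paragraph."]
print(deduplicate(docs))
```

Storing only hashes keeps the bookkeeping small per document, which is one reason hashing is a common starting point when the corpus has billions of pages.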
[32:24] once you do deduplication you will do some
[32:26] heuristic filtering you will try to
[32:28] remove low quality documents uh the way
[32:31] you do that are things like rules-based
[32:33] um filtering for example if you see that
[32:36] there are some outlier tokens if the
[32:38] distribution of tokens in the website is
[32:39] very different than the usual
[32:40] distribution of tokens then it's
[32:42] probably some outlier if you see that
[32:44] the length of the words in this website
[32:46] is super long there's something strange
[32:48] going on on that website if you see that
[32:50] the the website has only three words
[32:52] maybe is it worth training on it maybe
[32:54] not if it has like 10 million words
[32:56] maybe there's something also
[32:58] wrong going on that page um so a lot of
[33:00] rules like this yes why do we filter out
[33:03] undesirable content from our dataset
[33:05] instead of kind
[33:06] of putting it in as like a supervised
[33:09] loss right like can we not just say like
[33:12] you know here's this like hate speech
[33:13] website let's actively try to let's
[33:17] actively penalize the model for generating it
[33:19] we'll do exactly that but not at this
[33:22] step that's where the post-training will
[33:24] come in uh pre-training um the idea is
[33:28] just to say I want to model kind of how
[33:32] humans speak essentially um and I want
[33:35] to remove all these like headers footers
[33:36] and and menus and things like this but
[33:38] it's a very good uh like idea that you
[33:41] just had and that's exactly what we'll
[33:42] do
[33:44] later next step model-based filtering so
[33:47] once you filtered a lot of data what you
[33:49] will do uh that's actually a very cute
[33:51] trick uh you will take all of Wikipedia
[33:53] and you will look at all the links that
[33:56] are linked from Wikipedia pages
[33:58] because probably if something is
[33:59] referenced by Wikipedia it's probably
[34:01] some high quality website and you will
[34:03] train a classifier to predict whether
[34:06] something comes from whether a document
[34:09] comes from one of these references uh
[34:12] from Wikipedia or whether it's from the
[34:14] random web and you will try to basically
[34:16] say I want more of the things that come
[34:19] from Wikipedia references does that make
[34:22] sense so yeah so you will train a
[34:25] machine learning uh model usually also
[34:27] very simple models because you need
[34:28] to do that really at scale I mean just
[34:30] think about the 250 billion
[34:32] pages uh next one you will try to
[34:36] classify your data into different
[34:38] different um domains you will say okay
[34:41] this is entertainment this is books this
[34:43] is code this is like these type of
[34:45] domains and then you will try to either
[34:49] um up or down weight some of the domains
[34:52] uh for example you might
[34:54] see that actually if you train
[34:56] more on code then actually your model
[34:58] becomes better at reasoning so that's
[34:59] something that people usually say in a
[35:01] very handwavy way if you train your
[35:03] model more on code actually it helps
[35:04] reasoning so you want to upweight the
[35:07] coding uh distribution because that
[35:09] helps for general language modeling
[35:10] skills uh books is usually also another
[35:13] one that people usually um upweight
[35:16] entertainment they usually downweight uh
[35:18] so things like this of course you want
[35:20] to do it so people used to do it maybe
[35:23] uh kind of heuristically now there are
[35:25] entire pipelines that we'll talk about
[35:28] of how to do these things uh slightly
[35:30] more um
[35:32] automatically and then at the end of
[35:34] training uh you usually um after
[35:38] training on all of this data that we saw
[35:40] you usually train on very high quality data
[35:42] at the end of training your large
[35:45] language model where you decrease your
[35:46] learning rate uh and that basically
[35:48] means that you're kind of overfitting
[35:50] your model on very high quality data
[35:52] so usually what you do there is like
[35:54] Wikipedia you basically overfit on
[35:57] Wikipedia yeah and you overfit on like
[36:00] human uh data that was collected um the
[36:04] other things like continual pre-training
[36:06] for getting longer context I'm I'm going
[36:08] to skip over all of these things uh but
[36:10] I just to give you a sense of how hard
[36:12] it is when people just say oh I'm going
[36:13] to train on the internet that's a lot of
[36:16] work um and really we haven't figured it
[36:18] out yet so collecting data well is a
[36:22] huge part of practical large language
[36:24] models uh some might say it's actually
[36:26] the key yes
[36:28] about data so basic question so usually
[36:30] when you start with like the terabyte of
[36:33] data after you go through all the steps
[36:35] what's the typical amount of data you have left
[36:37] and then like how how large a team does
[36:40] it typically take to go through all the
[36:42] steps you talked about so
[36:44] the question is how large is the data after you
[36:46] filter yeah after you filter and then to
[36:48] go through all the steps how large a team
[36:50] do you need to go through like the
[36:52] filtration steps uh how slow is it or
[36:56] how like how how many people would you
[36:58] need to be able to do this uh okay
[37:02] that's a great question I'm going to
[37:04] somewhat answer about the data uh how
[37:06] large is the data set uh at the end of
[37:08] this slide uh for number of people that
[37:12] work on
[37:13] it um that's a good question I'm
[37:15] actually not quite sure but I would
[37:18] say yeah I actually don't quite know but I
[37:22] would say it's probably even bigger than
[37:24] the number of people that work on kind
[37:26] of the tuning of the pre-training of
[37:28] the model uh so the data is bigger than
[37:31] kind of the modeling aspect um yeah I
[37:35] don't think I have a good sense I would
[37:38] say probably in LLaMA's team which has
[37:40] like 70-ish people I would say maybe
[37:42] 15 work on data uh yeah all these
[37:47] things you don't need that many people
[37:48] you need a lot of compute also because
[37:49] for data you need a lot of CPUs um so
[37:53] yeah and I'll answer the second question
[37:54] at the end of this slide so as I just
[37:57] kind of alluded to really we haven't
[38:00] solved data at all for pre-training so
[38:02] there's a lot of research that that has
[38:03] to be done first how do you process
[38:05] these things super efficiently uh second
[38:07] how do you balance kind of like all of
[38:09] these different domains uh can you do
[38:11] synthetic data generation that's
[38:12] actually a big one right now uh and
[38:15] because we don't have uh we'll talk
[38:16] about that later we don't have enough
[38:18] data on the internet um can you use
[38:21] multimodal data instead of just text
[38:23] data and how does that improve even your
[38:25] text performance um
[38:28] there's a lot of secrecy because really
[38:30] this is the key of most of the
[38:32] pre-trained large language models so for
[38:34] competitive dynamics uh usually
[38:37] these companies don't talk
[38:40] about how they do the data collection
[38:41] and also there's a copyright liability
[38:43] issue they definitely don't want to tell
[38:44] you that they've trained on books even
[38:46] though they did um because if not you
[38:48] can uh sue them uh common academic
[38:51] benchmarks uh so that will kind of
[38:53] answer what you asked um it started so
[38:55] those are the smaller ones it's the
[38:57] names are not that important but it
[38:59] started from around 150 billion tokens
[39:02] which is around uh 800 GB of data now it's
[39:05] around 15 trillion tokens
[39:07] which is also uh the size of the
[39:10] models that are right now the best
[39:12] models are probably trained on that
[39:13] amount of data so 15 trillion tokens uh
[39:16] which is probably I guess two orders of
[39:19] magnitude bigger than that so 80 uh E3 GB
[39:23] so that would be
[39:25] around 100 to thousand times uh
[39:28] filtering of the common crawl if I'm not
[39:31] mistaken um so yeah one very one very uh
[39:35] famous one is the Pile so this is the
[39:37] academic benchmark of the Pile and we
[39:39] can just look at what distribution of
[39:41] data they have it's things like um
[39:44] arXiv PubMed Central uh which is all the
[39:47] the biology stuff uh here it's Wikipedia
[39:52] you see Stack Exchange um some GitHub
[39:56] and some books and things like this um
[39:58] again this is on the smaller side so
[40:00] this is if we look at here this is on
[40:01] 280b so in reality it's like 100 times
[40:04] bigger so you cannot have that much of
[40:05] GitHub and and of
[40:07] Wikipedia um in terms of closed-source
[40:10] models just to give you an idea uh LLaMA
[40:13] 2 um it was trained on 2 trillion
[40:16] tokens LLaMA 3 15 trillion tokens which
[40:19] is currently the best model that we know
[40:21] how much it was trained on which is
[40:22] the same thing as this the best
[40:25] academic or the biggest academic
[40:27] benchmark which is 15 trillion tokens
[40:29] GPT-4 we don't really know but it's
[40:30] probably in the same order of magnitude
[40:32] or it's probably around that actually
[40:33] it's probably around 13 um from leaks if
[40:36] the leaks are true
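The dataset sizes quoted above can be tied together with one line of arithmetic. This is a rough sketch using only the lecture's round numbers (150 billion tokens at about 800 GB, then 15 trillion tokens); the variable names are illustrative, and the bytes-per-token figure is an average implied by those numbers, not a property of any particular tokenizer.

```python
# Round numbers quoted in the lecture for the early academic datasets.
small_tokens = 150e9      # about 150 billion tokens
small_gigabytes = 800     # about 800 GB of text

# Implied average storage per token (1 GB taken as 1e9 bytes).
bytes_per_token = small_gigabytes * 1e9 / small_tokens

# Scale up to the 15 trillion tokens quoted for current datasets.
large_tokens = 15e12
large_gigabytes = large_tokens * bytes_per_token / 1e9

print(f"bytes per token: {bytes_per_token:.2f}")
print(f"estimated size of 15T tokens: {large_gigabytes:.0f} GB")
```

The estimate is two orders of magnitude above 800 GB, which matches the 80e3 GB figure given in the lecture.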
[40:39] um great so scaling laws um any other
[40:43] questions on Data before you go to
[40:45] scaling
[40:48] laws sorry I know I'm giving you a lot
[40:50] of information but uh there's a lot that goes into
[40:52] training large language models great
[40:55] scaling laws so the idea is that what
[40:58] people saw um around 2020 or at least
[41:01] for a long time but they've been able
[41:03] to kind of theoretically show it or
[41:06] empirically show it since 2020 is that the
[41:08] more data you train your models on and
[41:10] the larger the models the better the
[41:11] performance this is actually pretty
[41:13] different than what you've seen in this
[41:15] class in this class we teach you about
[41:16] overfitting overfitting doesn't happen
[41:18] with large language models uh larger
[41:21] models better performance um it's
[41:24] something that really took a long time
[41:25] for the community who took this type of
[41:27] class to realize um but for the exam
[41:31] overfitting
[41:32] exists so okay the idea of scaling laws
[41:36] is that if given that you know that more
[41:38] data and larger models will always give
[41:41] you better performance can we predict
[41:44] how much better your performance will be
[41:46] if you increase the amount of data and
[41:48] the size of your model and surprisingly
[41:51] it works uh so here you see three plots
[41:53] from a very famous paper called Scaling
[41:55] Laws from OpenAI um here you see on the
[41:58] x-axis compute so how much did you train
[42:01] like how much compute did you did you
[42:02] spend for training and here you see test
[42:04] loss so this is essentially I mean it's
[42:07] not perplexity but it's your validation
[42:08] loss um so it's a log of the perplexity
[42:11] and if you put these two on uh log scale
[42:15] uh then you see that uh the the
[42:17] performance or like the this the sorry
[42:20] the the scaling law is linear uh that
[42:22] means that if you increase your compute
[42:25] by a certain amount you can you can say
[42:26] by how much your test loss will actually
[42:29] decrease same thing with data and same
[42:32] thing for parameters if you increase the
[42:34] data set size your loss will will
[42:36] decrease by an amount that is somewhat
[42:39] predictable if you increase the number
[42:40] of parameters the loss
[42:43] will decrease by an amount which is
[42:44] somewhat predictable this is really
[42:46] amazing um very surprising I mean it
[42:49] looks innocuous when you look at these
[42:51] type of plots but that's crazy because
[42:53] it means that you can predict uh how
[42:55] well we're going to perform in 2 3 years
[42:58] depending on how much compute we will
[42:59] add assuming that these things will hold
[43:01] there's nothing theoretical about it um
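A straight line on a log-log plot corresponds to a power law, so fitting a scaling law reduces to linear regression on the logarithms. The sketch below assumes a pure power law with no irreducible-loss term, and the compute and loss values are made up for illustration; none of the numbers come from the paper, and the helper names are hypothetical.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares line through (log10 x, log10 y); returns (slope, intercept)."""
    log_x = [math.log10(x) for x in xs]
    log_y = [math.log10(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    variance = sum((a - mean_x) ** 2 for a in log_x)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Extrapolate the fitted line back into the original (non-log) units."""
    return 10 ** (intercept + slope * math.log10(x))


# Made-up measurements that follow loss = 20 * compute^(-0.05) exactly.
compute = [1e17, 1e18, 1e19, 1e20]
loss = [20.0 * c ** -0.05 for c in compute]

slope, intercept = fit_power_law(compute, loss)
print("fitted slope:", slope)
print("predicted loss at 1e22 FLOPs:", predict(1e22, slope, intercept))
```

The extrapolation is only as trustworthy as the assumption that the trend keeps holding, which is the same caveat given in the lecture.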
[43:04] yes two things one what is the loss that
[43:07] they're using here is this perplexity or
[43:09] so it's it's you know I said perplexity
[43:11] was like two to the power of the loss so
[43:13] this is the the the log of the
[43:16] perplexity and then the second thing is
[43:18] when you like increase the number of
[43:20] parameters or you increase the total
[43:21] data set size going through the data multiple times doesn't
[43:25] that just inherently increase your
[43:27] compute like do all this work to
[43:31] just specific no this is a great
[43:33] question so the compute here is actually
[43:35] a factor of two things the data and the
[43:37] parameter what I'm showing here is that
[43:39] you can um well actually we're going to
[43:40] talk about that in details but basically
[43:42] if you increase the number of parameters
[43:44] you should increase the number of data
[43:46] that you have um so you actually don't
[43:49] go multiple times through the same data
[43:50] set no one does multiple epochs at least
[43:55] not yet uh because we have still kind of
[43:58] enough data um so yeah this is all the
[44:01] same Trend which is increase compute
[44:03] decrease
[44:04] loss yes have we seen the numbers for
[44:07] the last two years or is it still
[44:10] holding it is still holding I I don't
[44:14] have like good numbers to show you uh
[44:16] but it is still holding
[44:20] surprisingly yes is there no evidence
[44:22] like empirical evidence that you
[44:25] plateau expected PL
[44:28] no empirical evidence of plateauing
[44:30] anytime soon um why we don't know um
[44:36] will it happen probably I mean it
[44:38] doesn't need to because it's actually in
[44:39] log scale so it's not like as if it had
[44:43] to go it had to Plateau like
[44:45] mathematically it could continue
[44:46] decreasing like this I mean most people
[44:48] think that it will probably Plateau at
[44:49] some point we don't know
[44:51] when um okay so that's I'll talk more
[44:55] about scaling laws now
[44:57] so why are scaling laws really cool
[45:00] imagine that I give you um you're very
[45:02] fortunate I gave you 10,000 gpus for
[45:04] this month what model will you train how
[45:07] do you even go about answering that
[45:08] question and I mean this is a a
[45:11] hypothetical but that's exactly what
[45:13] these companies are faced with uh the
[45:16] old pipeline um which was basically you
[45:19] tune hyperparameters on the big models
[45:21] so let's say I have 30 days I will train
[45:24] 30 models for one day each I will pick
[45:27] the best one uh and that will be the
[45:29] final model that I will use in
[45:31] production um that means that the model
[45:33] that I actually used was only trained
[45:34] for one day the new pipeline is that you
[45:38] first find a scaling recipe so you find
[45:40] something that tells you for example oh
[45:43] like one common thing is that if you
[45:44] increase the size of your model you
[45:46] should decrease your learning rate so
[45:47] you find a scaling recipe such that you
[45:49] know if I increase the the the the size
[45:52] of my model here's what I should do with
[45:53] some hyperparameters then you tune your
[45:56] hyperparameters
[45:57] on smaller models of different sizes
[46:00] let's say I will say for 3 days of my 30
[46:03] days I will train many different models
[46:05] and I would do hyperparameter tuning
[46:07] on these small models each of different
[46:08] sizes then I will fit a scaling law and
[46:11] try to extrapolate from these smaller
[46:14] models which one will be the best if I
[46:17] if I train it for much longer or sorry
[46:20] if I train it for a larger model and
[46:23] then I will train the final huge model
[46:24] for 27 days instead of just one day
[46:27] um so the new pipeline is not train
[46:31] things or do hyperparameter tuning on the
[46:33] real scale of the model that you're
[46:34] going to use in practice but do things
[46:36] on smaller ones at different scales try
[46:39] to predict how well they will perform
[46:41] once you make them bigger I will give I
[46:43] will give you a very concrete example
[46:45] right now uh let's say Transformers
[46:48] versus LSTMs let's say you have
[46:50] these 10,000 GPUs you're not sure
[46:52] which one you should be using should I
[46:53] be using a Transformer-based model or an LSTM-
[46:55] based model what I will do is I will
[46:57] train Transformers at different scales
[47:00] so here you see different parameters on
[47:01] the x-axis y-axis is my test loss I will
[47:04] then train different LSTMs at
[47:07] different scales once I have these
[47:09] points I will see oh it kind of fits a
[47:11] scaling law I will fit my scaling law
[47:13] and then I will be able to predict oh if
[47:16] I had 10 times more compute here's how
[47:19] well I would perform for the LM it's
[47:21] actually slightly less linear for the
[47:22] lstm but like you could probably try to
[47:25] predict where you would end up and
[47:26] clearly from this plot you would see
[47:28] that Transformers are better um one
[47:31] thing to notice when you read these type
[47:32] of scaling laws is that there are two things
[47:34] that are important uh one is really your
[47:38] scaling rate uh which is kind of the uh
[47:42] the slope of the the slope of the
[47:45] scaling law the other thing is your um
[47:48] your intercept like you could start
[47:51] worse but actually become better over
[47:53] time it just happens that lstms are
[47:55] worse for both uh but I could show you
[47:57] another one where things you can predict
[47:59] that actually after a certain scale
[48:01] you're better off using that type of
[48:03] model than others uh so that's why
[48:05] scaling laws are actually really
[48:07] useful any questions on
[48:11] that yeah so these are all kind of very
[48:15] how how sensitive are these to like
[48:17] small differences in the architecture
[48:18] like one one like Transformer
[48:21] architecture versus another Transformer
[48:23] architecture you basically have to like
[48:25] fit your own curve and basically
[48:27] say like oh scaling laws tell me
[48:28] there should be some like logarithmic
[48:30] function let me extrapolate that for my
[48:34] own yeah so uh usually for example if
[48:37] you're an academic and you want to now
[48:39] at least that's like pretty recent and
[48:41] you want to propose a new like
[48:42] activation uh that's exactly what you
[48:44] will do you will fit a scaling law show
[48:46] another scaling law with the standard
[48:48] like I don't know G and you will say
[48:50] that it's better in reality once you
[48:51] start thinking about it in scaling laws
[48:53] terms you really realize that actually
[48:55] all the architecture differences that we
[48:57] can make like the small minor ones all
[48:59] they do is maybe change a little bit the
[49:01] The
[49:02] Intercept but really that doesn't matter
[49:05] uh because just train it for 10 hours longer
[49:07] or like wait for the next uh for the
[49:09] next GPUs and these things are
[49:11] really secondary which is exactly why I
[49:12] was telling you originally people spend
[49:14] too much time on the architecture and
[49:15] losses um in reality these things don't
[49:18] matter as much data though if you use
[49:20] good data you will have much better
[49:22] scaling laws than if you use bad data so
[49:24] that really matters
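The slope and intercept discussion a little earlier can be made concrete: once two scaling laws are fitted as straight lines in log-log space, the scale at which one overtakes the other is just where the lines intersect. This is a hedged sketch; the function name and the two sets of coefficients are invented for illustration and do not describe Transformers, LSTMs, or any real dataset.

```python
def crossover_log10_compute(intercept_a, slope_a, intercept_b, slope_b):
    """Return log10(compute) where two log-log scaling lines meet, or None.

    Each line is log10(loss) = intercept + slope * log10(compute).
    """
    if slope_a == slope_b:
        # Parallel lines never cross; the lower intercept wins at every scale.
        return None
    return (intercept_b - intercept_a) / (slope_a - slope_b)


# Invented coefficients: B starts worse (higher intercept) but improves
# faster (steeper negative slope), so it overtakes A past the crossover.
intercept_a, slope_a = 1.0, -0.05
intercept_b, slope_b = 1.2, -0.07

log_c = crossover_log10_compute(intercept_a, slope_a, intercept_b, slope_b)
print("lines cross at log10(compute) =", log_c)
```

A change that only shifts the intercept moves the whole line up or down, while a change in slope is what decides the winner at large scale.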
[49:27] uh another really cool thing you can do
[49:28] with scaling laws is that you can ask
[49:30] yourself uh how to optimally allocate
[49:33] training resources should I train larger
[49:36] models because we saw that it's better
[49:38] when you train larger models but we saw
[49:40] that it's also better when you use more
[49:41] data so which one should I do should I
[49:44] just train on more data a smaller model
[49:45] or should I train a larger model on less
[49:47] data um so Chinchilla is a very famous
[49:52] paper that first showed this uh the way
[49:54] they did it I want to give you a little
[49:55] bit of a sense of what these plots are
[49:58] uh here you see training loss again on
[50:00] the x-axis you see parameter parameter
[50:02] differences uh sorry parameter size uh
[50:04] number of parameters so the size of the
[50:05] model and here all these curves are what
[50:07] we call isoFLOPs which is that all the
[50:11] models on this curve have been trained
[50:14] with the same amount of
[50:16] compute um the way that you do that is
[50:18] that you train you change sorry you vary
[50:20] the number of tokens that we trained on
[50:22] and the size of the models but you vary
[50:24] in such a way that the total compute is
[50:26] constant
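One way to picture a single isoFLOP curve is to fix a compute budget and trade model size against tokens. The sketch below uses the rule of thumb given later in this lecture, FLOPs ≈ 6 × parameters × tokens; the budget, the list of model sizes, and the function name are assumptions chosen for illustration, not values from the Chinchilla paper.

```python
def tokens_for_budget(flops_budget: float, n_params: float) -> float:
    """Tokens affordable at a fixed budget, assuming FLOPs = 6 * params * tokens."""
    return flops_budget / (6 * n_params)


# Hypothetical budget and model sizes; every pair below costs the same compute.
budget = 1e21
model_sizes = [1e8, 3e8, 1e9, 3e9]

iso_flop_points = [(n, tokens_for_budget(budget, n)) for n in model_sizes]
for n_params, n_tokens in iso_flop_points:
    print(f"{n_params:.0e} params -> {n_tokens:.2e} tokens")
```

Training one model per point and recording its loss would trace out one curve; repeating this for several budgets gives the family of curves described above.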
[50:27] okay so all these curves that you see
[50:28] with different colors have different
[50:30] amounts of compute that they were trained on
[50:32] then you take the best one for each of
[50:34] those curves once you have the best one
[50:36] for each of those curves um you can ask
[50:40] you can plot um how much flops it was
[50:44] and which curve were you on and how much
[50:46] parameters did you actually use for
[50:49] training that specific point you put
[50:51] that on the on the log log uh scale
[50:54] again and now you fit a scaling law
[50:56] again so now I have something which
[50:58] tells me if I want to train a model with
[51:01] 10^23 FLOPs here's exactly the number
[51:04] of parameters that I should be using
[51:07] 100B and you can do the same thing with
[51:09] FLOPs and
[51:10] tokens so now you can predict if if I
[51:14] tell you exactly I have one month of
[51:16] compute what size of model should I be
[51:18] training fit your scaling law and I tell
[51:20] you um of course that all looks
[51:23] beautiful in reality like there's like
[51:25] there's a lot of like small things of
[51:26] like should you be counting like
[51:27] embedding parameters like there's
[51:29] there's a lot of complexities but if you
[51:31] do things well these things actually do
[51:34] hold um so the optimal number of
[51:37] parameters that the Chinchilla paper
[51:39] found is to use 20 tokens for every
[51:42] parameter that you train uh so if you
[51:44] add one more parameter you should add
[51:45] you should train your thing on your
[51:47] model on 20 more tokens so one caveat
[51:50] here is that this is optimal training
[51:52] resources so that is telling me if you
[51:54] have 10^23 FLOPs
[51:57] or if you have like 100 I don't know how
[51:58] much that is $100 million or 10 no that's
[52:02] much less actually let's say I have $5
[52:03] million to to train my best model that
[52:06] gets the lowest loss how how what would
[52:08] I train on in reality these companies
[52:11] need to think about inference also if
[52:13] you have a smaller model they will spend
[52:16] less over time um so actually if you
[52:18] consider the inference cost you have
[52:20] other papers that tried to show that um
[52:22] it's around
[52:24] 150 uh parameters per sorry tokens per
[52:27] parameter because you prefer having a
[52:29] smaller model cuz over time you're going
[52:32] to you're going to actually um spend
[52:35] less money on inference of these models
[52:37] so 150 to one that's around what the
[52:40] best models are trained on right now at
[52:43] least the ones that are that are used um
[52:46] in practice for in
[52:49] production
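Given a tokens-per-parameter ratio, a compute budget can be split into a model size and a token count. This is a back-of-the-envelope sketch that assumes the FLOPs ≈ 6 × parameters × tokens approximation given later in the lecture and treats the ratio as fixed; the budget value and the function name are hypothetical, and the 20 and 150 ratios are the rough figures quoted above.

```python
import math


def allocate(flops_budget: float, tokens_per_param: float):
    """Split a budget into (params, tokens) under FLOPs = 6 * params * tokens."""
    # Substituting tokens = ratio * params gives budget = 6 * ratio * params^2.
    n_params = math.sqrt(flops_budget / (6 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


budget = 1e24  # hypothetical training budget in FLOPs
for label, ratio in [("training-optimal, about 20 tokens per parameter", 20),
                     ("inference-aware, about 150 tokens per parameter", 150)]:
    n_params, n_tokens = allocate(budget, ratio)
    print(f"{label}: {n_params:.2e} params, {n_tokens:.2e} tokens")
```

At the same budget, the higher ratio gives a smaller model trained on more tokens, which is the trade the lecture describes for reducing inference cost.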
[52:51] great any question on
[52:55] Chinchilla great oh sorry in practice how
[52:58] expensive is inference for these models
[53:01] relative to
[53:02] training actually very expensive uh I will
[53:05] not talk about inference because that
[53:06] would be another entire lecture but just
[53:09] think about ChatGPT where they have I
[53:12] don't know how much it is now like 600
[53:14] million people that used it um like
[53:19] that's a lot
[53:21] um yeah so it's actually very expensive
[53:24] there's a lot of optimization you can do
[53:26] for inference though um and that's an entire
[53:28] other lecture so I'm going to skip that
[53:30] uh this time but it's very
[53:32] interesting okay tuning um as I said
[53:35] there are many things that you can uh
[53:37] answer with scaling laws I just try to
[53:39] give you two examples uh but really
[53:41] there are many things what data do you
[53:43] use what mixture what data mixing
[53:45] weighting you use data mixtures that's
[53:47] what we talked about before uh what
[53:49] architecture you use whether you should
[53:51] make your models uh wider or deeper um
[53:54] should you be paying for more gpus or
[53:56] actually collecting more data um all
[53:59] these things are things you can try to
[54:00] answer with scaling
[54:02] laws one thing I want to say is the Bitter
[54:05] Lesson if you ever heard of Richard
[54:07] Sutton a very famous blog post in 2019
[54:10] um what he realized uh which I think not
[54:15] enough people realize I
[54:17] definitely did not realize at that time
[54:19] um is that once you see these type of
[54:21] scaling laws you know that the more
[54:23] compute you have the better models you
[54:25] will get so with scale you will get
[54:27] better models and you also know by Moore's law
[54:30] or these type of variants of Moore's law that
[54:32] you will always have better compute then
[54:34] the only thing that matters is just to
[54:37] have architectures that can leverage
[54:39] computation so what matters is basically
[54:41] systems data and less so the
[54:44] architecture like the small architecture
[54:46] differences like your your your
[54:47] activation and things like this uh so I
[54:50] think that's like one of the reasons why
[54:51] most of research focuses on um some
[54:54] things that for industry matters less
[54:56] and I was one of those researchers for a
[54:59] large part of my my career um so don't
[55:02] spend time over complicating do the
[55:05] simple things do it well scale them
[55:08] that's really what OpenAI taught us with
[55:10] um with ChatGPT and with all the GPTs
[55:14] before okay I want to give you some
[55:16] back of the envelope computation so I
[55:19] might be off by a few factors here but I
[55:20] just want to give you a sense of how
[55:22] costly it is to train some of these
[55:24] models I'll give as an example
[55:26] Llama 3 400B which is currently the best
[55:29] open source model that you can get uh it
[55:32] was trained on 15.6 trillion tokens it has 405
[55:36] billion parameters so just now that you
[55:38] know what is like this uh optimal tokens
[55:41] per parameter that's around 40 so that's
[55:43] a little bit more than chinchilla but
[55:45] less than this like inference uh optimal
[55:49] um model so they went for training
[55:52] optimality uh flops for this model so
[55:55] one simple uh way to compute flops is
[55:57] six uh times the number of parameters
[56:00] times the number of data you train on uh
[56:03] so if you do the simple calculation here
[56:04] it's 3.8e25 flops the reason why this
[56:08] is important is that if you follow the
[56:10] little bit the news there's an executive
[56:11] order from Biden that basically says
[56:13] that once you have uh 1e26 parameters
[56:17] uh sorry flops uh then you have special
[56:20] scrutiny on your models so they went 2x
[56:22] less than that so they really went right
[56:24] below this to not have special scrutiny
[56:27] so 3.8e25 uh I might be off by a little bit
[56:29] but it's definitely under the
[56:34] 1e26 oh um so P is parameters N is
[56:40] data number of tokens this is a uh this
[56:43] is just an
[56:44] approximation we
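The compute estimate above can be checked with a few lines of arithmetic. This is only a sketch of the 6 x P x N approximation from the lecture, taking 405 billion parameters and 15.6 trillion tokens as the inputs; the variable names are made up for illustration.

```python
# Back-of-the-envelope training compute, using the figures quoted in the lecture.
params = 405e9    # P: number of model parameters
tokens = 15.6e12  # N: number of training tokens

# Ratio compared against Chinchilla-style "optimal tokens per parameter".
tokens_per_param = tokens / params

# Approximate training FLOPs: 6 * P * N (an approximation, as the lecture stresses).
train_flops = 6 * params * tokens

# FLOPs level the lecture mentions as triggering special scrutiny.
scrutiny_threshold = 1e26
headroom = scrutiny_threshold / train_flops

print(f"tokens per parameter: {tokens_per_param:.1f}")
print(f"training FLOPs:       {train_flops:.2e}")
print(f"threshold / FLOPs:    {headroom:.1f}x")
```

Under these inputs the ratio comes out a bit under 40 tokens per parameter and the total stays below the 1e26 level, which is consistent with the numbers given in the talk.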
[56:47] yeah okay uh compute and we know that
[56:51] they trained on 16,000
[56:53] H100s um and we know the throughput but
[56:56] they they said it too uh so if you do
[56:59] the computation it takes around 70 days
[57:02] um or 26 million GPU hours at least
[57:05] that's with my uh back of the envelope
[57:07] computation they actually said that they
[57:09] use 30 million instead of 26 million GPU
[57:12] hours um so maybe they had like some uh
[57:16] some challenges I don't really know but
[57:18] if you follow the simple computation
[57:20] it's around 70 days um cost uh I mean
[57:24] this it's hard to to approximate but I'm
[57:27] just going to say it's kind of the rent
[57:29] like what if I were to rent H100s that
[57:32] many H100s for that many days how much
[57:35] will I pay uh a lower bound on the
[57:38] renting uh cost of an H100 is around
[57:41] $2 per hour so if you
[57:44] multiply this by 26 million uh hours uh
[57:48] you get 52 million uh dollars so they
[57:51] probably pay less than that but not
[57:53] actually much less because all these um
[57:57] all these services that actually rent
[57:58] gpus they don't make that much money so
[58:00] it's it's probably slightly less but not
[58:02] that much less um now salary I said 50
[58:06] employees 500k per
[58:09] year say yeah it's probably the right
[58:11] ballpark 25 million uh so if you put all
[58:14] together around 75 million um dollars
[58:17] for
[58:18] training uh this Llama model I'm
[58:21] probably off by like 10 million but
[58:23] that's kind of the right uh ballpark
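The time and cost estimate above can be reproduced roughly as follows. This is a sketch, not the speaker's actual spreadsheet: the lecture says the throughput is known but does not quote it in this passage, so the 400 TFLOP/s per GPU used here is an assumption, chosen because it lands near the 70 days and 26 million GPU hours mentioned. All other inputs are the lecture's figures.

```python
# Rough training time and cost, following the lecture's back-of-the-envelope steps.
train_flops = 6 * 405e9 * 15.6e12  # 6 * P * N
num_gpus = 16_000                  # H100s, as stated in the lecture
flops_per_gpu_per_s = 400e12       # ASSUMPTION: sustained throughput per GPU
usd_per_gpu_hour = 2.0             # lower-bound rental price from the lecture
num_employees = 50                 # lecture's guess
salary_usd_per_year = 500_000      # lecture's guess

seconds = train_flops / (num_gpus * flops_per_gpu_per_s)
days = seconds / 86_400
gpu_hours = num_gpus * seconds / 3_600

compute_cost = gpu_hours * usd_per_gpu_hour
salary_cost = num_employees * salary_usd_per_year
total_cost = compute_cost + salary_cost

print(f"training time: {days:.0f} days")
print(f"GPU hours:     {gpu_hours / 1e6:.1f} million")
print(f"compute cost:  ${compute_cost / 1e6:.0f} million")
print(f"total cost:    ${total_cost / 1e6:.0f} million")
```

Changing the assumed throughput shifts everything proportionally, which is one reason the reported 30 million GPU hours can differ from this estimate.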
[58:27] carbon emitted um a lot of people might
[58:30] ask like also the cost is not the only
[58:32] thing that is important so I did the
[58:34] computation um it's around 4 uh 4,000 um
[58:40] tons of CO2 equivalent that is actually
[58:43] only 2,000 return tickets from JFK to uh
[58:46] London so right now uh carbon emitted is
[58:50] actually not uh I mean it's huge but
[58:52] it's not like um meaningful yeah yet I
[58:56] think in maybe GPT-6 GPT-7 once you
[59:01] multiply this by 100 that might become a
[59:03] real issue right now it's still not uh I
[59:06] think um an issue in the grand scheme of
[59:08] things next model the way you should be
[59:11] thinking about these models is that
[59:12] every new generation the number of flops
[59:15] essentially uh multiplies 10x or at
[59:17] least that's what they try uh if they
[59:19] have enough energy and if they can buy
[59:21] enough
[59:22] gpus uh great any question on these back
[59:25] of the envelope math
[59:29] no
[59:31] okay so now we talked about pre-training
[59:34] I wanted to also chat about systems
[59:36] because now we know compute is really
[59:38] important so there's a question of how
[59:39] do you optimize the how do you optimize
[59:42] your compute I will leave that for the
[59:44] end because I'm not sure how much time
[59:45] we will have I think it's important but
[59:47] hopefully I I'll be able to to talk
[59:49] about it later it's slightly different
[59:51] than what we've been talking about right
[59:53] now so I'll move on to post training for
[59:55] now
[59:56] so the task of post training uh the
[59:59] reason why we need to do Post training
[60:01] is as I told you before um it's to make
[60:04] AI assistants so language modeling is
[60:07] not uh really the thing that you want
[60:10] when you have an AI assistant uh for
[60:12] example if you ask GPT-3 which is a
[60:15] pure language
[60:17] model not a um not an aligned one if you
[60:20] ask a question like explain the moon
[60:22] landing to a
[60:23] six-year-old the completion that you
[60:25] would get is something like explain the
[60:27] theory of gravity to a six-year-old
[60:29] because what it learned is that on on on
[60:31] internet if you have one question you
[60:33] usually have maybe another bullet point
[60:35] of other similar questions you don't
[60:37] usually have question and then answer
[60:38] later uh this is not what you want from
[60:41] an AI assistant so how do we uh do this
[60:45] alignment which is this post training
[60:46] and making these models
[60:48] assistants um so the goal of this
[60:51] alignment is to basically get LLMs to follow
[60:54] the instructions that are given um by
[60:56] users and maybe some designers'
[61:00] desires um so think about moderation
[61:03] you don't want the model like OpenAI
[61:05] definitely doesn't want the model to say
[61:07] stuff that is very
[61:08] toxic um so here you see on the left
[61:11] hand side uh that when you ask a
[61:13] question it actually provides a a real
[61:15] answer so it's not like uh before the
[61:17] llm and on the right hand side you see
[61:19] that it would if you ask to write a
[61:21] tweet describing how a certain part of
[61:25] the population are evil it will say that
[61:27] it cannot do that um so that's kind of
[61:31] this
[61:31] alignment uh the background here is that
[61:35] uh basically the data that you want for
[61:38] training some of these models um is like
[61:41] we know what we want which is just
[61:43] asking humans this is a question this is
[61:44] the answer that you want uh but the
[61:46] thing is that it's very expensive to
[61:48] collect that data and it's hard to find
[61:49] it online uh in contrast pre-training
[61:52] data is not what you want but there's a
[61:55] lot of it um so what we will do
[61:57] the main idea is simply take a pre-trained
[62:00] large language model pre-trained on all of
[62:02] internet and then you just fine tune so
[62:03] you just change a little bit of weights
[62:05] on the type of data that you actually
[62:06] want and hopefully given that you already
[62:08] pre-trained it on all of internet it
[62:10] basically learns or knows how to speak
[62:12] in English and knows a standard um
[62:16] language syntax uh then you can really
[62:19] fine tune it with very little
[62:22] data okay SFT so supervised fine tuning is
[62:26] really exactly what I just said which is
[62:27] the idea of fine-tuning the large
[62:29] language model on uh basically the
[62:32] desired answers that are collected from
[62:34] humans um so why is it called supervised
[62:37] fine tuning because you basically want
[62:38] to do language modeling on the real
[62:41] answers so language modeling is this like
[62:42] next word prediction and and that's the
[62:44] fine-tuning part and then you want to do
[62:46] it on desired answers given by humans so
[62:48] that's why we call it
[62:50] supervised so how do we collect this data
[62:52] well we I just said it you just ask
[62:54] humans uh to to tell you this is the
[62:56] this is a question this is the answer
[62:58] that you uh you would want from some of
[63:00] these models so this is an example um
[63:03] sorry I can't read very well on my
[63:04] computer but uh my kid uh needs to do a
[63:07] science um no let's read this one can
[63:09] you write a short introduction about the
[63:12] relevance of the term monopsony and then
[63:14] it says monopsony refers to a market
[63:15] structure blah blah blah and that's a
[63:16] human that wrote that um so actually
[63:19] this is Open Assistant which was a way
[63:22] to collect um uh data online by
[63:27] humans so this type of supervised fine
[63:30] tuning or alignment is really the key of
[63:32] ChatGPT this is what made uh the big
[63:35] jump from GPT-3 which was mostly
[63:37] something that was known by AI
[63:39] researchers to ChatGPT which became
[63:42] known by basically
[63:44] everyone
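The SFT step described above is ordinary next-token prediction, just run on human-written answers. A minimal sketch of that loss follows, in plain Python with no ML library. Computing the loss only on answer tokens (masking out the prompt) is a common convention assumed here; the lecture does not say whether the prompt is masked, and the function names are hypothetical.

```python
import math

def softmax(logits):
    """Turn a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sft_loss(logits_per_position, target_ids, is_answer_token):
    """Mean negative log-likelihood of the target tokens.

    Only positions flagged in is_answer_token contribute
    (ASSUMPTION: prompt tokens are masked out of the loss).
    """
    total, count = 0.0, 0
    for logits, target, in_answer in zip(logits_per_position, target_ids, is_answer_token):
        if not in_answer:
            continue
        total += -math.log(softmax(logits)[target])
        count += 1
    if count == 0:
        raise ValueError("no answer tokens to train on")
    return total / count

# Toy example: vocabulary of 3 tokens, 3 positions, the first one is the prompt.
toy_logits = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 3.0]]
toy_targets = [0, 1, 2]
toy_mask = [False, True, True]
toy_loss = sft_loss(toy_logits, toy_targets, toy_mask)
print(f"toy SFT loss: {toy_loss:.4f}")
```

The same function with every mask entry set to True is the pre-training loss, which matches the point made later in the lecture that the loss itself does not change between the two stages.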
[63:46] um so the problem with uh human data is
[63:51] that it's uh very slow to collect and
[63:54] very expensive um so
[63:57] one possible simple idea is to use llms
[64:01] to scale data collection uh so that's
[64:03] exactly what we did with Alpaca uh one
[64:06] year ago what we did is that we asked uh
[64:08] humans or we use a data set of human uh
[64:11] question answers so there were 175 uh
[64:14] question answers here and we asked the
[64:15] best model at the time so text-davinci-003 to
[64:18] basically generate many more of these
[64:21] question and answers so all we did is
[64:23] like this is what humans would write now
[64:25] write similar answers and similar
[64:26] questions and we collected 52,000 LLM
[64:30] generated question answers and then what
[64:32] we did is simply we took LLaMA 7B which
[64:34] was the best pre-trained model at the time
[64:36] and we just fine-tuned this with
[64:38] supervised fine tuning as I told you and
[64:40] that's how we got um the Alpaca 7B
[64:43] model uh and this is the type of data
[64:46] that we collected so things like what
[64:48] does algorithm mean an algorithm is a
[64:50] step-by-step uh set of
[64:52] instructions used to solve a problem or
[64:54] achieve a goal blah blah blah blah so
[64:56] the data is actually
[64:58] pretty good given it was generated by
[65:00] LLMs from essentially two generations ago
[65:04] um so that really started at least for
[65:07] us kind of as an academic replication of
[65:09] ChatGPT uh now there's really a big
[65:12] field of like synthetic data generation
[65:14] of how to use llms to basically make
[65:18] development of llms faster um and by
[65:21] basically by decreasing the amount of of
[65:23] human hours that you need
[65:27] quantity of data so we talked about what
[65:29] type of data and how we collect it um
[65:31] one thing which is surprising with sft
[65:33] is that you don't need that much data uh
[65:36] so what this paper showed this is called
[65:38] LIMA is that if you scale
[65:41] the amount of data that you use for uh
[65:43] supervised fine tuning from 2,000 to
[65:45] 32,000 it really doesn't help much so
[65:48] here scaling laws definitely don't help
[65:50] um so the the intuition here is that all
[65:53] you learn um is is you learn how to
[65:56] format your desired answers another way
[65:59] of saying it is that your pre-trained
[66:01] models they essentially model the
[66:03] distribution of every user on internet
[66:05] one that might write bullet points
[66:07] another one that might answer a
[66:09] question with an answer so all you tell
[66:11] your model is like wait you should
[66:13] actually be optimizing more for this
[66:15] type of user than another one so you're
[66:17] not actually teaching it
[66:19] anything through this um SFT uh
[66:23] so supervised fine tuning all you do is
[66:25] you tell the model to kind of optimize
[66:27] for one type of user that it saw already
[66:29] in a pre-train data set so the knowledge
[66:31] is already in the pre-train llm uh and
[66:33] you basically just specialize to one
[66:35] type of
[66:36] user great any question on
[66:40] sft yes so I know it's a big issue with
[66:44] synthetic data where uh if you keep
[66:47] generating data from the same
[66:49] distribution eventually you're not
[66:50] learning a new distribution you're
[66:51] essentially playing with it it just
[66:52] bootstrapping that yeah surely
[66:56] you can't scale that forever right you
[66:57] can't keep going on and generating from
[66:59] the same distribution you hope to learn
[67:01] something new yeah uh so are there it's
[67:03] an active area of research but any
[67:05] thoughts that you have around how people
[67:07] are maybe thinking around this and uh
[67:10] better ways to bootstrap or to give up
[67:12] on this idea and and realize that the
[67:14] chart shows you don't need that many so
[67:16] just get humans to generate 2,000 really
[67:18] good uh yeah so that's a very good
[67:20] question uh so for the data stuff so I'm
[67:23] saying it's not that important for sft
[67:24] but there will be another thing we'll
[67:25] talk about right after where actually
[67:28] data does
[67:29] matter my intuition based on not that
[67:32] much empirical results is that you can
[67:34] still get um even though you use your
[67:37] LLMs if you use purely LLM generated text
[67:40] and you do that for like three four
[67:42] generations of llms I agree with you
[67:43] that probably you won't improve much but
[67:46] for me what is important is how do you
[67:47] use like human in the loop with llms not
[67:50] purely LLMs not purely uh humans but
[67:53] maybe what you can do is just have the
[67:54] model generate some new text and just uh
[67:57] humans write a few Edits edits are much
[67:59] faster than writing the entire text and
[68:02] I think that if you have that type of
[68:03] collaboration then from like kind of an
[68:05] information theoretical point of view
[68:07] you still get additional information but
[68:09] you're still much faster than if you use
[68:11] humans and I think that as a field we'll
[68:13] probably move towards these type of
[68:14] things uh which is um really just
[68:17] finding the examples that are important
[68:19] and and asking humans it's kind of
[68:21] active learning just asking humans
[68:22] exactly when uh you need to to get
[68:27] inputs yes do we train with like the
[68:29] same loss function the same like General
[68:32] training algorithm for the supervised
[68:33] fine tuning bit as we do for the for the
[68:35] pre-training right because like the
[68:37] examples you showed I think the the
[68:39] important thing of the good examples is
[68:43] they're like supera accurate there's
[68:45] these more complex still just like chain
[68:48] same so that's why here I yeah I didn't
[68:51] maybe didn't emphasize enough this is
[68:52] just language modeling fine tune the LLM
[68:54] with language modeling on the desired
[68:56] answers so this is literally the same
[68:57] loss um it will be different in two
[69:01] seconds but the first step of sft is
[69:03] literally the same loss where you just
[69:05] say Okay I want to actually specialize
[69:07] on that type of data so there's even a
[69:09] question of like what is pre-training
[69:10] what is post-training because in reality
[69:12] it's just like a different data that you
[69:13] use the reason why we usually call it
[69:15] post training is that the way we collect
[69:16] that data is very
[69:18] different great great questions uh yes
[69:22] maybe it's the same question but why
[69:23] would these 2,000 examples have such an
[69:26] overweighted
[69:28] influence you tun so that's why we uh
[69:31] also that's another reason why we call
[69:33] it post training is that we use
[69:34] different type of hyper parameters so
[69:35] you know I told you basically at the end
[69:37] of pre training you essentially end up
[69:38] with a learning rate of zero and here
[69:40] you're going to increase your learning
[69:41] rate so like 1e-5 1e yeah and and
[69:44] so um the weight that you give to them
[69:47] is actually
[69:49] different
[69:51] um okay uh Second Step or second part of
[69:56] this post training um is what we call
[69:59] reinforcement learning from Human
[70:00] feedback or RLHF uh some of you might
[70:03] have heard of that um the idea is that
[70:06] sft has a problem namely that uh you do
[70:09] behavioral cloning which means that you
[70:11] just try to clone what the humans would
[70:14] say and that has many issues
[70:16] one of them is that you're bound by
[70:18] human abilities so if um like humans
[70:23] actually humans won't generate the
[70:26] things that they think is actually the
[70:27] best thing to generate so if you ask me
[70:29] to write a book I mean I can definitely
[70:31] enjoy a book I can probably say one book
[70:33] is better than another but I'm
[70:34] definitely not going to be as good as
[70:36] writing the book that I want to read uh
[70:38] so you're going to be bound by the human
[70:39] ability to generate things even though
[70:41] the humans might be better at
[70:42] distinguishing between things that's one
[70:44] issue issue number two uh I find that
[70:46] actually pretty interesting is that it
[70:48] might if you ever heard of the word
[70:50] hallucination so this is LLMs generating
[70:53] like false information
[70:56] hallucination people have um
[70:58] hypothesized that that can come from the
[71:00] supervised fine tuning even if you do
[71:02] supervised fine tuning on data that is
[71:05] correct and the reason why that is is
[71:08] that uh given I told you that
[71:10] basically SFT is with very little data
[71:13] and it's with data from which the
[71:15] model doesn't learn anything new so what
[71:17] if the human gives an answer that the
[71:20] model didn't know was true from the
[71:23] model perspective the human
[71:25] basically is telling the model uh
[71:28] generate this thing that seems plausible
[71:31] but it actually has no idea if it's true
[71:32] or not um so just to give you a very
[71:35] concrete example if we go back to this
[71:37] uh monopsony example can you write blah
[71:39] blah blah about monopsony uh imagine
[71:42] that a human uh wrote a reference on
[71:44] this type of book um and that book might
[71:47] exist that might be a correct reference
[71:49] but what if the llm never saw this
[71:51] reference during pre-training then it
[71:53] doesn't know that it's a correct
[71:54] reference so really what you tell the
[71:55] model is to generate or make up some
[71:58] plausible-sounding reference um rather
[72:01] than actually tell the real reference
[72:03] that it saw during pre-training uh so
[72:06] hallucination might be um uh
[72:10] might be caused by this SFT that's
[72:12] problem number two does that all make
[72:14] sense great problem number three price
[72:18] generating the ideal answers is very
[72:21] pricey and that comes back to your
[72:22] question um of like humans writing
[72:25] answers is actually pretty
[72:27] expensive um so that's where RLHF comes
[72:29] in the idea is that instead of cloning
[72:32] the behaviors of humans we're going to
[72:34] maximize human preference um and the way
[72:37] we're going to do that so the pipeline
[72:39] is that for a certain for every
[72:41] instruction you're going to ask a model
[72:43] to generate two answers um and usually
[72:47] use a pretty good model so you usually
[72:48] don't use a pure LM here you use an SFT uh
[72:52] fine tune you use a fine tuned LLM
[72:54] already to give like pretty good answers
[72:57] and then you ask labelers which of these
[72:59] two answers was better so select the
[73:01] preferred one and then with different
[73:04] type of algorithms we're going to talk
[73:05] about the algorithms um you just
[73:07] fine-tune the model to generate more of
[73:09] the green thing than the red thing so
[73:10] more of the good stuff uh so now the
[73:13] question is how and we're going to talk
[73:14] about that right
[73:16] now so there are two ways that we're
[73:19] going to talk about and two that are
[73:20] mainly used in the community um the
[73:23] first one is simply the idea of of using
[73:25] reinforcement learning so hopefully you
[73:26] all know what reinforcement learning is
[73:28] now um so when you think about using
[73:32] reinforcement learning one important
[73:33] question is like what is the reward that
[73:35] we're optimizing uh so in this case
[73:37] there are really two options that I
[73:38] could think about the first one you
[73:40] could just say I'm going to compare the
[73:42] output generated by some baseline the
[73:44] output generated by my model and I'm
[73:46] just going to ask the human to say which
[73:48] one is better and I'm going to use this
[73:51] as a reward so if I'm better than the
[73:52] Baseline this is a plus one if not it's
[73:54] a minus one uh so now it's a binary
[73:56] reward the problem with binary reward is
[73:58] that it's very sparse and you don't get
[74:00] much information out of it uh like maybe
[74:02] your answer was slightly better maybe it
[74:04] was like way better and you don't really
[74:07] know from this um how much better it was
[74:11] so option two is that you can train what
[74:13] we call a reward model which is simply a
[74:15] classifier uh so you use machine
[74:17] learning to classify how much better
[74:21] uh one output is than the other
[74:24] from the perspective of the human um so
[74:27] this is a little bit meta but what you
[74:29] basically do is that you train uh you
[74:31] take um a reward model R which is a uh
[74:34] just a large also a large um a large
[74:38] classifier and you basically ask this
[74:40] reward model you give it the input and
[74:42] the actual output that you have one of
[74:44] the two outputs uh and you just um
[74:47] exponentiate that so that's the softmax
[74:49] loss that you all know about and now you
[74:51] divide by um the exponential
[74:55] reward uh on the first example sorry on
[74:59] the first output and this is on the
[75:00] second output and you basically train so
[75:02] the reason why you do that is that you
[75:04] train your your model you train this
[75:06] reward model to be able to classify um
[75:10] how much better one output is to another
[75:13] one so another uh slightly less
[75:15] convoluted way of saying it is that your
[75:16] reward model will output some reward
[75:19] that will be used as the logits of your
[75:21] softmax so now if you have high logits
[75:25] in your softmax it means that highly
[75:27] likely this um output is
[75:31] better uh so that's what we call a
[75:33] Bradley-Terry model yes is this reward model going
[75:36] over the entire output or is it
[75:39] going um so this takes the
[75:43] entire uh yeah this takes the entire
[75:46] output at once so it takes all the input
[75:47] and all the output and it gives one
[75:49] number
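The reward model described above can be sketched in a few lines. The model gives one scalar reward per (input, output) pair, and the Bradley-Terry form turns two such rewards into the probability that one output is preferred: exp(r_a) / (exp(r_a) + exp(r_b)), which is the same as a sigmoid of the reward difference. The function names below are hypothetical, and the rewards are plain numbers standing in for what a learned classifier would output.

```python
import math

def preference_probability(reward_preferred, reward_rejected):
    """Bradley-Terry probability that the first output beats the second.

    exp(a) / (exp(a) + exp(b)) rewritten as sigmoid(a - b).
    """
    return 1.0 / (1.0 + math.exp(-(reward_preferred - reward_rejected)))

def reward_model_loss(reward_preferred, reward_rejected):
    """Negative log-likelihood of the human's choice; minimized when training the reward model."""
    return -math.log(preference_probability(reward_preferred, reward_rejected))

# Equal rewards: the model is undecided between the two outputs.
print(preference_probability(1.0, 1.0))
# Larger gap in rewards: the model is more confident the first output is preferred.
print(preference_probability(3.0, 0.0))
```

The useful property for reinforcement learning is that the reward difference is continuous, so it carries more information than the human's binary choice.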
[75:52] yes would human be sorry with the reward
[75:56] model where would a human be like oh I
[75:59] see okay sorry maybe I wasn't clear um
[76:03] you train this reward model to fit this
[76:06] green and and red preference from humans
[76:09] so basically you train a classifier to
[76:12] say whether the humans prefer red or
[76:14] green uh but instead of using the binary
[76:17] reward which is what the human would
[76:19] tell you you basically use the logits of
[76:22] the soft Max and the thing with the
[76:23] logits is that that logits are
[76:25] continuous so now you know that if your
[76:27] reward model said it has high logits
[76:30] then in some ways the human highly
[76:32] prefers this answer to some other
[76:36] answer great um so as I just said
[76:39] continuous information so it's better so
[76:41] that's what people uh use in practice or
[76:43] at least used to use in practice I'll
[76:45] tell you about uh the other algorithm
[76:47] later uh so what you do at the end is
[76:49] that you basically try to just use
[76:51] reinforcement learning that you know
[76:53] about now we know we have reward what
[76:55] you sample through is the generation
[76:58] from your large language model um and
[77:00] then you just use some regularization
[77:01] term so the reason why you do this
[77:03] regularization term is for avoiding what
[77:05] we call over optimization so this reward
[77:07] model might not really represent like
[77:09] might not perfectly model human
[77:11] preferences so you don't want to
[77:12] maximize this thing to essentially
[77:15] infinity um and you do it using uh PPO
[77:19] which is a common uh reinforcement
[77:22] learning algorithm um one thing to note
[77:25] here because it will be important for
[77:26] later is that when we use maximum
[77:29] likelihood
[77:31] um sorry now the large language models
[77:35] are actually a policy for your
[77:37] reinforcement learning it's not
[77:39] maximizing maximum likelihood anymore
[77:41] which means that you're not modeling any
[77:42] distribution anymore and the reason why
[77:44] this is important is that models that
[77:46] went through this type of PPO actually
[77:49] don't give you likelihoods of text that
[77:51] are meaningful cuz what you optimize
[77:53] them to do is basically just optimize
[77:55] for generating the most likely thing not
[77:58] optimize for modeling like all the
[78:00] answers that humans might say another
[78:02] way of saying that is that there's
[78:04] nothing that incentivizes here the model
[78:06] to not give a like a um a single
[78:10] possible generation nothing here says
[78:13] it's good if you have some distribution
[78:15] with some
[78:16] entropy um okay if you haven't followed
[78:18] it's not that important but just good to
[78:21] know great so PPO is exactly what ChatGPT
[78:26] did originally so here on the blog
[78:28] post what they have is step one do
[78:32] supervised fine tuning which now you
[78:33] all know about step two train a reward
[78:36] model on human preferences step three do
[78:39] PPO multiple steps which is where you see
[78:41] this this blue arrow so you continue you
[78:43] train the model once with PPO you collect
[78:45] new data you continue uh
[78:48] and that's exactly what ChatGPT did uh
[78:50] that was a big breakthrough between GPT-3
[78:53] and ChatGPT
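The regularized objective mentioned above, where you maximize the reward but add a term to avoid over-optimizing an imperfect reward model, can be sketched per sample. The lecture does not spell out the regularization term in this passage; the form below, a penalty on how far the policy's log-probability drifts from a reference model's, is the commonly used KL-style choice and is an assumption here, as are the function name and the value of beta.

```python
def regularized_reward(reward, logp_policy, logp_reference, beta=0.1):
    """Reward-model score minus a penalty for drifting from the reference model.

    ASSUMPTION: the penalty is beta * (log p_policy(y|x) - log p_reference(y|x)),
    whose expectation under the policy is a KL divergence.
    """
    return reward - beta * (logp_policy - logp_reference)

# Policy still agrees with the reference model: no penalty.
print(regularized_reward(1.0, logp_policy=-5.0, logp_reference=-5.0))
# Policy has made this output much more likely than the reference did: penalized.
print(regularized_reward(1.0, logp_policy=-2.0, logp_reference=-5.0))
```

An RL algorithm such as PPO would then maximize the expectation of this quantity over outputs sampled from the policy; that optimization loop is not shown here.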
[78:55] one thing to note is that uh PPO has many
[78:58] challenges reinforcement learning is
[79:00] something that's super nice
[79:01] theoretically in practice anyone who
[79:03] ever worked with reinforcement learning
[79:04] knows it's such a mess uh there's a lot
[79:07] of things like rollouts outer loops
[79:08] clipping so many complications um so
[79:12] it's messy this is the idealized PPO used
[79:14] for LLM settings so that's already much
[79:16] more complicated than this expectation
[79:18] we saw before and in practice it's
[79:19] actually much more complicated so we
[79:21] have one implementation of it that we
[79:22] had to do and I'm not going to go
[79:24] through it but basically you have like
[79:26] so much stuff that you have to think
[79:27] about when you implement that type of of
[79:30] uh PPO algorithm so you have clipping
[79:32] everywhere you have a lot of
[79:34] complexities and things are not well
[79:36] documented all this to say um that
[79:40] there was a new method that was
[79:41] proposed uh also from Stanford one year
[79:44] ago called DPO which is essentially a
[79:46] simplification of PPO um and the way uh
[79:51] what they did or the idea that they have
[79:53] is that instead of using reinforcement
[79:55] learning you can just maximize the
[79:57] probability of generating the stuff that
[79:58] you like and minimizing the probability
[80:00] of the stuff that you don't like uh so
[80:02] if you think about the human preference
[80:04] the red and green maximize uh green
[80:07] minimize red um so the loss is actually
[80:10] this one uh where what you see this is
[80:12] simply um some log of the model so this
[80:17] is the likelihood of a model generating
[80:18] the things that the human preferred
[80:20] given the the inputs um and what you try
[80:24] to do is basically
[80:26] maximize uh the likelihood of generating
[80:29] the things that you like minimize the
[80:31] likelihood of the things that you don't
[80:32] like um all the rest of the terms here
[80:35] it's not too important it's actually
[80:37] really not that complicated to
[80:39] understand but at a high level it's
[80:41] really just maximizing the things you
[80:42] like minimizing the the rest um and one
[80:46] thing to note uh which I was going to
[80:48] say just here is that actually all the
[80:50] rest is chosen such that um the global
[80:53] minima of PPO and the global minima of
[80:57] like this DPO under some assumptions are
[80:59] essentially equivalent so this is the
[81:02] right thing to do mathematically I'm not
[81:04] going to go through the derivations but
[81:06] that's the right thing to do uh it's
[81:08] pretty different from PPO in the sense
[81:09] that with PPO what you had to do
[81:11] is collect the human preferences then
[81:13] train a uh reward model with maximum
[81:15] likelihood then use reinforcement
[81:17] learning now all you do is basically
[81:18] maximum likelihood much simpler yes I
[81:21] mean yeah so it seems like this is a
[81:23] much simpler and B like what you just
[81:25] intuitively do if this why did they
[81:28] start with this reward model like what
[81:30] what led them doing that I think it's a
[81:32] great question uh I don't really know
[81:34] what I can tell you is that at OpenAI
[81:37] the people who did the um uh who did
[81:40] basically this PPO uh sorry who did
[81:43] ChatGPT initially are the ones who actually
[81:46] wrote PPO and I think they were just like
[81:48] there are a lot of reinforcement
[81:49] learning people and I think that for
[81:51] them it was very intuitive um so there's
[81:55] also some additional like potential
[81:57] benefits for example I don't want to
[82:01] yeah for example if you use the reward
[82:02] model uh the cool thing here with
[82:04] reinforcement learning is that you can
[82:05] use unlabeled data with the reward model
[82:08] so here you can only use the labeled data
[82:10] for doing DPO um for PPO you first
[82:14] train your reward model and then you can
[82:16] use unlabeled data uh where the reward
[82:18] model will basically label this
[82:20] unlabeled data so there's
[82:21] additional kind of potential uh
[82:25] there could be potential improvements in
[82:27] practice it happens that there are none and I
[82:29] think it's just that a lot of people in this
[82:32] team were reinforcement learning experts
[82:34] including uh the main author of PPO John
[82:37] Schulman um so much simpler than PPO and it
[82:41] basically performs as well uh so now
[82:43] this is the standard uh thing that
[82:45] people use at least in the open source
[82:47] Community I believe it's actually the
[82:48] standard also in industry so that's
[82:52] called DPO gains
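To make "maximizing the things you like, minimizing the rest" concrete, here is a minimal sketch of the DPO loss for a single preference pair. It assumes you already have the summed log-probabilities of the preferred and rejected answers under both the model being trained and the frozen reference (SFT) model. The function name, argument names, the example numbers and the default `beta` are illustrative choices, not something stated in the lecture.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, from summed token log-probabilities."""
    # How much more the policy likes each answer than the reference model does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Policy identical to the reference: the loss is log(2).
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
# Policy raised the preferred answer and lowered the other one: the loss drops.
print(dpo_loss(-10.0, -17.0, -12.0, -15.0))
```

There is no reward model and no sampling loop here: the loss is an ordinary differentiable function of log-probabilities, which is why training reduces to a maximum-likelihood-style update.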
[82:55] um so those are all the papers on the
[82:57] left here this is on a summarization
[82:59] task you see all I want to show you is
[83:01] that basically the pre-trained models uh
[83:03] were okay and they improve with scale if
[83:05] you do supervised fine tuning you
[83:07] improve them a little bit more if you do
[83:09] PPO or something with RLHF with human
[83:11] feedback you get performance that is
[83:14] oftentimes depending on the benchmark
[83:16] even better than uh humans so this is
[83:19] the human uh reference summaries same
[83:21] thing this is on a uh on a paper that we
[83:23] have AlpacaFarm
[83:25] where we see uh the evaluation here is
[83:27] not too important but basically you see
[83:28] pre-trained model you jump to sft and then
[83:31] you jump to PPO or DPO and PPO and DPO have the exact
[83:34] same
[83:35] performance so basically RLHF helps
[83:38] that's kind of the conclusion and DPO is
[83:41] simple uh data uh the way that you
[83:44] collect that type of data um first idea
[83:47] is just use humans as we already talked
[83:50] about uh guidelines are very complicated
[83:52] for what humans should be labeling and
[83:54] and it's really not that easy and
[83:55] actually if you ever do some of the
[83:57] labeling you will see that it's
[84:00] extremely complicated like if I zoom in
[84:02] to this uh here I have a question tell
[84:05] me about self-driving cars and you
[84:07] read both self-driving cars are vehicles
[84:09] that are capable of detecting their
[84:10] surroundings blah blah blah self-driving
[84:12] cars are cars that are equipped with
[84:13] sensors blah blah blah to navigate
[84:15] without the need for a driver I mean
[84:17] both seem okay like which one is better
[84:19] it's actually hard to say at a glance um
[84:22] and as a result uh the problem with
[84:23] humans is that you will start optimizing
[84:27] a lot of like high level features for
[84:28] example the second one is longer I can
[84:30] guarantee you that most humans will
[84:31] choose the second one even though I mean
[84:34] maybe the first one is better I don't
[84:35] know I haven't read it carefully so
[84:38] challenges with humans first slow and
[84:41] expensive uh second as I just mentioned
[84:44] it's hard to focus on things that matter
[84:46] like correctness and people uh usually
[84:49] look at things that don't matter as much
[84:51] like the form like length uh and as a
[84:54] result so what I show here is that uh
[84:55] when you do RLHF the more you do of RLHF
[84:58] the longer the outputs of the
[85:00] models become so if you've ever been
[85:02] annoyed at ChatGPT answering you with super
[85:04] long sentences this is because of
[85:07] RLHF um annotator distribution shift uh
[85:11] like the distribution of annotators that
[85:13] you use matters a lot and you have to
[85:15] think like who even are the
[85:17] humans that we want to represent in
[85:18] these models uh now the question is like
[85:20] crowdsourcing ethics uh like usually
[85:23] basically a lot of the
[85:25] labeling that is done um like the people
[85:28] who do them are not paid well and they
[85:30] have to go through a lot of toxic data
[85:32] uh because you basically want the model
[85:34] to avoid saying the toxic data um so
[85:37] crowdsourcing ethics
[85:39] too so many challenges with human data
[85:42] um so what we did also last year is
[85:45] again the same thing as Alpaca just the
[85:47] idea of like oh well there are challenges
[85:49] with humans maybe we can just replace
[85:50] them with llms uh so what we did is
[85:53] simply replace
[85:54] um oh I see that I'm just realizing that
[85:57] the slides are not centered anyways uh you
[86:00] replace human preferences with llm
[86:01] preferences uh so here on this uh figure
[86:04] you see on the x-axis the price that we
[86:06] paid uh for collecting human data it's
[86:09] around
[86:10] $300 for 1,000 examples and this is on
[86:13] Mechanical Turkers which are usually
[86:15] like cheaper than maybe some of the
[86:17] other um companies that you could go
[86:20] through and on the y-axis it's basically
[86:22] the agreement with uh other humans with
[86:25] the mode of other humans and what you
[86:27] see is that actually as I told you
[86:28] before labeling is really complicated
[86:30] humans agree with themselves only around
[86:33] 66% of the time on a binary task and it's
[86:36] not that the humans are not good here
[86:38] because uh we were five main authors on
[86:40] this paper we tried to label this data
[86:43] ourselves and we only had like say 67 or
[86:46] 68% accuracy even though we
[86:48] talked for like 3 hours about how we should
[86:50] be doing labeling really it's
[86:52] complicated it's not an easy task um and
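As a rough illustration of the y-axis metric described here (agreement with the mode of other humans), the sketch below computes a leave-one-out agreement rate for human annotators and the agreement of a model's labels with the human majority. The function names and the tiny label table are made up for illustration, and skipping ties is an assumption of this sketch rather than something the lecture specifies.

```python
from collections import Counter

def mode_or_none(labels):
    """Most common label, or None when the top two labels are tied."""
    top = Counter(labels).most_common()
    if len(top) > 1 and top[0][1] == top[1][1]:
        return None
    return top[0][0]

def human_agreement(labels_per_example):
    """Leave-one-out: how often one annotator matches the mode of the others."""
    agree = total = 0
    for labels in labels_per_example:
        for i, label in enumerate(labels):
            mode = mode_or_none(labels[:i] + labels[i + 1:])
            if mode is None:
                continue  # tie among the other annotators, skip
            total += 1
            agree += int(label == mode)
    return agree / total

def model_agreement(model_labels, labels_per_example):
    """How often the model's label matches the mode of all human labels."""
    agree = total = 0
    for model_label, labels in zip(model_labels, labels_per_example):
        mode = mode_or_none(labels)
        if mode is None:
            continue
        total += 1
        agree += int(model_label == mode)
    return agree / total

# Made-up binary preferences (1 = "second answer is better"), 4 annotators each.
humans = [[1, 1, 1, 1], [1, 1, 1, 0], [0, 1, 0, 1]]
print(human_agreement(humans))             # 7 of 12 leave-one-out checks agree
print(model_agreement([1, 1, 0], humans))  # third example is a tie and is skipped
```

A single noisy annotator is compared against a majority vote, so even careful humans score well below 100% on examples where opinions are split.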
[86:54] here I just showed many different models
[86:56] and um basically you see that models are
[86:58] much cheaper and they can actually get
[87:00] higher agreement with the mode of humans
[87:02] than humans themselves and the
[87:04] reason why is because humans have a lot
[87:06] of variance models have no variance so they
[87:08] might be a little bit more biased but
[87:09] have less variance uh so it works
[87:12] surprisingly well and now it's kind of
[87:14] the standard in the open source community
[87:16] I think even in industry a lot of people
[87:18] use both humans and llms for improving
[87:21] uh the collection of RLHF data
[87:24] um and this is like this is the paper
[87:26] from last year but honestly now it's
[87:28] more like that llms would be around this
[87:30] agreement and this cost so around I
[87:32] would say 50x cheaper than humans and
[87:34] better agreement with humans than humans
[87:38] themselves okay so that gets us to
[87:41] evaluation of post
[87:43] training um that goes back to your
[87:46] initial question at the beginning of the
[87:47] lecture how do you evaluate something
[87:49] like ChatGPT uh the answers that ChatGPT could
[87:51] give are basically unbounded and it's
[87:54] not that there's one right answer there
[87:56] are many answers that are just as good
[87:58] um so there are many challenges one you
[88:01] can't use validation loss because one
[88:05] method might use PPO the other one might
[88:06] use DPO validation loss is not
[88:08] comparable second you can't use Cal uh
[88:10] sorry perplexity that's the thing I told
[88:12] you before these models uh are not
[88:15] calibrated they don't give distributions
[88:17] they they just optimize for one thing so
[88:19] you can't use perplexity for actually
[88:21] evaluating uh these type of models once
[88:23] they're aligned uh third
[88:27] uh there's a large diversity of
[88:28] questions that humans might ask to these
[88:30] models generation open QA like some
[88:33] question answering some summarization
[88:35] and all of these things so there's so
[88:36] many things you have to cover um then
[88:39] the tasks are really open-ended so it's
[88:41] very hard to automate so that's what you
[88:43] were alluding to before so the idea uh
[88:46] is that instead of trying to come up
[88:48] with really easily automated uh
[88:50] benchmarks uh it's just we're going to
[88:52] ask questions that that users actually
[88:54] ask to these models in practice and
[88:56] we're just going to ask annotators to
[88:58] say between these two models which one
[89:01] is better like what's the what's the
[89:02] better output so basically do exact same
[89:04] thing as um basically the data from RLHF
[89:08] but you use it now for evaluation yes
[89:10] I'm not sure I understand what you mean
[89:12] by like can't use perplexity and not
[89:13] calibrated right like the llm is still doing
[89:16] like next token
[89:18] prediction so think about um
[89:23] the optimal solution after doing PPO is
[89:26] basically one model that gives you uh
[89:28] essentially a delta um like basically
[89:31] says that there's only one sentence that
[89:33] is that could be generated for that
[89:36] question so now if you use it on
[89:38] something that is slightly semantically
[89:39] different it would actually
[89:41] give a likelihood of zero for that
[89:43] answer so in reality it's not that
[89:45] extreme because as you say it's still a
[89:47] distribution but it just shows you that
[89:48] there's a fundamental issue
[89:50] with perplexity once these models are
[89:53] not llms anymore they were not trained
[89:56] at least with PPO they were not trained
[89:57] to do maximum likelihood anymore they
[89:59] were trained to be
[90:02] policies okay um so probably the most
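A small numeric sketch of the delta argument above: perplexity is the exponential of the average negative log-likelihood, so a model that has collapsed most of its probability onto one phrasing is punished heavily on a reference answer that is equally good but worded differently. The per-token probabilities below are invented purely to show the effect.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-likelihood of the reference tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# Hypothetical probabilities each model assigns to the tokens of the SAME
# reasonable reference answer.
pretrained = [0.20, 0.25, 0.15, 0.30]   # spreads mass over many phrasings
aligned = [0.90, 1e-6, 0.85, 1e-6]      # near-delta on a different wording

print(perplexity(pretrained))  # single digits
print(perplexity(aligned))     # orders of magnitude larger
```

The aligned model is not necessarily a worse chatbot here. It has simply stopped being a calibrated distribution over text, which is what perplexity measures.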
[90:05] common or like the most um yeah the most
[90:08] common Benchmark or the most trusted one
[90:11] is what we call Chat uh sorry Chatbot
[90:12] Arena uh which is basically go on
[90:15] internet have random users on the
[90:17] internet blindly talk with two chat Bots
[90:19] just ask many questions see the two
[90:21] answers and rate which one is better and
[90:24] you do that over hundreds of
[90:25] thousands of users and then you get uh
[90:27] the actual preferences and you get
[90:29] rankings of models uh so you can go
[90:31] right now on Chatbot Arena and actually
[90:34] interact with these models um one
[90:36] potential issue just to highlight is
[90:38] that people who want to do these
[90:40] type of things are usually more like
[90:41] tech driven um or like tech-savvy uh so a
[90:44] lot of the questions that you will ask
[90:46] are more like Tech stuff discussing
[90:47] software errors inquiries about AI tools
[90:50] and all these things um so another issue
[90:53] is cost and speed if you really want to
[90:55] use something like this for development
[90:57] process um it will be too costly because
[90:59] you would need to basically pay a lot of
[91:01] humans to do that so one simple idea is
[91:06] again as we said many times just use llms
[91:08] instead of humans uh you probably know
[91:11] the drill at this point uh steps for
[91:13] every instruction generate outputs by
[91:15] some baseline and the model that you
[91:17] want to evaluate um so here you imagine
[91:20] that I'm comparing an answer from ChatGPT
[91:22] and from
[91:24] I'm just asking a model uh another model
[91:28] uh which one is better and I just
[91:30] basically average that out uh yeah I
[91:32] asked GPT-4 which one is better I average
[91:35] that out over my entire distribution
[91:37] over my entire benchmark or data set and
[91:39] that gives me a win rate so a win
[91:41] probability for one model compared to
[91:43] another one and now you can rank models
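The steps just described (generate an output from a baseline and from the model under test, ask a judge which is better, average over the benchmark) can be sketched as follows. The `win_rate` function and its callable arguments are hypothetical names for this illustration. The judge below is a deliberately crude stand-in that prefers the longer answer, not a real LLM call.

```python
def win_rate(instructions, model_fn, baseline_fn, judge_fn):
    """Average probability that the judge prefers the model over the baseline."""
    wins = 0.0
    for instruction in instructions:
        candidate = model_fn(instruction)
        reference = baseline_fn(instruction)
        # judge_fn returns a value in [0, 1]: 1.0 means the candidate is better.
        wins += judge_fn(instruction, candidate, reference)
    return wins / len(instructions)

def length_biased_judge(instruction, candidate, reference):
    """Stand-in for an LLM judge; it simply prefers the longer answer."""
    if len(candidate) == len(reference):
        return 0.5
    return 1.0 if len(candidate) > len(reference) else 0.0

def baseline(question):
    return "short answer to: " + question

def verbose(question):
    return "a much longer and more detailed answer to: " + question

benchmark = ["tell me about self-driving cars", "what is DPO"]
print(win_rate(benchmark, baseline, baseline, length_biased_judge))  # 0.5
print(win_rate(benchmark, verbose, baseline, length_biased_judge))   # 1.0
```

Comparing a model against itself gives 50% by construction, and a judge with a length bias rewards verbosity regardless of content. That is the failure mode that length control is meant to correct.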
[91:46] uh and this is the AlpacaEval uh
[91:49] leaderboard so the benefits of this is
[91:52] that actually we show we get 98%
[91:54] correlation with Chatbot Arena so very
[91:56] high correlation with humans um so this
[91:59] is yeah comparison with correlation with
[92:01] other benchmarks and it takes less than
[92:03] three minutes and less than $10 to run
[92:05] so it's pretty cheap um there are
[92:07] downsides though uh one of them is spurious
[92:10] correlation um so as we already saw
[92:13] before llms prefer this is one spurious
[92:15] correlation of many I'll just talk
[92:17] about one llms prefer longer outputs
[92:18] actually humans also prefer longer
[92:20] outputs but the problem or the issue
[92:22] once you use llms is that once there's a
[92:24] bias you will continue optimizing that
[92:26] humans at some point I can guarantee you
[92:28] if I ask a simple question and you give
[92:29] me five pages of answers I'll be like no
[92:31] I don't like that answer but llms if they
[92:33] have this bias and they were trained for
[92:34] that they will continue preferring
[92:36] longer outputs so uh here we see um the
[92:41] the preference just showing that like
[92:43] humans and models prefer longer outputs
[92:45] um and here is another view of the
[92:48] initial AlpacaEval uh benchmark
[92:51] where when we asked um when we rank
[92:54] GPT-4 when we look at the win rate of GPT-4
[92:56] versus actually uh GPT-4 itself
[93:00] if we use the standard GPT-4 it gets 50%
[93:02] kind of by definition because we're
[93:03] comparing GPT-4 versus GPT-4 but if we ask
[93:07] GPT-4 to be slightly more verbose so we
[93:09] just say in the prompt be verbose in your
[93:11] answers then it gets a win rate of
[93:13] 64.4% so really there's a huge variance
[93:16] and if we ask it to be concise it gets
[93:18] 20% so there's a huge variance depending
[93:20] on um whether you ask it to be concise
[93:23] or verbose
[93:24] that's very annoying um so one possible
[93:27] solution which is what we did is uh just
[93:29] use some regression analysis I'm not
[93:31] going to go into details but basically
[93:33] use causal inference tools to control for
[93:35] length and right now uh actually length
[93:37] matters much less so if you ask it to be
[93:39] verbose we still get some gains but much
[93:43] less great so that's all about post
[93:46] training and now for the next eight
[93:48] minutes I might talk about systems or
[93:50] just answer questions yes can you um go
[93:54] back to your post training in terms of
[93:56] post training how did we tune those
[93:58] parameters using the small body of
[94:01] fine-tuning data and have such big
[94:04] effect on the model you mentioned
[94:05] earlier that there's a different set of
[94:07] hyperparameters are we changing just
[94:10] some of the weights the later weights or
[94:11] all the weights what's actually
[94:13] happening yeah uh yeah I I kind of
[94:15] skimmed through all of this you change
[94:17] all the weights actually um industry
[94:19] would change all the weights in open
[94:21] source land you might have heard of
[94:23] LoRA which is going to change basically
[94:26] only some of the weights or actually
[94:28] to be more specific it's going to add
[94:30] some differences to the output of
[94:32] every layer but in industry
[94:34] you're going to just fine tune all the
[94:36] weights um and also to say something
[94:39] else about the data actually for the last step
[94:41] RLHF you're usually going to collect uh a
[94:44] lot more data than with sft so if sft is
[94:46] like 5,000 10,000 maybe 50,000 with RLHF
[94:51] I think you're going to be more around
[94:52] like the 1 million
[94:54] uh order of magnitude it's still much
[94:55] less than pre-training though yeah
[94:57] because pre-training is 15 trillion
[94:59] tokens I mean this is like that's not
[95:01] even a drop and yet you influence the
[95:04] weights a lot so because you do it I mean
[95:06] you have to think that how you do it is
[95:08] you use um I mean as I said the learning
[95:12] rate that you're going to use is going
[95:13] to be different but also you only do
[95:15] that so just imagine if I train even if
[95:18] I train on one sentence but over and
[95:20] over again at some point my model
[95:23] will only output that sentence even if uh it
[95:26] was just one sentence instead of the 15
[95:28] trillion tokens so if you use a large
[95:30] enough learning rate and for enough time
[95:32] you will basically overfit that sentence
[95:35] so the key thing to remember
[95:37] is that um it's not
[95:40] as if you mix some post-training data and
[95:42] some pre-training data you do
[95:44] pre-training and then you just start
[95:46] fine-tuning only on the post-training data so
[95:48] another way maybe another perspective is
[95:50] that the pre-training is just
[95:52] the initialization of your model
[95:54] and once you view it that way that this
[95:55] is just initialization of Weights then
[95:58] there's nothing special like you don't
[96:00] need to remember that you trained on a lot of
[96:01] data before the only thing that matters
[96:03] is that you had an initialization and
[96:05] now I actually train a model so maybe
[96:06] think about it that way like there's a
[96:08] Markov property in some way
[96:10] just like you had your weights this is
[96:11] my initialization now I'm training that
[96:13] one does that kind of answer your
[96:15] question kind of but you said something
[96:19] just now about it's almost the
[96:21] equivalence of just rerunning the fine
[96:23] tuning data many times is it actually is
[96:26] that what actually happens in order to
[96:29] give so much more preference
[96:32] um you might I actually don't know right
[96:36] now how they do it in Industry when we
[96:38] did Alpaca we did three epochs so you
[96:40] did run it three times through it
[96:43] um but I mean even the number of times
[96:46] that you run it through it's actually
[96:47] not important the only thing like the
[96:49] only thing is kind of the
[96:51] effective learning rate that's what
[96:53] matters
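A toy version of the "pre-training is just an initialization" argument, assuming a three-token softmax model instead of a real network. The starting logits stand in for pre-trained weights that strongly prefer token 0. Repeated gradient steps on a single target token eventually override that preference, however the starting point was obtained. All names and numbers are illustrative.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def finetune(init_logits, target, lr, steps):
    """Plain gradient descent on cross-entropy for one repeated example."""
    logits = list(init_logits)
    for _ in range(steps):
        probs = softmax(logits)
        for i in range(len(logits)):
            # Gradient of cross-entropy w.r.t. logit i is probs[i] - one_hot[i].
            grad = probs[i] - (1.0 if i == target else 0.0)
            logits[i] -= lr * grad
    return softmax(logits)

pretrained_logits = [5.0, 1.0, 0.0]  # the "initialization" prefers token 0

print(finetune(pretrained_logits, target=2, lr=0.5, steps=3)[2])      # still small
print(finetune(pretrained_logits, target=2, lr=0.5, steps=200)[2])    # close to 1
print(finetune(pretrained_logits, target=2, lr=0.05, steps=2000)[2])  # close to 1
```

In this toy, ten times more passes with a ten times smaller step size ends up in roughly the same place, which is one way to read the remark that the effective learning rate matters more than the number of passes.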
[96:54] um so
[96:56] yeah
[96:57] great so I think I have five minutes
[97:05] right okay I might try to give a high
[97:10] level overview of at least one of the
[97:12] systems tricks systems as we said uh for
[97:17] everyone the bottleneck is uh sorry compute is
[97:20] the huge bottleneck uh one question you
[97:22] might ask is why not buy more gpus uh
[97:25] gpus are expensive but also are scarce
[97:26] even if you have $10 million right now
[97:28] you cannot buy the best gpus um
[97:32] there's oh yeah there's also some
[97:34] physical limitations when you have when
[97:36] you have multiple gpus you have to
[97:37] communicate between them that takes time
[97:40] um so just buying more gpus is not that
[97:42] easy um so it's really important to
[97:44] think about how do you allocate
[97:46] resources and how do you optimize your
[97:47] pipeline so system 101 on gpus I'm sorry
[97:51] I'm going slightly faster I hope
[97:53] that some of you at least can follow uh
[97:55] gpus are basically optimized for
[97:57] throughput CPUs are optimized uh for
[98:00] latency so gpus the way you have to
[98:03] think about it is that
[98:04] there's one command that is run on many
[98:07] many cores at the same time on
[98:08] different type of data um so this is how
[98:12] you see a GPU you see there are many
[98:14] different cores we call them streaming
[98:16] multiprocessors which is very different
[98:18] than the usual CPU architecture so just
[98:20] think high throughput parallelization for
[98:23] gpus uh gpus are optimized for fast
[98:26] matrix multiplication so every time you
[98:28] will do uh you will do something on GPU
[98:30] if you can do it with a a matrix
[98:32] multiplication it's going to be 10 times
[98:34] faster than with anything else uh that
[98:37] is a little bit annoying because it
[98:38] means that we're kind of uh bottlenecked
[98:40] into doing everything with matrix
[98:43] multiplications um another thing to note
[98:45] with gpus is that compute has been
[98:47] improving faster than memory and
[98:49] communication so right now
[98:53] uh the data that
[98:56] you send to gpus actually has a
[98:59] hard time keeping up with the processors so
[99:00] most of your gpus are actually going to
[99:02] be idle if you just run normal code if
[99:05] you don't optimize your code so
[99:06] communication and this will continue
[99:09] over time another thing to know about
[99:11] gpus is that there's a memory hierarchy
[99:13] this is the same thing actually with
[99:14] CPUs but basically the closer you are to
[99:17] your cores the less memory there is but
[99:19] the faster things run if you're further
[99:21] more memory slower
[99:24] um okay I'm going to skip that okay
[99:26] actually I'm going to say it I told you
[99:28] about this uh the fact of communication
[99:31] uh the metric that people usually look
[99:32] at is model flop utilization so what is
[99:34] the theoretical maximum that a GPU could
[99:37] run at the number of flops that you could use
[99:38] per second uh sorry the
[99:41] observed throughput divided by this
[99:43] theoretical um maximum and in general if
[99:47] you reach 50% you're very happy like
[99:49] Facebook I looked at Llama it was at 45 or
[99:51] something like this so that that means
[99:54] that data doesn't come fast enough even
[99:56] for these big
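Model flop utilization is just a ratio, so it can be sketched in a few lines. The estimate of about 6 FLOPs per parameter per training token is a common approximation and is an assumption of this sketch. The model size, throughput, GPU count and per-GPU peak below are hypothetical numbers chosen for the example, not figures from the lecture.

```python
def model_flop_utilization(n_params, tokens_per_second, n_gpus, peak_flops_per_gpu):
    """Observed training FLOP/s divided by the hardware's theoretical peak."""
    # Assumption: roughly 6 FLOPs per parameter per token (forward + backward).
    observed_flops_per_second = 6 * n_params * tokens_per_second
    peak_flops_per_second = n_gpus * peak_flops_per_gpu
    return observed_flops_per_second / peak_flops_per_second

# Hypothetical run: a 7B-parameter model processing 25,000 tokens/s on 8 GPUs,
# each with an assumed peak of 312e12 FLOP/s for 16-bit matrix multiplication.
mfu = model_flop_utilization(
    n_params=7e9,
    tokens_per_second=25_000,
    n_gpus=8,
    peak_flops_per_gpu=312e12,
)
print(f"MFU: {mfu:.1%}")
```

Anything well below 100% means the arithmetic units spend part of their time waiting, typically on memory or communication rather than on compute.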
[99:58] companies so one simple trick and that
[100:00] might be the only one I'm going to tell
[100:02] you about is low Precision one simple
[100:05] idea is that well if I'm going to put my
[100:07] floats in lower Precision then there's
[100:09] going to be fewer bits that I have to
[100:11] send to my gpus if there's fewer bits
[100:13] it's faster communication lower memory
[100:15] consumption things are going to go
[100:16] faster uh and for deep learning it just
[100:19] happens that the decimal precision is not that
[100:21] important uh so so when you do matrix
[100:24] multiplication when you do like for
[100:26] example SGD there's already so much
[100:27] noise that if you update something by
[100:29] 0.01 or
[100:31] 0.015 who cares uh so basically instead
[100:34] of using uh 32 bits per float which is
[100:38] um what people used to use or 64 for
[100:41] example which is what you would use in
[100:42] other domains you use 16 bits uh for
[100:45] matrix multiplication so for every float
[100:47] you use 16 bits um and for training you
[100:50] have this type of like uh what we call
[100:53] automatic mixed precision which is that uh
[100:55] some of the things are in 32 bits others
[100:57] are in 16 bits um generally
[101:00] the way you should be thinking about it
[101:02] is that the weights of
[101:04] your model are stored in 32 bits um but just
[101:07] before the computation you put
[101:08] everything in 16 bits like this you
[101:10] do computation super fast and at the end
[101:13] you update your weights in 32 bits and
[101:16] the reason why you do all the updates in
[101:17] 32 bits it's just think that if your
[101:19] learning rate for example is very small
[101:21] you still want to be able to like make a
[101:23] difference in your weights uh so all the
[101:25] computation is done in 16 bits but the
[101:28] weights are actually stored in 32 bits
[101:30] so that's like the standard way that
[101:32] people are doing it um okay I'll
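The reason for keeping the stored weights in 32 bits can be seen with plain Python, using the standard library's `struct` module to round numbers to 16-bit and 32-bit floats. This is only a numeric illustration of the rounding effect, not how a training framework implements mixed precision. The learning rate, gradient and step count are made up.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def to_fp32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

lr, steps = 1e-4, 100
grad = to_fp16(1.0)        # pretend the gradient came out of a 16-bit computation

w16 = to_fp16(1.0)         # weight stored in 16 bits
w32 = to_fp32(1.0)         # "master" weight stored in 32 bits
for _ in range(steps):
    w16 = to_fp16(w16 - lr * grad)  # update is smaller than fp16 spacing near 1.0
    w32 = to_fp32(w32 - lr * grad)  # update is kept

print(w16)  # 1.0: every small update was rounded away
print(w32)  # about 0.99: the small updates accumulated
```

Near 1.0 a 16-bit float can only move in steps of roughly 0.0005, so an update of 0.0001 rounds back to where it started. The 32-bit copy has enough resolution for a small learning rate to make a difference.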
[101:35] actually talk just about this and then
[101:37] I'll skip all the rest operator Fusion
[101:38] because I think this is actually pretty
[101:39] cool as I just said communication is
[101:41] very slow and actually every time you
[101:44] use a PyTorch line it basically moves
[101:46] the variable to global memory of your GPU so
[101:49] when you have something like this x dot
[101:52] cosine uh equals x1 and then you do x1 dot
[101:55] cosine what is happening behind the
[101:57] scenes is that you take the x which is
[101:59] data you ship it to your um to the
[102:02] actual processors of your gpus you apply
[102:04] the cosine you ship it back to the main
[102:06] memory of your GPU and then you see the
[102:08] next line you ship it back to the
[102:10] compute to the GPU processor you apply
[102:12] another cosine and you ship it back
[102:14] again um so another way to see that is
[102:17] that you go from your DRAM which is your
[102:18] Global memory in your GPU and you ship
[102:21] it to compute you ship it back for every
[102:23] line This is a naive way of doing it
[102:25] this seems very wasteful um so the idea
[102:29] simple idea of operator fusion is just
[102:32] communicate do all the computation ship
[102:34] it back once and this is exactly what
[102:37] fused kernels are um so if you ever want
[102:40] to make your computations in
[102:44] PyTorch much faster just apply
[102:47] torch.compile on your model this is going to
[102:49] make your model around two times faster
[102:52] and what it does is simply that it
[102:53] rewrites your code uh like your
[102:57] PyTorch code basically in C++ and CUDA
[103:01] uh to do the communication only once
[103:04] then do all the operations then uh ship
[103:06] it back okay I'm not going to have time
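The double-cosine example can be mimicked with a toy cost model in plain Python that counts trips to global memory. This is not GPU code and not what torch.compile actually emits. The `GlobalMemory` class and both functions are hypothetical stand-ins that only show why fusing two elementwise operations halves the memory traffic.

```python
import math

class GlobalMemory:
    """Toy stand-in for GPU DRAM that counts loads and stores."""
    def __init__(self, data):
        self.data = list(data)
        self.loads = 0
        self.stores = 0

    def load(self):
        self.loads += 1
        return list(self.data)

    def store(self, values):
        self.stores += 1
        self.data = list(values)

def unfused_double_cos(mem):
    x = mem.load()                         # ship x to the compute units
    mem.store([math.cos(v) for v in x])    # x1 = cos(x), shipped back
    x1 = mem.load()                        # ship x1 to the compute units again
    mem.store([math.cos(v) for v in x1])   # x2 = cos(x1), shipped back

def fused_double_cos(mem):
    x = mem.load()                                  # ship x once
    mem.store([math.cos(math.cos(v)) for v in x])   # both cosines, one write

naive = GlobalMemory([0.0, 0.5, 1.0])
fused = GlobalMemory([0.0, 0.5, 1.0])
unfused_double_cos(naive)
fused_double_cos(fused)
print(naive.loads + naive.stores)  # 4 round trips' worth of traffic
print(fused.loads + fused.stores)  # 2
print(naive.data == fused.data)    # True: same result
```

The arithmetic is identical in both versions. Only the number of times the data crosses between memory and compute changes, which is where the time goes when communication is the bottleneck.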
[103:09] to talk about tiling tiling is important
[103:11] parallelization parallelization is important um and
[103:16] mixture of experts mixture of experts is
[103:18] important outlook there are many things
[103:20] we haven't talked about we haven't
[103:23] talked about architectures we definitely
[103:25] haven't talked about inference um there
[103:27] are many other things that are important
[103:29] with llms what is the UI that you use I
[103:31] mean arguably with ChatGPT the big novelty
[103:33] was just to have a simple UI to use it
[103:35] multimodality what are all the misuses
[103:37] you could have uh the fact that there
[103:39] might not be enough data on the internet
[103:41] to train all these models legality of
[103:43] data collection so many other things if
[103:45] you are interested in all these topics
[103:47] uh I would suggest three classes CS224N
[103:50] is probably the one that touches the
[103:52] least on llms uh but it gives some
[103:55] background and historical context um of
[103:58] all the llms and gives kind of some
[104:00] adjacent material CS 324 I think it's
[104:03] called Uh I think it's just called large
[104:06] language models uh more in-depth reading
[104:08] and lectures on everything I talked
[104:09] about CS 336 which is large language
[104:12] model from scratch you actually build
[104:14] your own llm uh it's an amazing class
[104:18] also given by my two supervisors very
[104:20] heavy workload so be careful and um
[104:23] great