[0:05] Hey guys, welcome to our last lecture [0:08] of this quarter. We're very happy [0:12] to have Douwe Kiela here. He's the CEO of [0:16] Contextual AI, the enterprise LLM [0:19] company, as well as an adjunct professor [0:22] in Symbolic Systems here at Stanford. [0:25] Previously he was the head of research [0:26] at Hugging Face, and before that a [0:28] research scientist at Facebook AI Research. [0:32] He received his PhD and master's from [0:34] the University of Cambridge, as well [0:36] as a master's in logic from the University [0:38] of Amsterdam, and studied philosophy and [0:40] cognitive AI in undergrad. His [0:43] work focuses on machine learning as well [0:45] as NLP, specifically on developing better [0:48] models for language understanding and [0:50] generation and better tools for [0:52] evaluation and benchmarking. Yeah, give it up for [0:56] Douwe. [0:58] Right, thank you. So I guess I have to [1:00] stand here in the corner so [1:02] people can see me on the Zoom as well. [1:06] Yeah, thanks so much for having me here. [1:09] So I asked Steph what I should talk [1:11] about. There were a couple of things I [1:13] could talk about, multimodality or [1:15] evaluation, and this was the preferred [1:18] topic, I guess because the others were [1:20] already covered. So yeah, I'm very [1:24] happy to talk to you about everything [1:25] retrieval augmentation. I think this [1:28] is really one of the topics right now in [1:31] our field, so I'll just give you an [1:34] overview of what's been happening and [1:36] what I think are the interesting [1:37] questions to think about. So first of [1:41] all, obviously, in case you've missed it, [1:43] we are in the age of language models. [1:46] And I just wanted to do a quick poll [1:48] here in this not super big audience; [1:51] I guess there are more people on the Zoom. [1:52] But who invented language [1:55] models? If you thought OpenAI, then [1:58] I'm angry with you.
Right, so actually [2:01] this is a very, very old idea. The [2:04] idea is just that you take a sequence and [2:06] you factorize out the token [2:07] probabilities, right? So it wasn't [2:11] invented by OpenAI, and it's not a few [2:14] years old; it's actually several decades [2:16] old. I'm bringing this up because I [2:19] was talking to someone and they were [2:20] like, "OpenAI invented language models," [2:22] and I was like, "You're kidding me, right?" [2:24] So I went back to the [2:27] literature, and this is the oldest one I [2:29] could find: 1991, the first neural [2:31] language model. There's a very nice [2:34] paper from 2003 from [2:36] Bengio where they actually have [2:38] word embeddings and everything already [2:40] in there. Obviously these are LMs, [2:43] not LLMs. And as it turns out, if you [2:46] make them really big and you [2:48] parameterize them with these massive [2:50] neural nets, then you get something [2:52] really powerful that really shows [2:53] emergent properties, right? And that's [2:55] why we're all excited about this stuff. [2:59] So if we think about this from a [3:01] classic CS perspective, there's input and [3:03] output, right? And there's this thing [3:05] in the middle: the generator. So [3:07] we take a sequence, the input sequence, [3:10] and then the task of the model is to [3:12] predict the next token. A very, very simple [3:15] model. And that's why [3:18] it was so easy to come up with this in [3:20] 1991 already, because the idea [3:22] is very intuitive. But for a long time [3:25] what was really broken with this was the [3:27] user interface. And I think a [3:31] lot of people misunderstand what [3:33] ChatGPT was about; that's really what [3:35] ChatGPT fixed, right? [3:38] Initially you had to come up with these [3:39] very weird prompts in order to get your [3:41] language model to do what you wanted it
[3:43] to do, and humans are terrible at this, [3:46] right? We're much better at [3:48] telling people or things around us what [3:50] we want. If we have a dog, we say [3:52] "sit"; we don't prompt it in a very weird [3:54] way so that it sits, right? And it's the [3:57] same with a language model: if you [3:58] want it to generate some rap lyrics in the [4:01] style of a pirate or Shakespeare or [4:03] something, then you tell it, "Generate some [4:05] rap lyrics in the style of a pirate," right? [4:07] But that kind of instruction data [4:10] actually turns out to be super, super [4:12] rare in just web data. So what you need [4:14] to do is fix the user [4:16] interface to the language model, and [4:18] the classic recipe for doing that is [4:21] the sequence that ChatGPT [4:22] used, right? You prompt the model in a [4:24] specific way, you instruction-finetune [4:26] the model, and you do some alignment, RLHF [4:29] or whatever you do on top of that. So [4:31] that's the first thing. So now you have a [4:33] working language model with a working [4:36] user interface. So are we done then? [4:40] Obviously we're not, right? Right [4:42] now language models are kind of [4:43] taking the world by storm, but if you [4:45] talk to anyone, especially in an [4:47] enterprise, for example, where they have [4:48] very strict accuracy requirements, [4:51] they will tell you that they can't [4:53] really productionize this yet. And the [4:55] reason is that there are all these [4:57] familiar problems; probably a bunch of [4:58] you are working on these problems right [5:00] now. There's [5:02] hallucination: these models [5:04] make up stuff very often, with [5:05] very high confidence, which is even [5:08] more scary in a way. Attribution: [5:11] we don't really know why these models are [5:12] saying what they're saying. Staleness: [5:15] they go out of date, and this was a [5:17] big problem with
ChatGPT not [5:19] knowing anything that happened after a [5:20] certain cutoff date. They keep [5:22] updating it every once in a while, but [5:24] you want to have a system that's always [5:25] completely up to date, that never goes [5:27] stale. You want to be able to [5:29] revise the information in the system. So [5:32] if you're a European organization [5:34] you have to worry about GDPR, which [5:36] means that you need to be able to remove [5:38] information from the language model, or [5:40] maybe revise facts, which we don't [5:42] really know how to do, right? Again, [5:44] this is a very interesting area of [5:46] study for a lot of folks: model editing. [5:49] But this is something that we [5:51] really want to be able to fix. And then [5:53] there's this big question of how you [5:55] customize these models. Different [5:58] people have different use cases, you have [6:00] different data; if you're a company, or if [6:01] you want to have a language model on [6:03] your own data, how do you make it work on [6:05] your own data? So one of the solutions [6:08] that everybody has started using right [6:11] now is to couple it to an external [6:12] memory. That's really just RAG, right? [6:15] This whole lecture is [6:17] basically about RAG. But the way to [6:20] understand what is going on here is: [6:23] we have this generator just like [6:25] before, we have the input and a prompt just [6:27] like before, but now, instead of just [6:29] those two things, we give it this additional [6:32] context. So we contextualize the language [6:34] model using things we retrieve. And [6:37] the retriever is very often pretty [6:40] simple: it's just a query and a document [6:42] encoder. Then you get a bunch of [6:44] documents and you give them as context [6:46] to the model. So it's a super simple [6:49] architecture. And I think it's useful [6:53] to think about it from the perspective [6:54] of these two
separate paradigms. [6:57] If you've ever taken an exam, and I'm sure [6:59] you have, right, you can have a [7:01] closed-book exam where you have to memorize all [7:03] of this, so you have to cram all the [7:04] knowledge into your parameters, your [7:07] neurons; or you have an open-book exam [7:09] where you have all of this information [7:11] in the book that you can access when you [7:12] do the exam. It's a very similar [7:15] thing with RAG, right? You can just make [7:17] it an open-book setting where you give [7:18] it access to this external information, [7:21] Wikipedia or something else or basically [7:23] the entire internet, and then have the [7:26] language model do its job without having [7:27] to memorize all of it in its [7:30] parameters. The other, I think, [7:32] useful distinction here is that [7:35] cramming everything into your parameters [7:36] is the parametric approach, right? So [7:39] what we're doing with RAG is we're [7:40] adding this non-parametric retrieval [7:43] component. You might call this [7:45] semi-parametric if you want to give [7:48] this a [7:49] name. All right, so why does that [7:52] actually solve these issues? The [7:55] answer is basically that if you have [7:57] this separate index, right, this separate [7:59] retriever, you can swap it in, you can [8:01] swap it out, you can replace it with a [8:03] new index, so you can really customize it. [8:06] And so you can customize your language [8:07] model system for what the user really [8:10] wants to see. And then obviously you [8:13] can update this index, so it doesn't [8:15] really go stale, and you can revise it if [8:17] anything goes [8:20] wrong. The other thing you get is [8:22] grounding, right? That's initially [8:24] why I became interested in this kind of [8:26] architecture, because I was thinking a [8:27] lot about grounding and multimodality and [8:29] things like that, and actually one really
[8:31] nice way to ground things is to find [8:33] some other information that you can [8:35] ground your generation in. You really [8:37] want the language model to only say [8:39] things that it has evidence for in this [8:41] other piece of text, or even multimodal [8:44] data, that it retrieves separately. If [8:46] you do that, then you get less [8:47] hallucination, because you can always [8:49] point back to your source; it's always [8:50] grounded in your source. And you get [8:53] attribution, because you know why the [8:54] model is saying what it's saying: it's [8:56] because it found this thing here. Is [8:59] that [9:02] all right? So [9:05] for the rest of this [9:06] lecture we're going to talk about this basic architecture. [9:10] It kind of looks like a pretty simple thing, [9:12] right? But there are actually lots and [9:13] lots of questions you can ask about [9:16] what this system should really look like, [9:18] and this doesn't even cover [9:20] half the questions you can ask. So [9:23] it really is about how we [9:25] optimize this entire system, right? We [9:28] have the separate components, the [9:29] retriever and the generator, and then [9:32] there are things like this query encoder: [9:34] how do we encode queries, how do we do [9:37] the retrieval, do we update the document [9:39] encoder, how do we actually define a [9:42] document, right? Is it a full [9:44] document, or is it a paragraph, or a chunk, [9:46] or a sentence, or a couple of words? So [9:48] there are lots of questions to ask, [9:51] and as you'll see there are lots of [9:53] possible answers to these questions as [9:55] well. So this is what we'll [9:58] cover. [10:00] So there are lots of [10:03] architectures going into these [10:05] questions, and I think as we go through [10:07] them it's useful for you to think about [10:10] what happens during training time and [10:12] what happens during test time.
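As an editorial aside, the basic architecture described above (a retriever that scores documents against a query, and a generator that receives the retrieved text as extra context) might be sketched roughly as follows. This is a minimal toy under stated assumptions, not the speaker's implementation: the bag-of-words `encode`, the `retrieve` and `build_prompt` helpers, the prompt layout, and the placeholder `generate` function are all hypothetical names chosen for illustration, and a real system would use learned encoders and an actual language model.

```python
from collections import Counter


def encode(text):
    """Toy encoder (assumption): bag-of-words counts stand in for a learned encoder."""
    return Counter(text.lower().split())


def score(query_vec, doc_vec):
    """Dot product between two sparse count vectors."""
    return sum(count * doc_vec[term] for term, count in query_vec.items())


def retrieve(query, documents, k=2):
    """Return the k documents with the highest score for the query."""
    q = encode(query)
    ranked = sorted(documents, key=lambda d: score(q, encode(d)), reverse=True)
    return ranked[:k]


def build_prompt(query, retrieved):
    """Put the retrieved documents in front of the question as context."""
    context = "\n".join(f"- {doc}" for doc in retrieved)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


def generate(prompt):
    """Placeholder generator: a real system would call a language model here."""
    return f"<LM output conditioned on {len(prompt)} prompt characters>"


documents = [
    "Warsaw is the capital of Poland",
    "BM25 is a sparse retrieval method",
    "Dense retrievers embed passages as vectors",
]
query = "what is the capital of Poland"

top_docs = retrieve(query, documents, k=1)
prompt = build_prompt(query, top_docs)
answer = generate(prompt)
print(prompt)
```

Each function here corresponds to one of the design questions raised in the talk: `encode` is the query and document encoder, the entries of `documents` fix what counts as a "document" (here, one sentence each), and `generate` is the component that may or may not be trained together with the retriever.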
Right, so [10:14] during training time it's really: okay, [10:16] we have this language model, we have this [10:18] retriever; which one do we update? How [10:21] do we update them? How do we train this [10:23] entire system? Do we maybe not train it [10:25] at all? Do we pre-train it from [10:28] scratch? Do we initialize it with [10:30] components that were already separately [10:32] trained? These are the kinds of questions [10:34] that you have to answer if you want [10:35] to design a system like this. And then [10:38] during test time you have this entire [10:41] system, right, so actually multiple models [10:43] in a way, that are working together. [10:47] So there are also different things you [10:48] can do there, right: give it different [10:50] indices during test time, or [10:52] manipulate how you're sampling, things [10:54] like [10:55] that. So the starting point for all of [10:59] this stuff, I think, if you ask someone [11:00] now "what is RAG?", they will think of [11:02] this thing. So this is frozen RAG, [11:06] basically. There's no training here at [11:09] all. Going back to this question of [11:10] train time and test time: there's only test [11:12] time here; train time happens separately, [11:14] with these kind of black-box models that [11:16] we don't necessarily have control over, [11:18] right? So there's this document embedding [11:20] model, whatever is currently at the [11:23] top of some open-source leaderboard, and [11:26] you use that to get [11:29] some vectors that you then use to create [11:32] this vector database, and then the vector [11:34] database just does search, and it gives [11:36] the information from the search to the [11:38] language model; it just passes it as [11:41] the context, right? So this [11:44] only works because of in-context [11:46] learning. And, you know, I think as a [11:50] machine learner myself, this feels very [11:52] inelegant. So what
this lecture [11:55] is about is: can we do better than [11:57] this frozen [11:59] thing? So let's start from [12:03] the left side of this. Okay, if we [12:05] want to outperform this frozen thing [12:07] itself, with just the vector database, [12:09] what would that look like from a [12:11] retrieval [12:12] perspective? The starting point [12:15] for everything retrieval is TF-IDF. [12:17] Does everybody know what TF-IDF is? No? [12:22] Okay, so TF-IDF is basically a sparse [12:25] retrieval method where you have a score [12:27] function that looks at documents [12:30] and queries, so d and q, and then there [12:33] are basically two terms that matter: one [12:35] is the TF, the term frequency, and the [12:37] other is the IDF, the inverse document [12:39] frequency. This inverse document [12:41] frequency is actually a really nice idea [12:43] from Karen Spärck Jones, a really underrated [12:45] researcher; she's done some amazing work. [12:48] But the basic idea is that you want [12:51] to look at the words that are very [12:53] special, that don't occur in lots of [12:54] different documents. So the overlap [12:56] on the word "the" doesn't really [12:58] matter, right? "The" occurs [13:00] everywhere. So you want to have [13:02] the special words. That's what [13:04] TF-IDF does in a nutshell: it gives you a [13:06] score for document-query overlap. And [13:10] then you can do all kinds of things here [13:12] with how you weigh it, so there are all [13:14] these weird different parameters, like [13:15] this b and things like that, that allow [13:18] you to make it better than just having [13:20] the TF-IDF score. So there's a couple [13:22] of tweaks you can do there, which gives you BM25. BM25, [13:25] in case you're wondering, stands [13:27] for "best match 25". [13:29] I tried to discover where the [13:31] 25 actually comes from: that's [13:34] because the preceding 24 [13:37]
experiments failed, right? So it's [13:39] literally the 25th one that seemed to [13:41] work, and that's why it's called [13:42] BM25. It's bizarre, right? So [13:46] this is sparse retrieval; it's just [13:48] counting words, right? You have this [13:50] massive, massive vector of all these word [13:53] occurrences. It's sparse because most [13:55] words never occur in a given document, right? It's [13:57] like a vector with vocabulary-size [14:01] dimensions, so most of that is obviously [14:03] zero. But that's actually kind of a [14:06] nice property if you want to do fast [14:08] search on a CPU, right, because on a CPU [14:10] a sparse dot product is very easy to [14:13] compute. This is used in the [14:16] system called DrQA, which is really [14:19] one of the first neural instances of [14:22] this open-domain, open-book [14:24] question answering paradigm. So you [14:27] have a question like "How many of [14:29] Warsaw's inhabitants...", blah blah, and [14:32] you want to ask basically Wikipedia what the [14:34] answer is for this. So then you have this [14:36] document retriever based on the sparse [14:38] retrieval methods, BM25 I think in this case; [14:41] you pass that to, [14:44] I think this was still a BiLSTM at [14:47] the time, a document reader model, and [14:50] then that model gives you the answer. [14:54] So this I think is really the first [14:56] instance of having this [14:57] separation between a retriever and a [14:59] generator system that you use for [15:02] answering complicated questions based on [15:03] open-domain [15:05] knowledge. So after the sparse stuff [15:10] there was a bunch of work on dense [15:11] retrieval. So the advantage of [15:14] dense retrieval... so this is just like [15:16] word embeddings, basically: vectors, right? [15:18] They're dense now, no longer [15:19] sparse, so they're much smaller in [15:22] terms of dimensionality. And the nice [15:24] advantage of
dense retrieval is that [15:27] it's not really about specific words, [15:28] right? If there are synonyms, you can [15:31] still find the relevant document, [15:35] which you couldn't really do with a [15:36] sparse representation, right? So that's [15:38] really the advantage of dense: you [15:40] get semantic [15:41] similarity. You can do this over [15:45] word embeddings; that doesn't really work [15:46] all that well, but at the time that [15:49] people started thinking about this, BERT [15:50] was already out there, and BERT is really [15:52] great for giving you a vector [15:53] representation for an entire sequence of [15:55] words, right, so a sentence representation [15:57] or a passage representation. [15:59] So there are all these cool systems like [16:01] ORQA and DPR, the Dense Passage [16:04] Retriever, where they essentially use [16:08] the retrieval as a kind of latent [16:09] variable in the system. And the way [16:12] to get the latent variable to work, to [16:14] be good enough essentially to train the [16:16] entire system, is to pre-train the [16:19] retriever on relevant information. So [16:21] for ORQA they do something called the inverse [16:24] cloze task, a kind of cloze task [16:27] where you want to find [16:29] passages that are relevant to [16:31] the preceding passage, and in DPR they [16:34] just train it on a supervised thing. [16:36] But really the core idea here is that, [16:38] as you can see in this graph here, you [16:40] can do better than BM25 if you add lots [16:43] of documents, and the way you compute [16:45] this score function is much simpler: it's [16:47] just a dot [16:48] product, right? The nice thing about [16:52] dot products is that you can do them very, [16:55] very efficiently on the GPU as well, [16:58] if you know what you're doing. So what [17:01] you really want to get at is maximum inner [17:04] product search, MIPS, right? This is one of [17:05] the core
ideas of a lot of this [17:07] stuff. And you can do MIPS with ANN, [17:12] approximate nearest neighbor search. And [17:14] there's this really brilliant [17:17] piece of work out of FAIR, from my [17:19] colleagues at the time, called FAISS, [17:22] which really underlies all of these [17:24] modern vector databases, right? [17:27] All the popular ones are sort of [17:28] re-implementations of this FAISS idea: one [17:30] is in Rust, one is in Go, but it's [17:32] all basically the same idea; it's just [17:34] FAISS. So FAISS really powers a [17:37] lot of this stuff, and whenever [17:40] somebody tells you something about a [17:41] vector database, just think about FAISS: [17:44] very fast dot [17:46] product. So obviously you can go [17:49] beyond dot product. Yes? What is it, what is [17:53] FAISS? So it's an open-source [17:56] library: Facebook AI Similarity [18:02] Search. No, so it's just basic off-the-shelf [18:04] ANN [18:09] algorithms. Yeah, so there are all [18:12] kinds of different... I don't know, [18:13] do you know what product [18:14] quantization is, and things like that? [18:17] So basically you have a [18:18] bunch of vectors, and you can just [18:21] compute the full dot product, which is [18:23] sort of inefficient, right? So what you [18:25] can do is try to compress subspaces [18:28] of the vector and then just look at the [18:30] [18:31] centroids. So you can quantize [18:34] sub-vectors of the full vector and then [18:36] do much faster search over just the [18:41] centroids. It's a good question. Any other [18:46] questions? All right, so about this [18:49] dot product idea, right: what we [18:52] have here is what some people call a [18:54] Siamese network, and I guess it is, right? [18:56] You have two different BERT models, or [18:59] whatever your encoder is here, and then [19:00] at the end you get these two vectors and [19:02] then you just do a dot product, so you get
[19:04] one single score. But you can do all [19:06] kinds of much fancier things if [19:08] you're willing to give up on this [19:10] bi-encoder approach, right? A really [19:13] nice example, from one of your [19:15] colleagues here at Stanford, is [19:17] ColBERT. What this does is late [19:21] interaction: instead of just [19:24] having this dot product here, you have a [19:26] more complicated [19:28] version of computing a score, where you [19:30] aggregate over maximum [19:32] similarity scores between different [19:34] words. I only recently actually [19:36] discovered that this is called ColBERT [19:38] because of the late night show, Colbert, [19:40] so it's sort of Omar's joke, actually, [19:43] this name; just so you know if [19:45] you run into it. But I think [19:51] if we look at where the [19:52] state of the art has been going now, [19:55] one of the nice things about these [19:56] vector databases is that they're super [19:58] efficient, right? Dot product is much [20:00] more efficient than this late [20:01] interaction stuff, especially if you do [20:03] the approximate nearest neighbor search. [20:05] But there's been some really cool [20:07] work, so things like SPLADE, where they [20:11] basically have sparse meet dense, in [20:14] a way. One of the big problems, as I [20:15] said, with sparse is that you can't really [20:17] handle synonyms and things like that. But [20:19] what you could do is take a dense model [20:22] like a BERT model, look at [20:24] this one word in your sequence, and try to [20:27] see which other words fit in the same slot; [20:29] that gives you the synonyms. So now [20:32] you can give all these synonyms to a [20:34] sparse vector, and then you can just [20:36] do a sparse dot product, and so have a much, [20:39] much more efficient way to do search [20:42] without giving up on all [20:44] the cool stuff that you get
from a dense [20:46] representation. So that's one thing, [20:49] and this other idea I really like [20:51] is called DRAGON. This I think is [20:54] really the best generalized [20:57] dense retriever, so if you want to take [20:58] something off the shelf right now and [20:59] just go to Hugging Face or something, [21:01] then this DRAGON or DRAGON+ is [21:03] probably the thing you want to use for a [21:05] dense retriever. The way they train [21:07] this is through this progressive data [21:10] augmentation strategy, to make the [21:12] model better and better over time by [21:13] sampling very difficult negatives, and [21:16] that gives you very good [21:19] representations. And so the other [21:21] thing about this, and I think this is the [21:23] only final point about [21:26] retrieval in general, is that [21:27] what we see happening right now, if you [21:29] look at the developer community [21:31] around RAG, is that they're all doing [21:32] hybrid search. You can [21:35] actually just combine the search results [21:37] from your sparse BM25, or whatever thing, [21:40] or SPLADE, and you can combine them with [21:42] your DRAGON, and then you get this [21:45] ranking that works even better. So [21:47] then you kind of get the best of both worlds, [21:48] but then you get all these questions [21:50] about how you combine the [21:52] results. Any questions on this [21:56] part? Oh, can you hear me? [21:59] Yes. Oh, sorry. On the earlier slide, [22:02] has there been any work on [22:04] benchmarking how much less hallucination [22:07] RAG incurs over closed-book question [22:10] answering, for example directly asking [22:12] the large language model the question? [22:14] Have there been any benchmarking studies [22:16] on this? Yeah, so there's a great [22:18] paper, if I can say so myself, on the fact [22:21] that retrieval augmentation reduces [22:23]
hallucination; it's from 2021, I think. [22:26] So yeah, you can just find it: if you [22:29] literally look for "retrieval [22:30] augmentation reduces hallucination" then [22:32] you'll find the paper. Thank [22:43] you. Yeah, so very often you want to [22:47] have a very precise word overlap for [22:51] things where you don't want to have the [22:53] synonyms or the nearest [22:54] neighbors, right? If there's a [22:57] brand name or something like [22:59] that, let's say the brand is [23:01] Apple, right, you don't want to find stuff [23:03] about pears, right? And that's what you [23:05] would get with a dense retriever. So [23:08] it really depends on what you [23:11] want to use it for; that's why hybrid is [23:13] probably the way to [23:14] go. It's a good [23:17] question. With the [23:19] dense, it's [23:21] contextualized, but [23:24] shouldn't it realize Apple the company [23:26] would be different from... No, so if they [23:29] were actually contextualized then yes, [23:31] but very often it's a frozen [23:33] retrieval system, right? That's one of the [23:35] problems with all the frozen RAG [23:41] stuff. I might be missing very [23:44] b... referring to the vectors that [23:48] you're, vectors that you're using is [23:52] or... No, so the [23:58] document and the query, they're the [24:00] same, right? They're either sparse or [24:02] they're dense. If they're sparse, [24:04] the components of the vector are [24:06] literally the other [24:09] words. You just Oneal when [24:12] you're, the thing that [24:16] creates... how are you getting... So it's [24:20] literally counts, right? Basically [24:23] it's one big matrix of documents as [24:26] rows, and the columns are the words in [24:28] the documents, and then you just count [24:30] how often a word occurs in a document, [24:33] right? So that's as [24:35] far... also [24:39] referring... Yeah, and so in the field
we [24:42] call them sparse embeddings or [24:45] sparse retrieval because most of that [24:47] vector is zero, right, because most words [24:50] don't occur in that [24:53] document. Does that make sense? [24:56] Yeah. [24:58] Cool. So let's talk about doing [25:04] slightly better. Going back to [25:05] Stephen's question about, okay, we have [25:07] this retrieval thing, but [25:09] how do we actually make this retriever [25:11] good for the context that it is going to be [25:13] used in, right? Can we contextualize [25:15] the retriever for the generator, even [25:18] if it's a generator where we might [25:20] not have access to the weights? So it [25:22] could be a GPT-4 model: we just send it to [25:24] some API, we get some stuff back. [25:28] One paper I really like is [25:30] called REPLUG. Just to [25:33] explain what this looks like: you have [25:35] this context, you have a retriever that [25:37] we do the standard retrieval step [25:39] with; this is a dense retriever. [25:42] And now you compute [25:45] the likelihood, so basically just [25:47] normalize the scores that you get [25:49] for the top-k documents to get a [25:52] distribution here. And then you give [25:54] each one of the retrieved documents [25:57] separately to this generator, to your [25:59] language model, so you can look at the [26:02] perplexity of the correct answer for [26:04] that language model, right? So now we have [26:06] these two probability distributions, or [26:08] two likelihoods essentially, and we can [26:10] minimize the KL divergence to make sure [26:13] that we actually retrieve the [26:15] documents that lead to the lowest [26:17] perplexity on the right answer for the [26:19] language model. So it's a super simple idea [26:23] and it works really, really well. And the [26:26] nice thing about this is that it's completely [26:28] agnostic of what happens upstream, [26:30] right? So this
will work for any sort of [26:32] encoder-decoder, for any language model. [26:35] What you need is a perplexity [26:38] score, but for most language models [26:40] you can get that, not necessarily all of [26:42] them. So that's one thing, and then [26:44] there's this other really nice approach. [26:47] What parameters are you [26:50] changing? So in the retriever you're [26:53] literally updating the [26:56] dense representations, [26:58] right, so your encoder, basically, for your [27:00] dense representation. That's a good [27:01] question; we'll get to more. So there's [27:05] this other paper on in-context [27:07] retrieval-augmented language models, [27:09] where the whole paper is basically about [27:12] just doing BM25 and just giving stuff [27:15] directly to the context of the language [27:16] model, and things kind of work. So it's [27:18] sort of frozen RAG, but even [27:21] more primitive in a way, where the [27:23] retriever is this very old sparse [27:26] algorithm, but it works really, really [27:27] well. But then they have this really [27:30] awesome section where they show [27:32] that you can just have this ranker on [27:35] top of the BM25 results, and you can [27:38] backprop into this ranker. So now you [27:40] still keep the language model completely [27:42] fixed; that's this part of [27:45] the loss here, so you have kind of [27:47] a stop-gradient on the parameters theta; [27:49] that's just your language model. But now [27:51] you have this rank [27:54] function here that you can backprop [27:56] into, right? So your ranker [27:58] can basically be a BERT model or anything [28:00] like that, which works on top of the [28:01] things you initially retrieve from your [28:03] BM25, and now you have this BERT reranker [28:05] that you can backprop into. So [28:09] this also works really, really nicely. So [28:11] we're slowly progressing towards
having [28:13] a system that is much more optimized for [28:16] being properly retrieval-augmented, in [28:19] a way where it's useful and [28:20] contextualized for what you want to use [28:22] it [28:23] for. So, yeah, just to point out [28:26] what that looks like with this ranker: [28:28] you just have this extra step, [28:29] essentially, right? We have our [28:31] retriever, then we have our ranker, then [28:33] we have our generator and our [28:38] output. No, not [28:41] necessarily. So for this one you [28:44] do, yeah, but for REPLUG you don't, [28:47] right? Yeah, yeah. So [28:52] basically, yeah, you need to get... do APIs [28:54] provide... not all of them. Some of them [28:57] do, right? But yeah, there are all [28:59] kinds of tricks you can do on top of [29:01] that, [29:02] yeah. So [29:04] basically the question is how do we [29:07] get gradients flowing into this, [29:09] right? If you don't actually have [29:10] access to the full parameters of the model [29:13] so that you can backprop all the way [29:14] through it, then you can do a [29:17] REINFORCE-style loss on the retrieval, [29:20] and then you just pass the [29:22] log-likelihood, if you have access to [29:23] that, or some other kind of black-box [29:26] function. [29:31] All right, so the next thing you can [29:35] do is to optimize both the retriever [29:38] and the generator. And this [29:41] really starts getting to the [29:43] proper contextualization of [29:45] the entire architecture, where you want [29:47] everything to work together, right? So [29:49] rather than having this frozen thing [29:50] where everything is basically not aware [29:52] that the other part exists, right, it's [29:54] like two halves of the brain that are not [29:55] talking to each other: one is your [29:57] retriever, the other is your language model, [29:58] there's no connection; it's just like [30:00] something
is thrown over [30:01] the fence and then you hope for the best [30:03] uh so instead of that we have everything [30:05] much closer and learning together um so [30:09] um one of the first um ways of doing [30:13] this with a generator uh was RAG [30:15] retrieval augmented generation uh which [30:17] we did at FAIR in 2020 um and and it's [30:22] very similar to what we've already seen [30:23] we basically have this retriever here [30:25] that works over different documents you [30:27] get some score function uh that gets [30:29] given to this generator um that [30:32] generates the answer and now you want to [30:34] backprop all the way and update your [30:36] generator as well right so in the [30:38] previous two architectures we saw you [30:40] keep the generator fixed you backprop [30:42] into your retriever but here we update [30:45] everything well not exactly everything [30:47] as you'll see but we'll also [30:49] update part of the retriever and [30:52] the [30:53] generator um so in this RAG model uh we [30:56] actually have two different ways of [30:58] doing this and this is probably [31:00] something that when we talk about this [31:03] uh if you think about this long enough [31:04] then you'll think like okay but [31:06] when actually do I need to retrieve like [31:08] do I retrieve every time I generate [31:11] a new token or do I just retrieve once [31:13] and then generate an entire sequence [31:16] right or maybe I want to retrieve every [31:18] n uh tokens right so these are [31:21] hyperparameters or maybe I want to learn when to [31:22] retrieve as we'll see that's also [31:24] something people have done um so there are [31:27] two different ways to do it um and and [31:30] what we do in this paper basically the whole [31:32] point of the paper is that this frozen [31:34] thing doesn't really work all that well [31:37] right so I think what people call RAG [31:39] now usually refers to the [31:42] frozen
thing uh but the whole paper [31:44] basically would never have been accepted [31:46] anywhere if we had just done the frozen [31:47] thing right the whole point of the paper [31:49] is that you want to uh optimize it and [31:52] so at my company Contextual we call this [31:55] frozen thing Frankenstein's monster [31:57] because it's really like you cobble [31:58] together these different pieces right [32:00] you sort of yeah it's really like [32:02] Frankenstein you just put it together [32:04] and then it sort of walks you know uh [32:05] but it doesn't really have a soul it [32:07] doesn't really actually work it's not [32:08] the real thing um so that's great [32:12] for everyone here I think because there [32:14] are so many opportunities to do better [32:15] than what most people are using [32:17] right [32:18] now um so one of the limitations of [32:22] the original RAG architecture is that it [32:25] only supports a very small k but so [32:28] if you have lots and lots of documents [32:30] uh then the problem is that you have to [32:32] fit all of them in the context but how [32:34] do you really get that uh to fit right [32:38] so one thing you can do is you first [32:41] encode uh things so that you get one [32:43] single representation or only the few sort [32:46] of top-level representations then you [32:48] concatenate those and then you just feed [32:50] them to the decoder so this is FiD [32:52] Fusion-in-Decoder um and as you can see [32:55] this scales to a much higher uh number [32:58] of passages uh and that uh leads to [33:01] corresponding improvements in uh the [33:04] scores that you care [33:06] about uh so that's a really cool idea [33:08] and so we're slowly moving [33:10] towards more decoder-only architectures [33:13] right so in RAG we have this BART model [33:15] it's sort of an encoder-decoder [33:16] architecture but here you just have this [33:18] decoder that does some fancy attention [33:21] over
stuff that you retrieved before um [33:24] and and so another like pure decoder [33:28] language model architecture um is this [33:31] one [33:32] kNN-LM which I think is very elegant in [33:35] its simplicity so it's basically you [33:37] just have a normal language model but uh [33:40] you interpolate the normal language [33:42] model weights with uh things that you [33:45] retrieved um so basically you have some [33:48] sort of prompt right so like Obama's [33:50] birthplace is you go to your big corpus [33:52] you find similar things you look at the [33:55] words that come next to the similar [33:57] things uh you uh rank that thing you [34:00] sample your top k you renormalize that [34:03] so now you have a bunch of scores and [34:05] now you can just interpolate between [34:07] your retrieved kind of non-parametric [34:10] memory scores and your parametric [34:12] language model scores so this is very [34:14] late fusion in a sense right at the [34:16] very end you combine these two uh and it [34:18] allows you to reweight the pure [34:20] language model probabilities or [34:22] likelihoods um so this works really well [34:25] and it scales especially well if you [34:27] have a huge uh retrieval corpus so if [34:30] you have trillions and trillions of [34:32] tokens in there you could have a much [34:34] smaller language model that does not do [34:36] that much heavy lifting because you can [34:37] really rely on this big source corpus [34:40] that you're working from and so that [34:42] idea was uh exploited by this paper [34:45] called RETRO out of DeepMind where uh [34:49] they showed that you can have a 25 times [34:51] smaller retrieval augmented language [34:53] model trained from scratch so really [34:55] pre-trained uh entirely from scratch [34:57] that outperforms this 25 times bigger uh [35:00] language model on the same data in terms [35:02] of perplexity which is pretty impressive [35:05] right so this architecture is much more [35:06]
efficient than a parametric model [35:09] because you can rely on this external [35:11] memory so if your external memory is big [35:13] enough uh you can get pretty huge gains [35:17] so there was a lot of excitement about [35:19] RETRO when it was announced uh but it's [35:21] a DeepMind paper so there's really no [35:23] open source nothing really to validate [35:26] that this actually works um and so very [35:29] recently there has been a bit of work [35:31] from Nvidia called [35:33] RETRO++ um where they have this hybrid [35:36] between the RETRO architecture and then [35:39] they do basically RAG sort of they put [35:41] the top one or the top-k results in the [35:44] context of the language model after all [35:46] so it's sort of a crossover between RAG [35:48] and RETRO and they show some really nice [35:51] results here but I think it's sort of [35:53] pointing to this uh big flaw I think is [35:56] that why is there still no good open [35:58] source RETRO [35:59] model that probably tells you something [36:02] about whether it actually really works [36:04] I spent a lot of time in my career [36:06] trying to reproduce DeepMind papers [36:08] that didn't necessarily always work uh [36:11] and so I think the same is true [36:14] for RETRO um and that's why we need to [36:17] do this in-context RAG on top of RETRO [36:19] to actually get it to [36:21] work but could it just be a true book [36:24] thing because you're searing onook [36:28] yeah but so [36:31] that no so doing retrieval [36:34] over that big corpus is not that [36:37] difficult actually yeah um so there are [36:40] even like distributed FAISS packages you [36:43] can just do everything yourself so yeah [36:46] so in terms of compute it's actually [36:48] not that hard anymore to reproduce [36:50] something like this uh but I've tried [36:53] several times and it's not really [36:55] reproducible [36:57] so the only way to get it to work is if
[36:58] you do this in-context RAG on top of the [37:00] RETRO thing and then as you can see here [37:02] in the results then it actually gives [37:04] you a gain over the pure GPT model right [37:06] so it starts from a GPT and then they [37:08] kind of retrofit as they call it the GPT [37:12] model so in short I think there's still [37:14] a lot of work to be done in pre-training [37:16] these systems really from scratch uh and [37:18] RETRO kind of showed that it might be [37:20] possible but we don't necessarily know [37:22] exactly how to do it the right way and [37:24] this is really one of the interesting [37:26] open [37:27] questions um any questions on [37:33] that [37:38] online no okay then we'll move on um so [37:45] um let's go all the way with the [37:47] contextualization now right so with [37:50] RETRO and with RAG what we actually did [37:53] is we only updated the query encoder uh [37:56] so updating the document encoder is very [38:00] expensive so one of the first papers [38:03] actually kind of the OG of the [38:05] non-frozen dense retrieval augmented [38:07] methods is this uh paper called REALM [38:10] this is really like visionary work this [38:12] was basically the first uh kind of [38:16] version that did this properly where [38:18] they updated it all the way including [38:20] the document encoder um so can [38:23] someone explain to me why it's expensive [38:25] to update the document [38:30] encoder so let's say we have a trillion [38:32] tokens in our corpus right and now so [38:36] now we go all the way so we basically do [38:38] a forward pass we get a gradient at the [38:40] end now we backpropagate the gradient [38:42] through the retriever we update the [38:44] query encoder now we have to update the [38:46] document encoder so what do we then need [38:48] to do after we've updated the document [38:50] encoder we need to re-encode the entire [38:53] internet right so basically every single [38:56] gradient
update we have to re-encode [38:58] whatever our index is which so if this [39:01] is like trillions of tokens it's like [39:02] re-encoding the internet after every [39:04] batch update so that's not very [39:12] efficient [39:15] change [39:17] Stuff AC have [39:20] some [39:23] predictable [39:25] yeah [39:27] yeah that's one way to do it uh so [39:29] so there are a bunch of different [39:30] ways to update the document encoder [39:33] so what they do in REALM is they [39:35] basically do it for T batches then they [39:39] stop they re-encode the entire internet [39:41] and then they train again uh so it's [39:43] sort of asynchronous updates they have [39:45] this very fancy sort of sharding [39:47] mechanisms where they take down uh [39:50] certain parts of their entire index uh [39:52] and then update them kind of on the fly [39:55] uh so you can do it it's just very [39:57] expensive so one of the things that [39:59] a lot of people have been thinking about [40:00] not exactly theora idea but similar [40:02] versions of that um are around like [40:06] can you make it more efficient so that [40:07] you don't have to do this [40:11] asynchronously um so one of the [40:13] downsides of this REALM uh architecture [40:16] is that it's really just a BERT model [40:18] but then you do this retrieval [40:19] augmentation on a BERT model with other [40:21] BERT models so it's not really [40:22] generative it's not really gen AI in the [40:25] modern paradigm but if you want to read [40:27] like one paper uh on this topic like [40:30] this is a very good one to [40:31] read uh the other one that is really [40:34] really good to read uh is this paper [40:37] called Atlas uh so Atlas is um uh so [40:41] this is out of FAIR um with a bunch of [40:44] folks the folks who did like RAG and the [40:46] folks who did FiD and uh really a [40:49] brilliant set of people and and this is [40:51] really a comprehensive uh analysis of [40:54]
everything that's happening in this [40:56] architecture so the first question they really [40:58] look at is how do we train this [41:00] retriever so we've seen a couple of [41:01] versions of this um but uh which one [41:05] actually works better they haven't [41:06] really been compared in a head-to-head [41:08] setting uh so one thing is we have this [41:10] FiD-style sort of attention distillation uh so [41:14] that's really too complicated to go uh [41:16] into detail here but the others are [41:18] actually very simple um so one is this [41:21] loss we've basically seen before right [41:24] uh so we've seen this I think with the [41:26] in-context RAG one right so we have a [41:28] stop gradient on the language model and [41:30] then we update the retriever the other [41:32] one is what we've seen with REPLUG so [41:35] this is basically exactly the REPLUG [41:37] loss right so we have the KL divergence [41:39] of the um the documents and sort of [41:43] the improvement that you see when you [41:44] give it that document uh the other thing [41:47] they have is basically the inverse of [41:49] that one so if I take this one document [41:52] out how does that affect my uh [41:55] perplexity of the language model right [41:58] um and so this one I think is actually [42:01] quite elegant because that really gets [42:03] to like how valuable is this one single [42:05] document for me answering this question [42:08] correctly um so uh they compare all of [42:12] these different versions and uh what you [42:14] can see is that uh the kind of [42:17] REPLUG-style loss and this leave-one-out [42:19] loss they perform a lot better than all [42:21] of these others so this fixed retriever [42:23] or no joint pre-training these are [42:25] really kind of the baseline sort of [42:27] frozen RAG models or closed-book uh and [42:30] as you can see you can do really a lot [42:32] better uh if you optimize things and so [42:35] this leave-one-out thing is probably the [42:38]
best I would say um so then the other [42:40] question is how do you actually like [42:42] train that entire system like what data [42:44] or what tasks do you train this on so [42:46] they also uh experiment with a bunch of [42:49] different versions uh so one is uh doing [42:52] prefix LM if you're familiar with that [42:54] uh so they basically take a chunk that [42:57] occurs somewhere on the internet and [42:59] then they predict the next chunk from [43:02] that chunk right so it's really like [43:04] sentence to sentence so maybe like [43:06] Skip-Thought back in the day but now you have [43:08] this retrieval step where you predict [43:09] the next sentence uh then they just do [43:13] T5-style sort of denoising so that's [43:15] masked language modeling if you're [43:16] familiar with T5 um and then they have [43:19] this title-to-section generation piece [43:21] so um I think the takeaway from this [43:23] table is basically that whatever you do [43:25] here so they're using T5 models so [43:28] whatever you do here needs to be the [43:29] same as what your uh language model expects [43:32] um so for T5 that's T5-style [43:35] loss um and then uh the next [43:39] sort of final question that they look [43:40] into going back to what we talked [43:42] about how exactly do we update this [43:45] retriever uh so do we have to update the [43:47] document encoder or do we maybe have to [43:50] do some sort of reranking uh or do we [43:52] maybe just update the query um and and [43:55] quite surprisingly I think they find [43:57] that just updating the query so like in [43:59] the original RAG paper is actually [44:01] already basically good enough in many [44:04] cases so that's nice because it's [44:07] much more efficient if you don't have to [44:08] update your documents all the time uh I [44:11] think the real question here though [44:13] is like uh how good is your document [44:15] representation to begin with so you need [44:18] to have very
very high quality embedding [44:20] model for this to work if you don't have [44:22] that then this will not work but if you [44:24] do have that then you get a very nice [44:26] kind of query-side fine-tuning [44:31] thing uh so the Atlas paper is about [44:35] trying to do few-shot um sort of language [44:38] modeling tasks so it's how many [44:40] examples are given in the [44:45] context um yeah so so the main takeaway [44:49] um here is that if you compare like the [44:51] closed-book equivalent model to the [44:53] retrieval augmented model uh you see [44:56] very big [44:58] improvements that's really the only [45:00] takeaway of this entire [45:02] section um but I think that that's [45:06] really saying something uh in terms of [45:09] what we should be thinking about um [45:11] how much time do I have [45:14] in [45:15] still okay okay all right other [45:21] questions are the documents in the [45:24] training step same as [45:29] yeah so they can be different um in so [45:33] in Atlas the Atlas paper basically tries [45:35] everything uh so they also try to see [45:37] what happens if I train this on [45:39] Wikipedia but I swap in like a sort of [45:42] Common Crawl index um and I think so [45:45] in Atlas but also in RETRO the main [45:47] finding is just the more the better uh [45:50] so it's really just like the bigger your [45:52] index the more likely you are to [45:54] find the exact right thing um and then [45:58] make the right [46:04] prediction any other questions on this [46:07] oh yeah uh sorry I this is a question [46:09] about the generator in the I guess uh [46:12] the RAG system so um recently I saw a [46:17] paper on Mistral 7B so it introduces a [46:20] lot of these uh new architectural [46:22] changes like the sliding window [46:23] attention to handle longer sequences [46:26] at a smaller cost and the grouped-query [46:28] attention for faster inference I'd like [46:30] I'd like to like know your thoughts on [46:33]
designing a generator specifically for [46:36] RAG uh leveraging for example where [46:38] Mistral 7B currently is because for [46:41] example like the sliding window [46:43] attention I could see how that could be [46:44] adapted to the RAG [46:47] case yeah so so maybe your read on sort [46:49] of what makes Mistral special is a bit [46:52] different from mine so I don't think [46:53] that the sliding attention window thing [46:55] is actually that interesting the reason [46:57] Mistral works so well is because it's [46:58] trained on a lot of data uh and you can [47:01] do that more efficiently because you [47:02] have sliding window attention so you [47:03] don't need to attend to everything um [47:07] but uh so to answer your question I [47:10] guess you're asking sort of about the [47:11] architecture of the generator if you [47:14] know that there's going to be a [47:15] retriever so I think uh that's [47:18] basically what RETRO tried to do right [47:20] so um RETRO actually some of the people [47:24] on the RETRO paper are at Mistral now uh [47:27] so they have this uh chunked cross [47:30] attention idea here so you basically [47:32] have a language model but the way it [47:34] does the attention over the things you [47:36] retrieve in your RETRO um architecture [47:41] uh they kind of get integrated [47:43] into a model not using the standard [47:45] attention mechanism but using this [47:48] slightly different chunked cross [47:50] attention oh okay so I think the [47:53] sliding window attention point I was [47:55] trying to get at was that uh it uses [47:57] a fixed window so that whenever you're [48:00] doing the query-key computation in the [48:02] attention with the query vectors and the [48:04] key vectors you're using a fixed window [48:07] attention so I think my idea was to [48:10] actually one use a dynamic window [48:13] because for example the RAG case um if [48:16] you use a fixed window when you're doing [48:18] attention
it is possible that you [48:21] actually are leaving you you're only [48:23] looking at a fixed uh span of [48:26] information so if you could maybe adapt [48:28] Mistral so that you could make it better [48:31] for the RAG case and for example [48:33] making the fixed window size a [48:35] dynamic window uh yeah yeah I think it's [48:39] an interesting idea so for me uh [48:42] what Mistral is doing with the [48:44] sliding window that's basically like a [48:46] convnet right so we had all these [48:48] convolutional like light convnets [48:51] where we would have word embeddings and [48:52] you would do convolutions over it and [48:54] then pool uh and then you would still [48:56] get the information out so it's not that [48:58] the sliding window prohibits you from [49:00] looking earlier it's just that that [49:02] happens higher up in your transformer [49:04] sort of yeah [49:07] yeah so I think that definitely is an [49:10] interesting direction to think in [49:12] yeah yeah so I think um it's like not [49:15] too crazy to say are there any [49:17] architectural changes that we can [49:19] introduce into these 7 billion parameter [49:21] models so that they could be better [49:23] adapted to the RAG case [49:27] yeah so there might be yeah [49:30] I think one question is just [49:32] how do you do the attention over [49:33] things you've retrieved which I think is [49:35] what [49:37] you're yeah [49:39] thanks so just to make sure I understand [49:42] so I mean in this RETRO model you're [49:45] retrieving in each [49:47] block and when you talk about putting [49:50] the retrieval in the context are you [49:53] saying that you only do it at the [49:54] beginning you don't do it [49:57] yeah so so in context so this is it's [50:00] not exactly every layer sort of so it's [50:02] every token right so every um every step [50:05] basically not every block so does that [50:09] make sense so it's not every
layer that [50:12] you do the retrieval yeah so every step [50:16] right um so so this is kind of like [50:19] what RAG-Token is so you retrieve every [50:21] token so you generate and then you [50:24] can retrieve again or in the case of [50:26] RETRO you can generate like a chunk and [50:28] then you retrieve chunks again uh if you [50:31] look at the in-context case you retrieve [50:33] once at the beginning and then you give [50:36] it you're say that during this [50:41] nobody yeah but so the so the in-context [50:44] thing um so so here you don't actually [50:48] give it as context at all like directly [50:51] to the model right so here you [50:53] let the decoder kind of attend over [50:56] it [51:02] yeah so I don't think cross attention [51:05] really works yeah [51:10] yeah other [51:13] questions yeah we [51:15] inside the the training of the retriever [51:18] is not so necessary because of the [51:21] large uh so I'm wondering what inside of [51:24] the T like what cases are really need [51:29] toiz update or anyway updates [51:34] those yeah so you do want to update the [51:36] retriever right but only part of the [51:38] retriever is necessary to be updated for [51:41] a lot of these cases um but so so [51:46] I think it uh so these are very [51:48] specific data sets right Natural [51:50] Questions Wizard of Wikipedia and FEVER [51:52] so they're really very uh kind of [51:54] knowledge-intensive tasks uh so in that [51:57] case if you already have a very good [51:59] system like DPR that is specifically [52:01] pre-trained for those tasks then you [52:04] only need to update the query encoder [52:06] but so I would expect that if you move [52:08] beyond this to kind of general language [52:10] modeling things like RETRO then you [52:13] probably do want to update the document [52:15] encoder at least in a way where you can [52:17] scale [52:18] it so that in the this part that is very [52:23] much in [52:33] as long as we
have a good opal knowledge [52:36] of what of the maybe the documents by [52:39] those uh good [52:43] models yeah but so you need to learn how [52:45] to kind of query into that index right [52:48] so if you don't do that uh [52:51] then yeah you don't get really good [52:53] performance so that's sort of like your [52:54] closed-book performance right if you just [52:57] have the language model and you're just [52:59] like what does the parametric model [53:01] on its own without the retrieval what [53:03] does it actually know as you can see [53:05] there are pretty big gaps there [53:11] right other questions otherwise I will [53:14] cover other [53:17] questions no uh hello yeah go for it a [53:21] quick question like so uh what about [53:24] like more hierarchical retrieval like I [53:26] suppose there will be methods trying to [53:28] not just retrieve a single chunk but [53:30] some kind of like groups of chunks or [53:31] something or summarized versions [53:34] there's been some interesting work on [53:36] doing that uh where you first try to [53:38] find so you can have multiple indices [53:40] and they can kind of cascade right so [53:41] first you want to find the relevant [53:43] document so you have some document [53:44] representation and then within that [53:46] document you want to find the relevant [53:48] chunk uh so you can do it sort of that [53:50] direction you can also do it in reverse [53:52] I think I have something on the slide [53:54] there where you can find the chunk and [53:56] then sort of expand uh the context [53:59] around it and then give that to the [54:00] language model um so I think yeah there [54:04] are all kinds of interesting things you [54:05] can do [54:07] there cool thanks I guess another [54:10] thing just like can you compare RAG [54:13] versus like long-context LM efforts so [54:16] there are a lot of things around [54:18] just having really long context and [54:20] extreme
it could replace RAG but I know [54:22] like if your takes yeah so so my uh [54:26] so everybody understands this question [54:28] right so there's a trend [54:30] where we want to have very long-context [54:32] language models so that basically you can [54:34] like take Harry Potter or something just [54:36] put it into context and then ask a [54:38] question like what is the name of like [54:40] Harry Potter's owl or something right [54:42] and then it can just attend over the [54:43] entire thing um so attending over all of [54:47] Harry Potter to answer that one question [54:49] is super inefficient right uh so most of [54:52] Harry Potter has nothing to do with the [54:54] owl uh so but you are still kind of [54:56] reading it if you do it with the long [54:58] context window um so that's why I think [55:01] doing it the RAG way where you have [55:02] this non-parametric component is a much [55:05] more efficient way to solve this problem [55:07] and if you actually look at the [55:09] literature on long context windows uh [55:11] the way they solve the problem of [55:14] scaling the attention mechanism is by [55:16] making it very sparse uh so they're [55:19] basically turning it so that's a [55:20] different kind of sparse but they're [55:22] turning it into a non-parametric [55:23] retrieval problem uh kind of behind the [55:26] scenes so they're not [55:27] actually all that different if you want [55:29] to scale long context then you're going [55:30] to move towards a RAG-style [55:34] architecture good [55:38] thanks all right um so let's talk about [55:41] some other interesting questions so one [55:44] thing and I already alluded to this is [55:47] when do we actually retrieve so if [55:49] we're doing like if we want to uh like [55:51] retrieve every token that's also very [55:54] inefficient because I probably don't [55:56] have to retrieve to generate [55:58] "the" right I can probably do that on my [56:00] own
with the language model it's sort of a [56:02] waste to go and retrieve stuff but if I [56:05] only retrieve once at the beginning of [56:07] the sequence that's probably also not [56:08] great right so what we ideally want [56:11] to be able to do is to say okay [56:13] sometimes I want to retrieve sometimes I [56:15] don't want to retrieve and I'm going to [56:16] learn when I want to kind of expend [56:19] the compute budget on doing the [56:21] retrieval um so a nice paper where they [56:24] have a stab at this is called FLARE for [56:26] active retrieval augmentation where they [56:28] basically have the language model decide [56:31] uh when it should do a search and what [56:33] it should search for um so I [56:37] think this fits in a general trend that [56:39] you can see in the field around kind of [56:41] agents right so we can talk a little bit [56:43] more about that too um so this other uh [56:47] question that I think we also kind [56:49] of covered already here is how do we [56:51] train this at scale right so we can do [56:52] these asynchronous updates we can do [56:54] rerankers we can do query-side only [56:57] there's this really nice paper uh which [56:59] is quite close I think to the idea you [57:01] proposed uh where you first use BM25 to [57:05] create a batch basically where [57:07] everything is very similar uh in terms [57:10] of what you've retrieved and now you uh [57:13] have this kind of in-batch update so [57:16] it's sort of like a ranker where you [57:17] encode the information that is just in [57:19] your batch using this other model and [57:22] now you can update this model on the fly [57:24] so you don't have to worry too much [57:25] about doing the full kind of [57:27] document-side update um and again here what [57:30] really matters is like how big is your [57:32] index if you have an amazing index you [57:33] can basically solve any problem just by [57:35] looking it up right so rather than
[57:38] cramming it into your parameters you can [57:40] just find it [57:43] um this is a really nice paper uh called [57:46] SILO so one of the interesting [57:48] things I think that's going to happen in [57:50] the next year or two around language [57:53] models is there and you've seen this [57:54] already there's a bunch of like lawsuits [57:56] against OpenAI and other places around [57:58] where does the data exactly come from um [58:02] so one uh very elegant solution I think [58:04] is to have a RAG system that you train [58:06] on data that you know is safe so you can [58:09] train that thing on Wikipedia but now [58:12] during test time you can give it a data [58:14] store that has maybe slightly riskier uh [58:17] information in it so this massive index [58:20] of all the stuff on the internet [58:21] including some things that are maybe um [58:25] risky uh you can still have them in your [58:27] index but your language model uh your [58:29] retrieval augmented language model I [58:31] should say you know that that thing is [58:33] safe because it was trained on data that [58:34] is public domain uh so that's what they [58:36] do in SILO and they show that that works [58:38] really well so that's uh one possible [58:42] solution to a lot of the kind of [58:44] compliance and legal risk around [58:45] language model [58:48] deployments um there's a great paper and [58:51] also from one of your colleagues um [58:54] around uh contexts getting lost in the [58:57] middle I think this is also kind of a [58:58] fascinating phenomenon this is on a [59:00] frozen RAG system um but uh language [59:05] models are very similar to humans in [59:07] what things they pay attention to so if [59:09] you give them a bunch of things that you [59:11] retrieved what they will look at [59:13] are like the first things you list and [59:15] the last things you list and they will [59:16] sort of ignore the middle um so if it [59:19] actually respected the rank
function [59:21] then this curve would go down all [59:23] the way right but it sort of goes up [59:26] um so I think that's a very [59:28] interesting observation which kind of [59:30] shows how brittle uh these [59:33] systems can be right so if you have a [59:35] frozen RAG system it can be very very [59:37] brittle where like the order of the [59:39] retrieved context matters a lot in whether [59:41] you get the right answer or [59:44] not work on treating this as re problem [59:48] sense [59:50] ofor like specifically going for [59:53] interpration out VOR that's going to [59:56] inter prodct with just the right maybe [60:00] you can tune for the particular [60:04] dat yeah so what I just described [60:06] someone asked like how do you [60:08] actually so I said there are other ways [60:10] to do this and then the question was how [60:12] do you do that so the way you do that is [60:13] using REINFORCE um so yeah there has [60:17] been work on doing that um so some of [60:20] the older papers were playing with this [60:21] but one of the big problems with uh [60:25] so I think the REPLUG solution is sort of [60:27] more elegant uh for solving that problem [60:31] because you actually use signal from [60:33] the language model and if you just do [60:34] REINFORCE it's very high variance so [60:36] you're uh it's going to be super [60:38] finicky if you don't want to destroy [60:40] your [60:42] index but people have tried it [60:47] though um so um uh there's some [60:51] really nice work from OpenAI where they [60:54] basically show and again [60:55] we're sort of like thinking more and [60:57] more about agents here right uh where [61:00] they show something very similar to the [61:02] FLARE result from earlier with active [61:03] retrieval that doesn't necessarily have [61:05] to be some index that you own it can be [61:07] just some web search right um and [61:10] obviously in this case you
don't really [61:12] have access to the web search [61:13] necessarily, so Bing or whatever they use [61:15] here is not going to update its [61:17] parameters. But I just wanted to kind [61:19] of put this in your mind: this is [61:21] another thing you can do, right? And if we [61:24] take this really to the general form, [61:27] then you can think of language models as [61:29] just tool users. So rather than just [61:32] retrieval-augmenting language models, we [61:34] can tool-augment language models, and [61:36] retrieval is just one of the many tools [61:38] that language models have access to. We [61:40] can have rankers and things on top of [61:43] the outputs of these tools. And so one [61:45] of the big questions, I think, is [61:48] how do you actually get the system to [61:50] learn stuff, right? So we're going to need [61:52] RL if we want this system to [61:54] really learn how to take these [61:55] actions [61:57] properly. [61:58] And so yeah, this has been [62:01] taken to the extreme in this sort [62:04] of Self-RAG architecture, where they have [62:06] this sort of retrieval step and it's [62:07] active, and then you criticize it, and [62:09] then you basically do some natural [62:11] language inference, and all of that [62:13] just with one language model to answer [62:16] the [62:17] questions. So the other missing piece, [62:20] and I'm just kind of going through a [62:22] bunch of open questions that [62:24] people have looked at, but feel free [62:26] to interrupt me if there's anything you [62:27] want to know, is instruction [62:30] tuning. We established at the beginning [62:32] of the lecture that this is pretty [62:33] important for getting things to work, so [62:35] fixing the user interface. But the [62:39] instruction tuning has almost always [62:41] only happened on the language model and [62:43] not on the entire system. So I think one [62:45] of the interesting
things that people [62:47] are looking at now with things like [62:49] RA-DIT and InstructRetro is how can we [62:51] instruction fine-tune an entire retrieval [62:53] augmented system, so all the way into the [62:55] retrieval step, can we generate data so [62:58] that that also follows the instructions [63:00] properly, which currently doesn't happen [63:02] in any of these model [63:04] architectures. And then finally, I [63:07] think I would be remiss if I didn't [63:09] really talk about what people call [63:11] advanced RAG. So the developer [63:13] community has been really doing some [63:15] awesome stuff, so frameworks like [63:18] LlamaIndex and LangChain, and there's [63:19] all these open source vector databases [63:21] like Chroma and Weaviate, and they're all sort [63:24] of about making RAG really easy. But this [63:26] is all frozen RAG, right? But even with [63:29] frozen RAG you can really do incredible [63:31] things. So we mentioned some of [63:34] these already: child-parent recursive [63:36] retriever, so you find small parts [63:38] and then you give the big parts around [63:40] it to the language model. You can do [63:42] hybrid search where we use reciprocal [63:44] rank fusion, so we have different [63:45] search results that we then combine [63:48] before we give the final thing to the [63:49] language model. There's the zero-shot [63:52] large language model ranker, so basically [63:54] the score function doesn't come [63:56] from your retrieval, it comes directly [63:58] from the language model. And then [64:01] hypothetical document embeddings, which I [64:02] think is a really cool idea, so you just [64:05] basically fix hallucination [64:07] through hallucination. So you get a [64:10] question, then you let the language model [64:12] hallucinate a bunch of possible answers, [64:14] then you go and search for nearest [64:16] neighbors to the possible answers and [64:17] you give those as context
and then it [64:19] gives the right answer based on that, [64:21] right? So it's really like hallucinating [64:23] answers, and I think it's a brilliant [64:26] solution. So there's a lot of stuff [64:28] happening in the kind of frozen RAG [64:31] community that I think is very [64:33] interesting to look at. So just to [64:37] wrap up, kind of looking at the future of [64:40] this stuff, there are still lots of [64:42] very interesting open questions, so if [64:44] you're a student thinking about how to [64:46] solve any of these I think you can have [64:49] quite a lot of impact. So how [64:53] exactly do we do pre-training of [64:55] this architecture, and do we even need to [64:56] pre-train? I think even RETRO kind of [64:59] shows that you don't necessarily have to [65:00] pre-train, but maybe there's something [65:02] wrong with how we do that. What [65:05] do scaling laws look like? So I think [65:07] there's a really interesting question [65:08] here around, if I have a huge index and a [65:11] very rich encoder of all the information [65:13] in that index, maybe I can [65:16] basically decouple all the memorization [65:18] to this index, so I have a language model [65:20] that doesn't know anything, it just [65:22] speaks English, it just sort of reasons on top [65:24] but it has no knowledge, because that [65:26] always comes from this retriever. If you [65:28] can do something like that then you get [65:29] very interesting scaling tradeoffs, right? [65:31] So you can have a tiny language model [65:33] and do a lot [65:36] of the heavy lifting with your retrieval, [65:38] which is nice because that's a cached [65:40] computation, right? So [65:42] you already have the embeddings, you just [65:44] need to do the dot product, so it's much [65:46] more efficient than kind of self [65:48] attention in the language model. Can [65:51] we move beyond bi-encoders? So vector [65:53] databases,
I like people who build [65:56] vector databases, but I'm not sure how [65:58] long we're going to keep vector [66:00] databases, because I think [66:04] rerankers probably work just as well, and [66:06] BM25 is much more efficient than a [66:08] vector database. So I don't really [66:13] see why we need dedicated vector [66:15] databases. And so what we're seeing, but [66:17] maybe this is a bit of a critique of [66:20] maybe Silicon Valley investment [66:22] strategies and things like that, is that a [66:23] lot of these [66:24] vector database companies are [66:27] basically becoming database companies [66:28] now, so they are adding all this sparse [66:30] stuff because the dense thing is not [66:32] enough. And as it turns out, there are [66:34] a lot of pretty good sparse databases [66:38] out there already, like Postgres and [66:39] things like that, and they're also all [66:41] adding vectors to their databases. So [66:45] I think that's all going to kind of [66:46] coalesce into [66:50] databases. So I think there are some [66:54] interesting things to look at for kind [66:56] of the data, so alluding to this [66:57] instruction problem: can we generate much [67:00] better data for training RAG systems [67:03] synthetically? And then I think [67:05] there's this massive open question [67:06] around how we actually measure whether [67:08] the RAG system is any good. So right now [67:10] we just look at downstream performance, [67:13] which is sort of okay, but if you [67:15] mess up the retrieval it's very hard to [67:17] measure. But how to measure [67:20] whether your retrieval is right is also [67:22] very difficult. So there are some [67:23] frameworks where they try to take like [67:25] the harmonic mean of your retrieval [67:27] accuracy and your language model [67:29] accuracy, but I think those are also [67:31] very shaky because we don't really have [67:33] very good datasets to measure that [67:35] on. So I think
that's a very cool [67:37] problem to work on as well. So the [67:41] other problem that I personally am [67:43] always very excited about is [67:45] multimodality. And so why would we [67:48] stop with RAG systems with just text, [67:51] right? So you can do the same thing with [67:53] images. You can augment language [67:55] models with vision, so we did this work [67:57] on LENS, where we have a language model [68:00] enhanced to see, where you can just [68:02] give kind of a computer vision pipeline, [68:05] just like a retrieval pipeline, and give [68:07] that to a frozen language model and pass [68:09] it to the context, and that system [68:11] actually is an amazing visual question [68:13] answering system. It's close to [68:15] state-of-the-art, sort of Flamingo [68:17] from DeepMind, which is also very hard [68:19] to reproduce because there's no open [68:21] source version of that. [68:24] So we've done some early work on this [68:26] in 2021, where we have this cross [68:29] modal retrieval, and there's some more [68:32] recent work out of FAIR where they also [68:34] look at this. So I think that's really, [68:36] if you look at the trend in the [68:37] field, multimodality with GPT-4V and [68:40] things like that is really a hot topic, [68:41] so everything is kind of going in that [68:43] direction. So it's an interesting [68:45] thing to think [68:47] about. So overall I think it would [68:51] be nice if everybody sort of moves away [68:53] from RAG 1.0, the frozen Frankenstein [68:56] RAG, and moves towards this much more [68:58] kind of optimized version, RAG 2.0. So [69:01] it's really about systems over models, [69:03] right? It's not just your language model [69:05] and your retriever and they're kind of [69:06] separate, it's about thinking [69:08] from a systems perspective about the [69:10] entire thing and the problem you're [69:11] trying to solve. And so I think that [69:14] really is the way that in
deep learning [69:16] things have always progressed, where if [69:17] you optimize the system end to end [69:20] that's always going to win out. Like back [69:21] in the day in computer vision or NLP we [69:23] had like parsers and scam parsers and [69:25] all this kind of stuff, and all that just [69:27] doesn't exist anymore now because we [69:30] optimize the system end to end. So [69:32] that's what's going to happen here too. [69:35] So if we take that to the extreme, like [69:36] there's a chunker thing in your [69:38] documents, right, like cutting it up [69:39] into pieces, like you could backprop into [69:41] that, like why not? Somebody should really [69:44] do that. And so yeah, I think like [69:48] trading off cost and quality and zero [69:50] shot domain generalization, that's really [69:52] like where this stuff is going to come [69:53] in. So language models right now, they're [69:55] amazing, but very often they're way too [69:57] expensive for being deployed somewhere [69:59] where you can actually make money from [70:01] them if you're in a company. So what [70:03] you want to do is make it much more [70:05] efficient and have the right cost [70:07] quality tradeoff, and the easiest way [70:09] I can think of is to do it through [70:10] retrieval augmentation, but obviously I'm [70:12] very biased. So yeah, that [70:16] was all I had actually. So if you're [70:18] interested in this, I'm at Stanford, [70:20] so I can work with you on research [70:23] projects on these topics, or if you want [70:25] you can also join Contextual, because we [70:27] work on this stuff every day. Thank [70:30] you. Well, sorry, I had a question from [70:35] earlier. Yeah, I think you said something [70:37] really, I think really super [70:40] helpful earlier about Mistral 7B. You talked [70:42] about, you compared the sliding window [70:44] attention to convolutional neural [70:46] networks, and I do see the parallel [70:48] because with
convolutional neural [70:49] networks you have [70:51] several different layers of [70:52] convolutional layers, and the top [70:54] convolution layers are able to see [70:57] a larger receptive field than the bottom [70:58] convolution layers. And with [71:01] convolution layers you're able to tune [71:03] the filter sizes and the stride, so [71:07] you're able to see a different receptive [71:09] field. And I was wondering if you could [71:11] see that same innovation in Mistral 7B by [71:14] tuning, because you have different [71:16] Transformer layers and each Transformer [71:18] layer will have a span over a different [71:19] set of tokens, and if you can tune, I [71:21] guess, the Transformer architecture the [71:23] way you tune those convolution layers, [71:25] the filter sizes, the receptive field, [71:27] perhaps we can do some optimization in [71:29] the Transformer realm that we have [71:31] already done in convolution layers. Yeah, [71:34] I think that's a good idea. [71:36] There's a great paper on light [71:38] convolutions, I think from Michael Auli [71:40] and David Grangier and a bunch of people, where [71:43] it's basically, this came out at [71:46] exactly the same time as the Transformer, [71:48] and the Transformer is slightly more [71:49] optimized for GPU computation, but the [71:52] convolutional model was actually [71:54] slightly better than the Transformer. [71:57] So it's definitely worth exploring. [72:00] Okay, cool, [72:04] thanks. advant the re ranker [72:07] with that does that [72:12] advantages TR that Yeah, so it depends on [72:15] the problem. I think what you [72:17] probably want to do is sort of cast a [72:19] wide net with BM25 and then just narrow [72:23] it down with dense search. So you [72:25] often see that kind of as a two-stage [72:27] process where the first one is kind of [72:28] noisy, you can add noise actually to your [72:31] retrieval, and then you use the dense one
[72:33] to filter it [72:35] down. Yeah, everyone's trying to maybe [72:39] adapt their models to their [72:42] own domain-specific area. Like I think [72:46] there are many two ways project one way [72:48] is to use instruction tuning in learning way [72:52] or B tuning like [72:53] meth and another way is just the main [72:56] topic of this lecture, is using retrieval or [73:01] so I wonder, besides the low-cost [73:03] advantage of the retrieval-augmented way, do you think [73:07] the capacity or the quality of augmented [73:11] can be with those [73:13] T learning Yeah, so I think actually [73:17] what's going to happen is that all [73:19] of this will come together, right? So [73:22] if you train things like end to end, RAG [73:25] 2.0 style, then you can also fine-tune [73:27] that system on some use case end to [73:30] end, right? So why would you just [73:33] take the retrieval-augmented system if [73:35] you can also fine-tune it on the thing you [73:37] care about? So I think in the end [73:38] everybody's going to do all of those [73:40] things, and then there's questions like [73:42] how do you do that efficiently, so that's [73:43] why you would use adapters or things like [73:48] that. Think there was another [73:52] question. I'm curious about hardware. You [73:54] say it's going to become database kind [73:56] of thing respons database but what about [74:00] retrieval hardware and you SM because [74:05] we've thought so much of the you know [74:07] the Le part but what about because it's [74:11] hug trillions said so you have any ideas [74:15] just a database problem So I don't know [74:17] if I'm allowed to say this exactly [74:19] actually, but one of the [74:23] biggest chip manufacturers that recently [74:26] their stock has done really well, they [74:27] have some dedicated retrieval hardware [74:30] coming out, I think soon, or it might [74:31] already be [74:33] out. So yeah, that [74:37] like very efficient dense retrieval [74:40] is a very big [74:46]
business. are [74:51] questions Sol [74:58] Yes, I think so. If you take [75:01] it to the extreme, so one of the big [75:03] problems right now is that if you [75:05] contextualize an existing language model [75:07] that already [75:08] hallucinates, then it's going to be [75:10] kind of hard to get rid of the [75:11] hallucination, right? So if you do REPLUG [75:13] on [75:14] GPT-4, GPT-4 might still hallucinate, so [75:18] it could basically just ignore all the [75:19] stuff you retrieved and just do whatever [75:21] it wants anyway. So that's one of the [75:23] reasons why you want to train the system [75:25] end to end. And if you take that to the [75:26] extreme, where like I said, right, if you [75:28] can just have the language model only [75:31] reason and speak, so it knows English and [75:33] reasoning but it has no knowledge, which [75:35] all comes from somewhere else, then [75:38] you can't hallucinate, so it's really all [75:40] grounded in whatever is in your [75:47] index. But there's... so [75:49] about hallucination, I'm sort of [75:51] frustrated that a lot of people in the [75:53] field misunderstand what hallucination [75:55] even means, right? So a lot of people are [75:57] conflating hallucination with [75:58] correctness or incorrectness, so they're [76:00] like, oh, the model made a mistake, it [76:02] hallucinated. It's like, no, it made a [76:04] mistake, that's different from [76:06] hallucination. Hallucination I think is a [76:07] very specific kind of thing: I retrieved [76:10] something, so I have some sort of [76:11] counterfactual ground truth, and what I'm [76:14] saying does not correspond to that [76:16] ground [76:17] truth. And so yeah, I think there's a [76:22] bunch of folks at Stanford also [76:23] working on better measurements of [76:25] hallucination and definitions and things [76:27] like [76:30] that. If I'm understanding correctly, your definition of [76:33] hallucination only makes sense in the [76:36] context... Yeah, of some ground truth,
right? So [76:40] hallucination is really like, there [76:43] is something that is true, right? So [76:45] if we're talking about [76:47] hallucination, yeah, so if we're talking [76:48] about just general parametric language [76:50] models, then sort of the ground truth is [76:52] whatever we can consider to be true, [76:56] right? But we had a word for [76:59] language models making mistakes before: [77:01] it was called making [77:06] mistakes. yeah [77:08] ground I guess you're solving the house [77:12] question on that path are you working on [77:15] on [77:17] ground you [77:19] know never been president everything [77:26] this Yeah, so I like the sort of [77:29] silo mention there as well. So I think [77:32] the whole point is that you can [77:35] have different indices and different [77:36] definitions of ground truth, and so [77:39] I think you could say I only trust the [77:42] arXiv, or I only trust peer-reviewed [77:44] papers and not just arXiv, and so [77:47] you can make decisions in your [77:49] architecture during test time about what [77:50] you define as ground truth. [77:53] And I also think actually that, and [77:57] there's a bunch of work I think [77:58] happening on this right now, you can [77:59] control for how grounded you want to [78:01] be in your ground truth. So that's [78:05] another kind of misconception about [78:06] hallucinations: sometimes [78:08] hallucinations are actually good, right? [78:10] If you have a creative writing assistant [78:11] and you want it to come up with some cool [78:13] new ideas, you want the language model to [78:15] hallucinate. So I think what you [78:18] want to have is kind of a tunable knob [78:19] where you say, now you can [78:21] hallucinate, and now maybe you should [78:22] really tell me the truth [78:30] only. anything [78:38] else control [78:41] yeah Yeah, so but the temperature, that's [78:44] just about how you sample, right? So how [78:46]
flat your distribution is. [78:50] sample [78:51] yeah [78:53] Yes, but so even if you have a low [78:55] temperature it can still come up with [78:57] random stuff, right? So it just says that [79:00] then you're very likely to do like [79:01] greedy sampling. So I think what [79:05] you want to get at is something more [79:07] sophisticated than [79:14] that. Lots of interesting questions. Yeah, [79:17] I like the question. Thanks again for the [79:19] great [79:21] than
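
[Editor's sketch, not part of the talk.] The closing exchange about temperature can be illustrated with a few lines of Python. This is a minimal sketch under stated assumptions: the function name `softmax_with_temperature` and the three example logits are hypothetical, chosen only for illustration. It shows that temperature only sharpens or flattens the distribution the model already produces, which is why a low temperature approaches greedy sampling without saying anything about whether the top-scoring token is grounded.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw next-token scores into probabilities.

    Dividing the scores by the temperature before the softmax
    sharpens the distribution (temperature < 1) or flattens it
    (temperature > 1). The ranking of the tokens never changes.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the maximum for numerical stability.
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate tokens.
logits = [2.0, 1.0, 0.0]

for t in (0.1, 1.0, 10.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At a temperature of 0.1 nearly all of the probability mass sits on the highest-scoring token, so sampling behaves almost like greedy decoding; at 10.0 the three tokens are close to equally likely. In both cases the model's preferred token is the same, so if that token is wrong, lowering the temperature only makes the model pick it more consistently.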