[0:05] hey guys welcome to our last lecture
[0:08] um of this quarter and we're very happy
[0:12] to have Douwe Kiela here he's the CEO of
[0:16] Contextual AI um the enterprise LLM
[0:19] company as well as an Adjunct professor
[0:22] in symbolic systems here at Stanford and
[0:25] previously he was the head of research
[0:26] at Hugging Face and before that a
[0:28] research scientist at Facebook AI Research
[0:32] uh he received his PhD and masters from
[0:34] the University of Cambridge um as well
[0:36] as a master in logic from the University
[0:38] of Amsterdam and studied philosophy and
[0:40] cognitive AI in undergrad um and his
[0:43] work focuses on machine learning as well
[0:45] as NLP specifically on developing better
[0:48] models for language understanding and
[0:50] generation and better tools for
[0:52] evaluation and benchmarking yeah give it up for
[0:56] Douwe
[0:58] right thank you so I guess I have to
[1:00] sort of stand here in the corner so
[1:02] people can see me on the zoom as well um
[1:06] yeah thanks so much for having me here
[1:09] um so I asked Steph what I should talk
[1:11] about there were a couple of things I
[1:13] could talk about multimodality or
[1:15] evaluation uh and this was the preferred
[1:18] topic I guess uh because the others were
[1:20] already covered um so yeah I'm I'm very
[1:24] happy to talk to you about everything
[1:25] retrieval augmentation um I think this
[1:28] is really one of the topics right now in
[1:31] our field um so I I'll just give you an
[1:34] overview of what's been happening and
[1:36] what I think are the interesting
[1:37] questions to think about um so first of
[1:41] all obviously in case you've missed it
[1:43] we are in the age of language models um
[1:46] and I just wanted to do a quick poll
[1:48] here in this not not super big audience
[1:51] I guess there's more people on the zoom
[1:52] but uh who invented language
[1:55] models if you thought OpenAI then
[1:58] I'm angry with you right so actually
[2:01] uh this is a very very old idea so the
[2:04] idea is just you you take a sequence and
[2:06] you factorize out the token
[2:07] probabilities right and um so it wasn't
[2:11] invented by OpenAI it's not like a few
[2:14] years old it's actually several decades
[2:16] old uh so I'm bringing this up because I
[2:19] was talking to someone and they were
[2:20] like OpenAI invented language models
[2:22] and I was like you're kidding me right
[2:24] um so I went back to the
[2:27] literature and this is the oldest one I
[2:29] could find actually 1991 first neural
[2:31] language model um there's a very nice
[2:34] paper from 2003 from
[2:36] Bengio where they actually have like
[2:38] word embeddings and everything already
[2:40] in there uh so obviously these are LMs
[2:43] not LLMs um and as it turns out if you
[2:46] make them really big and you
[2:48] parameterize them with these massive uh
[2:50] neural Nets then you get something
[2:52] really powerful that really shows
[2:53] emergent uh properties right and that's
[2:55] why we're all excited about this stuff um
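Written out, the factorization mentioned above is just the chain rule of probability applied to a token sequence. The symbols x_1 ... x_T (the tokens) and theta (the model parameters) are notation assumed here for illustration; they do not appear in the talk.

```latex
p_\theta(x_1, \dots, x_T) \;=\; \prod_{t=1}^{T} p_\theta\!\left(x_t \mid x_1, \dots, x_{t-1}\right)
```

Any model that parameterizes the conditional on the right-hand side is a language model in this sense, whether it is a small 1991-era network or a modern LLM; next-token prediction is the task of estimating that conditional.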
[2:59] so if we think about this from like a
[3:01] classic CS perspective there's input
[3:03] output right there's this kind of thing
[3:05] in the middle it's the generator so uh
[3:07] we take a sequence the input sequence
[3:10] and then the the task of the model is to
[3:12] predict the next token very very simple
[3:15] model um and and so you know that's why
[3:18] it was so easy to come up with this in
[3:20] 1991 already because it's like the idea
[3:22] is very intuitive but for a long time
[3:25] what was really broken with this was the
[3:27] user interface um and and this I think a
[3:31] lot of people kind of misunderstand what
[3:33] ChatGPT was about that's really what
[3:35] ChatGPT fixed right so
[3:38] initially you had to come up with these
[3:39] very weird prompts in order to get your
[3:41] language model to do what you wanted it
[3:43] to do uh and humans are terrible at this
[3:46] right so so we're much better at sort of
[3:48] telling people or things around us what
[3:50] we want right so if we have a dog we say
[3:52] sit we don't prompt it in a very weird
[3:54] way so that it sits right and it's the
[3:57] same with the language model if you
[3:58] want it to generate some rap lyrics in the
[4:01] style of a pirate or Shakespeare or
[4:03] something then you tell it generate some
[4:05] rap lyrics in the style of a pirate right
[4:07] so that kind of instruction data
[4:10] actually turns out to be super super
[4:12] rare in just web data so what you need
[4:14] to do is you need to fix the user
[4:16] interface to the language model and the
[4:18] the classic recipe for doing that is the
[4:21] sequence basically that ChatGPT
[4:22] used right so you prompt the model in a
[4:24] specific way you instruction fine-tune the
[4:26] model and you do some alignment RLHF
[4:29] uh whatever you do on top of that so
[4:31] that's the first thing so now you have a
[4:33] working language model with a working
[4:36] user interface so are we done then um
[4:40] obviously we're not right so so right
[4:42] now language models are kind of
[4:43] taking the world by storm but if you
[4:45] talk to anyone especially in an
[4:47] enterprise for example where they have
[4:48] very strict uh accuracy requirements
[4:51] they will tell you that they can't
[4:53] really productionize this yet um and the
[4:55] reason is because there are all these
[4:57] familiar problems probably a bunch of
[4:58] you are working on these problems right
[5:00] now uh around
[5:02] hallucination um so these models they
[5:04] kind of make up stuff very often with
[5:05] very high confidence which is uh even
[5:08] more scary in a way attribution so we
[5:11] don't really know why these models are
[5:12] saying what they're saying staleness
[5:15] they go out of date and so this was a
[5:17] big problem with sort of ChatGPT not
[5:19] knowing anything that happened after a
[5:20] certain cutoff date and they keep
[5:22] updating it every once in a while but
[5:24] you want to have a system that's always
[5:25] completely up to date that never goes
[5:27] stale um you want to be able to
[5:29] revise the information in the system so
[5:32] uh if you're uh a European organization
[5:34] you have to worry about GDPR uh which
[5:36] means that you need to be able to remove
[5:38] information from the language model or
[5:40] maybe revise facts uh which we don't
[5:42] really know how to do right so again
[5:44] this is a very interesting uh area of
[5:46] study for a lot of folks model editing
[5:49] um but so this is something that we
[5:51] really want to be able to fix and then
[5:53] there's this big question of how do you
[5:55] customize these models uh so different
[5:58] people have different use cases you have
[6:00] different data if you're a company or if
[6:01] you want to have a language model on
[6:03] your own data how do you make it work on
[6:05] your own data so one of the solutions uh
[6:08] that everybody has started using right
[6:11] now is to couple it to an external
[6:12] memory so that's really just RAG right
[6:15] uh this whole lecture is
[6:17] basically about RAG uh but the way to
[6:20] understand uh what is going on here is
[6:23] uh we have this generator just like
[6:25] before we have the input and a prompt just
[6:27] like before but now uh instead of just
[6:29] those two things we give this additional
[6:32] context so we contextualize the language
[6:34] model using things we retrieve and and
[6:37] the retriever uh is very often pretty
[6:40] simple it's just a query and a document
[6:42] encoder um and then you get a bunch of
[6:44] documents you give them as context
[6:46] to the model so super simple
[6:49] architecture um and I think it's useful
[6:53] to think about it from the perspective
[6:54] of of these two separate paradigms uh so
[6:57] if you've ever taken an exam I'm sure
[6:59] you have right uh you can have a closed
[7:01] book exam where you have to memorize all
[7:03] of this so you have to cram all the
[7:04] knowledge into your parameters your
[7:07] neurons uh or you have an open book exam
[7:09] where you have all of this information
[7:11] in the book that you can access when you
[7:12] do the exam uh so it's a very similar
[7:15] thing with RAG right you can just make
[7:17] it an open book setting where you give
[7:18] it access to this external information
[7:21] Wikipedia or something else or basically
[7:23] the entire internet uh and then have the
[7:26] language model do its job without having
[7:27] to memorize all of it in its
[7:30] parameters um so the other I think
[7:32] useful distinction here is that uh
[7:35] cramming everything into your parameters
[7:36] that's the parametric approach right so
[7:39] uh what we're doing with RAG is we're
[7:40] adding this non-parametric retrieval
[7:43] component um so uh you might call this
[7:45] semi-parametric um if you want to give
[7:48] this a
[7:49] name all right so why why does that
[7:52] actually solve these issues and so the
[7:55] answer is basically that if you have
[7:57] this separate index right this separate
[7:59] retriever you can swap it in you can
[8:01] swap it out you can replace it with a
[8:03] new index so you can really customize it
[8:06] and so you can customize your language
[8:07] model system for what the user really
[8:10] wants to see um and then obviously you
[8:13] can update this index so um it doesn't
[8:15] really go stale and you can revise it if
[8:17] anything goes
[8:20] wrong uh the other thing you get is
[8:22] grounding right so that that's initially
[8:24] why I became interested in this kind of
[8:26] architecture because I was thinking a
[8:27] lot about grounding and multimodal and
[8:29] things like that and actually one really
[8:31] nice way to ground things is to find
[8:33] some other information that you can
[8:35] ground your generation in so you really
[8:37] want the language model to only say
[8:39] things that it has evidence for in this
[8:41] other piece of text or even multimodal
[8:44] data that it retrieves separately so if
[8:46] you do that then you get less
[8:47] hallucination because you can always
[8:49] point back to your Source it's always
[8:50] grounded in your Source um and you get
[8:53] attribution because you know why the
[8:54] model is saying what it's saying it's
[8:56] because it found this thing here is
[8:59] that
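The architecture described above (retrieve some documents, put them in the context, let the generator answer, and swap the index when you want different knowledge) can be sketched in a few lines. This is a toy illustration only: the word-overlap retriever, the function names `retrieve` and `build_prompt`, and the two made-up indices are all assumptions for the sake of the example, and no actual language model is called.

```python
from collections import Counter

def tokenize(text):
    # Lowercase whitespace tokenization; a real system would use a proper tokenizer.
    return text.lower().split()

def retrieve(query, index, k=2):
    """Toy retriever: rank documents by how many query tokens they share."""
    q = Counter(tokenize(query))

    def overlap(doc):
        d = Counter(tokenize(doc))
        return sum(min(q[w], d[w]) for w in q)

    return sorted(index, key=overlap, reverse=True)[:k]

def build_prompt(query, index, k=2):
    """Assemble the context-augmented prompt that would be sent to the generator."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, index, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

# Two interchangeable, made-up indices. Swapping one for the other changes what
# the generator sees without touching any model parameters.
index_v1 = ["The cafeteria closes at 5pm.", "The library has 3 floors."]
index_v2 = ["The cafeteria closes at 8pm.", "The library has 3 floors."]

prompt_v1 = build_prompt("When does the cafeteria close?", index_v1, k=1)
prompt_v2 = build_prompt("When does the cafeteria close?", index_v2, k=1)
```

The point of the sketch is the last four lines: updating or revising knowledge is an edit to the index, and the retrieved text that ends up in the prompt is also what you would point back to for attribution.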
[9:02] all right so um for the rest of this
[9:05] lecture we're going to talk about this
[9:06] this basic architecture um and so it
[9:10] kind of looks like a pretty simple thing
[9:12] right uh but there are actually lots and
[9:13] lots of questions you can ask about what
[9:16] what this system should really look like
[9:18] um and like this this doesn't even cover
[9:20] like half the questions you can ask so
[9:23] it really is about how how do we
[9:25] optimize this entire system right so we
[9:28] have the separate components the
[9:29] retriever the generator and then um
[9:32] there are things like this query encoder
[9:34] how do we encode queries how do we uh do
[9:37] the retrieval do we update the document
[9:39] encoder how do we actually uh define a
[9:42] document right is it like a full
[9:44] document or is it a paragraph or a chunk
[9:46] or a sentence or a couple of words um so
[9:48] there are lots of questions to ask and
[9:51] and uh as you'll see there are lots of
[9:53] possible answers to these questions as
[9:55] well um so this is what we'll we'll
[9:58] cover
[10:00] um so there are lots of
[10:03] architectures um going into these
[10:05] questions and I think as we go through
[10:07] them it's useful for you to think about
[10:10] what happens during training time and
[10:12] what happens during test time right so
[10:14] during training time is really uh okay
[10:16] we have this language model we have this
[10:18] retriever um which one do we update how
[10:21] do we update them how do we train this
[10:23] entire system do we maybe not train it
[10:25] at all uh do we pre-train it from
[10:28] scratch do we initialize it with uh
[10:30] components that were already separately
[10:32] trained these are the kinds of questions
[10:34] that that you have to answer if you want
[10:35] to design a system like this and then
[10:38] during test time uh you have this entire
[10:41] system right so actually multiple models
[10:43] in a way uh that are working together um
[10:47] so so there's also different things you
[10:48] can do there right so give it different
[10:50] indices during test time or uh
[10:52] manipulate kind of how you're sampling
[10:54] things like
[10:55] that so um the starting point for all of
[10:59] this stuff I think if you ask someone
[11:00] now like what is RAG they will think of
[11:02] this thing um so this is frozen RAG
[11:06] basically uh there's no training here at
[11:09] all so going back to this question of
[11:10] train time test time there's only test
[11:12] time here train time happens separately
[11:14] with these kind of black-box models that
[11:16] we don't necessarily have control over
[11:18] right so there's this document embedding
[11:20] model uh whatever is currently at the
[11:23] top of some open source uh leaderboard
[11:26] uh you use that to oop sorry uh to get
[11:29] some vectors that you then use to create
[11:32] this vector database and then the vector
[11:34] database just does search and it gives
[11:36] the information from the search to the
[11:38] language model and it just passes it as
[11:41] uh as the context right so this
[11:44] only works because of in-context
[11:46] learning um and you know I think as a
[11:50] a machine learner myself this feels very
[11:52] inelegant um so what what this lecture
[11:55] is about is can we do better than than
[11:57] this Frozen
[11:59] thing um so let's let's start from the
[12:03] the left side of this like okay if we
[12:05] want to outperform this Frozen thing
[12:07] itself with just the vector database
[12:09] like what would that look like from a
[12:11] retrieval
[12:12] perspective um and the starting point
[12:15] for everything retrieval is TF-IDF
[12:17] does everybody know what TF-IDF is no
[12:22] okay so TF-IDF is basically a sparse
[12:25] retrieval method where you have a score
[12:27] function uh that that looks at documents
[12:30] and queries so D and Q and then there
[12:33] are basically two terms that matter one
[12:35] is the TF the term frequency and the
[12:37] other is the IDF the inverse document
[12:39] frequency so this inverse document
[12:41] frequency is actually a really nice idea
[12:43] from Karen Spärck Jones really underrated
[12:45] researcher she's done some amazing work
[12:48] um but the basic idea is that you want
[12:51] to look at the words that are very
[12:53] special so that don't occur in lots of
[12:54] different documents and so the overlap
[12:56] between the word "the" doesn't really
[12:58] matter right like "the" occurs
[13:00] everywhere so you want to have sort of
[13:02] the special words so that's what
[13:04] TF-IDF does in a nutshell it gives you a
[13:06] score for document query overlap and
[13:10] then you can do all kinds of things here
[13:12] with how how you weigh it so there's all
[13:14] these weird different parameters like
[13:15] this B and things like that that allow
[13:18] you to make it better than just having
[13:20] the TF-IDF score so there's a couple
[13:22] of tweaks you can do there so BM25
[13:25] actually in case you're wondering stands
[13:27] for best match 25
[13:29] so I tried to discover like where does
[13:31] the 25 actually come from uh that's
[13:34] because the preceding 24
[13:37] experiments failed right so it's
[13:39] literally the 25th one that seemed to
[13:41] work and that's why it's called
[13:42] BM25 it's bizarre right but um so
[13:46] this is sparse retrieval it's just
[13:48] counting words right so you have this
[13:50] massive massive vector of all these word
[13:53] occurrences it's sparse because most
[13:55] words never occur right so it's sort of
[13:57] like a vector of uh vocabulary size
[14:01] dimensions so most of that is obviously
[14:03] zero um but so that's actually kind of a
[14:06] nice property if you want to do fast
[14:08] search on a CPU right because on a CPU
[14:10] a sparse uh dot product is very easy to
[14:13] compute so um this is used in the
[14:16] system called uh DrQA which is really
[14:19] one of the first neural instances of
[14:22] this open domain sort of open book
[14:24] question answering paradigm um so you
[14:27] have a question like how many of
[14:29] Warsaw's inhabitants blah blah uh so you
[14:32] want to ask basically Wikipedia what the
[14:34] answer is for this so then you have this
[14:36] document retriever based on the sparse
[14:38] so BM25 I think in this case uh
[14:41] retrieval methods you pass that to um
[14:44] I think this was still a BiLSTM at
[14:47] the time um a document reader model and
[14:50] then that model gives you the answer um
[14:54] so this I think is really the first
[14:56] instance of having sort of this
[14:57] separation between a retriever and a
[14:59] generator system that you use for
[15:02] answering complicated questions based on
[15:03] sort of open domain
[15:05] knowledge um so after the sparse stuff um
[15:10] there was a bunch of work on dense
[15:11] retrieval and so the advantage of
[15:14] dense retrieval so this is just like
[15:16] word embeddings basically vectors right
[15:18] they're dense now no longer
[15:19] sparse so they're much uh smaller in
[15:22] terms of dimensionality and the nice
[15:24] advantage of dense retrieval is that
[15:27] it's not really about specific words
[15:28] right so uh if there are synonyms you can
[15:31] still um find the relevant document uh
[15:35] which you couldn't really do with a
[15:36] sparse representation right so that's
[15:38] really the advantage of dense is that you
[15:40] get like semantic
[15:41] similarity um so you can do this over
[15:45] word embeddings that doesn't really work
[15:46] all that well but uh at the time that
[15:49] people started thinking about this BERT
[15:50] was already out there and BERT is really
[15:52] great for giving you a vector
[15:53] representation for an entire sequence of
[15:55] words right so a sentence representation
[15:57] or a passage representation
[15:59] so there are all these cool systems like
[16:01] ORQA and uh DPR the dense passage
[16:04] retriever where um they essentially use
[16:08] the retrieval as a kind of latent
[16:09] variable in the system uh and the way
[16:12] to get the latent variable to work to
[16:14] be good enough essentially to train the
[16:16] entire system is to pre-train the
[16:19] retriever on uh relevant information so
[16:21] for ORQA they do something called inverse
[16:24] cloze uh so they do kind of a cloze task
[16:27] where you want to find
[16:29] um passages that are sort of relevant to
[16:31] the preceding passage and in DPR they
[16:34] just train it on on a supervised thing
[16:36] but really the core idea here is that uh
[16:38] as you can see in this graph here you
[16:40] can do better than BM25 if you add lots
[16:43] of documents and the way you compute
[16:45] this score function is much simpler it's
[16:47] just a dot
[16:48] product right um so the nice thing about
[16:52] dot products is that you can do them very
[16:55] very efficiently on the GPU as well um
[16:58] if you uh know what you're doing so what
[17:01] you really want to get at is maximum inner
[17:04] product search MIPS right this is one of
[17:05] the kind of core ideas of a lot of this
[17:07] stuff um and you can do MIPS with ANN
[17:12] approximate nearest neighbor search um and
[17:14] so there's this really uh brilliant
[17:17] piece of work out of FAIR from my
[17:19] colleagues at the time uh called FAISS
[17:22] which really underlies all of these uh
[17:24] modern vector databases right so like
[17:27] all the popular ones they're sort of
[17:28] re-implementations of this FAISS idea one
[17:30] is in like Rust one is in Go but it's
[17:32] all basically the same idea it's just
[17:34] FAISS um and so FAISS really powers a
[17:37] lot of this stuff um and whenever
[17:40] somebody tells you something about a
[17:41] vector database just think about FAISS
[17:44] very fast dot
[17:46] product um so obviously you can go
[17:49] beyond dot product yes what is it what is
[17:53] FAISS um so it's an open source
[17:56] library Facebook AI Similarity
[18:02] Search no so it's just basic off the
[18:04] shelf ANN
[18:09] algorithms yeah so so there are all
[18:12] kinds of different I don't know if you
[18:13] do you know what like product
[18:14] quantization is and things like that so
[18:17] there they're basically so you have a
[18:18] bunch of vectors uh and you can just
[18:21] compute the full dot product which is
[18:23] sort of inefficient right so what you
[18:25] can do is try to compress uh subspaces
[18:28] of the vector and then just look at the
[18:30] kind of
[18:31] centroids um so this so you can quantize
[18:34] sub vectors of the full vector and then
[18:36] do much faster search over just the
[18:41] centroids it's good question any other
[18:46] questions um all right so so about this
[18:49] dot product idea right so so what we
[18:52] have here is some people call this a
[18:54] Siamese Network I guess it is right so
[18:56] you have two different BERT models uh or
[18:59] whatever your encoder is here and then
[19:00] at the end you get these two vectors and
[19:02] then you just do a dot product so you get
[19:04] one single score but you can do all
[19:06] kinds of much fancier things if
[19:08] you're willing to give up on this bi-
[19:10] encoder uh approach right um so a really
[19:13] nice example from one of your
[19:15] colleagues here at Stanford uh is
[19:17] ColBERT um so what this does is late
[19:21] interaction uh so so instead of just
[19:24] having this dot product here you have a
[19:26] kind of more complicated uh
[19:28] version of computing a score where you
[19:30] aggregate over sort of Maximum
[19:32] similarity scores between different
[19:34] words so I only recently actually
[19:36] discovered that this is called ColBERT
[19:38] because of the late night show Colbert
[19:40] so it's sort of Omar's joke actually
[19:43] this name but just just so you know if
[19:45] you run into it um so um but but I think
[19:51] if if we look at kind of where the
[19:52] state-of-the-art has has been going now
[19:55] one of the nice things about these
[19:56] vector databases is that they're super
[19:58] efficient right so dot product is much
[20:00] more efficient than this late
[20:01] interaction stuff especially if you do
[20:03] the approximate nearest neighbor search
[20:05] um but there's been some really cool
[20:07] work so things like SPLADE uh they
[20:11] basically have sparse meets dense in
[20:14] a way so one of the big problems as I
[20:15] said with sparse is that you can't really
[20:17] handle synonyms and things like that but
[20:19] what you could do is take a dense model
[20:22] like a BERT model look at kind of
[20:24] this one word in your sequence try to
[20:27] see which other words fit in the same slot
[20:29] so that gives you the synonyms uh so now
[20:32] you can give all these synonyms to a
[20:34] sparse uh vector and then you can just
[20:36] do a sparse dot product and so have a much
[20:39] much more efficient way to do search uh
[20:42] without sort of giving up on all the the
[20:44] cool stuff that you get from a dense
[20:46] representation um so that's one thing
[20:49] and this other idea I really like uh is
[20:51] called DRAGON um so this I think is
[20:54] really the best generalized
[20:57] dense retriever so if you want to take
[20:58] something off the shelf right now and
[20:59] just go to Hugging Face or something
[21:01] then this DRAGON or DRAGON+ is
[21:03] probably the thing you want to use for a
[21:05] dense retriever and the way they train
[21:07] this is through this progressive data
[21:10] augmentation strategy to make the
[21:12] model better and better over time by
[21:13] sampling very difficult negatives um and
[21:16] that gives you very good uh
[21:19] representations um and and so the other
[21:21] thing about this I think this is the
[21:23] only only sort of final point about uh
[21:26] retrieval in general is that is that
[21:27] what we see happening right now if you
[21:29] look at sort of the developer community
[21:31] around RAG is that they're all doing
[21:32] hybrid search right now uh so you can
[21:35] actually just combine the search results
[21:37] from your sparse BM25 or whatever thing
[21:40] or SPLADE and you can combine them with
[21:42] your DRAGON uh and then you get uh this
[21:45] ranking that works even better uh so
[21:47] then you kind of get best of both worlds
[21:48] but then you get all these questions
[21:50] about how do you combine the
[21:52] results um any any questions on on this
[21:56] part oh can you hear me
[21:59] yes oh sorry um on the earlier slide uh
[22:02] has there been any work on um
[22:04] benchmarking how much less hallucination
[22:07] RAG incurs over a closed book question
[22:10] answering for example directly asking
[22:12] the large language model the question
[22:14] has there been any benchmarking studies
[22:16] in this yeah so there there's a great
[22:18] paper if I can say so myself on the fact
[22:21] that retrieval augmentation reduces
[22:23] hallucination uh it's from 2021 I think
[22:26] um so yeah you can just find it if you
[22:29] literally look for retrieval
[22:30] augmentation reduces hallucination then
[22:32] you'll find the paper uh thank
[22:43] you yeah so so uh very often you want to
[22:47] have um a very precise word overlap for
[22:51] things where you don't want to have the
[22:53] synonyms or the kind of nearest
[22:54] neighbors right so um if there's like a
[22:57] brand name name or or something like
[22:59] that then like let's say the brand is
[23:01] Apple right you don't want to find stuff
[23:03] about pears right so that's what you
[23:05] would get with a dense retriever um so
[23:08] it really kind of depends on what you
[23:11] want to use it for that's why hybrid is
[23:13] probably the way to
[23:14] go it's a good
[23:17] question with the
[23:19] dense it's
[23:21] um it's contextualized that but
[23:24] shouldn't it realize Apple the company
[23:26] would be different from no so so if they
[23:29] were actually contextualized then yes
[23:31] but but very often it's a a frozen
[23:33] retrieval system right that's one of the
[23:35] problems with all the frozen RAG
[23:41] stuff I might be missing very
[23:44] B referring to the vectors that
[23:48] you're vectors that you're using is
[23:52] or uh no so so the the the the sort of
[23:58] document and the query that they're the
[24:00] same right so they're either sparse or
[24:02] they're dense but so if they're sparse
[24:04] the components of the vector are are
[24:06] literally the other
[24:09] words you just Oneal when
[24:12] you're the thing that
[24:16] creates uh how are you getting so it's
[24:20] literally counts right so so basically
[24:23] it's one big matrix of documents as
[24:26] rows and the columns are the words in
[24:28] the documents and then you just count
[24:30] how often a word occurs in a document
[24:33] right so that's as
[24:35] far also
[24:39] referring yeah and so in the field we
[24:42] call them sparse embeddings or
[24:45] sparse retrieval because most of that
[24:47] vector is zero right because most words
[24:50] don't occur in that
[24:53] document does that make sense
[24:56] yeah
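The count matrix just described (documents as rows, words as columns, entries are counts) and the TF-IDF score on top of it can be sketched with the standard library alone. The three toy documents and the plain log(N / df) form of IDF are assumptions made here for illustration; BM25 and real search engines use smoothed and length-normalized variants.

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the stock market fell",
]

# One row per document: a sparse mapping word -> count.
# Words that do not occur in a document are simply absent (the implicit zeros).
rows = [Counter(d.split()) for d in docs]

# Document frequency: in how many documents each word occurs.
df = Counter(w for row in rows for w in row)

def idf(word):
    # Plain log(N / df); words that occur in every document get weight 0.
    return math.log(len(docs) / df[word]) if df[word] else 0.0

def score(query, row):
    # Sum of tf * idf over the query words; Counter returns 0 for missing words.
    return sum(row[w] * idf(w) for w in query.split())

scores = [score("the cat", row) for row in rows]
```

Here the word "the" occurs in every document, so its IDF is zero and it contributes nothing to the score, while "cat" separates the first two documents from the third.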
[24:58] cool um so um let's talk about uh doing
[25:04] slightly better so so going back to
[25:05] Stephen's question about okay we we have
[25:07] this kind of retrieval thing but like
[25:09] how do we actually make this retriever
[25:11] good for the context that is going to be
[25:13] used in right so can we contextualize
[25:15] the retriever for the generator uh even
[25:18] if it's it's a generator where we might
[25:20] not have access to the weights so it
[25:22] could be a GPT-4 model we just send it to
[25:24] some API we get some stuff back um
[25:28] and so uh one paper I really like is
[25:30] called REPLUG um so just to kind of
[25:33] explain what this looks like so you have
[25:35] this context you have a retriever that
[25:37] we do the standard retrieval step
[25:39] with this is a dense retriever um and
[25:42] now sorry um and now you uh compute
[25:45] the likelihood so basically just
[25:47] normalize the scores that you get
[25:49] for the top-k documents to get a
[25:52] distribution here and then uh you give
[25:54] each one of the retrieved documents
[25:57] separately to this generator to your
[25:59] language model so you can look at the
[26:02] perplexity of the correct answer for
[26:04] that language model right so now we have
[26:06] these two probability distributions or
[26:08] two likelihoods essentially and we can
[26:10] minimize the KL divergence to make sure
[26:13] that we can actually uh retrieve the
[26:15] documents that lead to the lowest
[26:17] perplexity on the right answer for the
[26:19] language model um so super simple idea
[26:23] uh works really really well uh and the
[26:26] nice thing about this is it's completely
[26:28] uh agnostic of what happens upstream
[26:30] right so this will work for any sort of
[26:32] encoder-decoder for any language model
[26:35] um what what you need is a perplexity
[26:38] score uh but for most language models
[26:40] you can get that not necessarily all of
[26:42] them so that's one thing and then
[26:44] there's this other really nice approach
[26:47] um what you what parameters are you
[26:50] changing so so in the retriever you're
[26:53] you're literally updating the uh the the
[26:56] dense representations
[26:58] right so your encoder basically for your
[27:00] dense representation that's good
[27:01] question we'll get more um so there's
[27:05] this other paper uh on in-context
[27:07] retrieval-augmented language models
[27:09] where the whole paper is basically about
[27:12] just doing BM25 and just giving stuff
[27:15] directly to the context of the language
[27:16] model and things kind of work so
[27:18] it's sort of frozen RAG but even
[27:21] more primitive in a way where the the
[27:23] retriever is uh this very old sparse
[27:26] algorithm but it works really really
[27:27] well um but then they have this really
[27:30] awesome section where they they show
[27:32] that you can just have this uh ranker on
[27:35] top of the BM25 results um and you can
[27:38] backprop into this ranker so now you
[27:40] still keep the language model completely
[27:42] fixed uh so that's sort of this part of
[27:45] the loss here uh so you have kind of
[27:47] a stop gradient on the parameters theta
[27:49] that's just your language model but now
[27:51] you have this uh this kind of rank
[27:54] function here that you can backprop
[27:56] into right so your ranker
[27:58] basically can be a BERT model or anything
[28:00] like that that works on top of the
[28:01] things you initially retrieve from your
[28:03] BM25 and now you have this BERT re-
[28:05] ranker that you can backprop into um so
[28:09] this also works really really nicely so
[28:11] we're slowly progressing towards having
[28:13] a system that is much more optimized for
[28:16] being properly uh retrieval augmented in
[28:19] a way where it's useful and and
[28:20] contextualized for what you want to use
[28:22] it
[28:23] for um so uh yeah just to point out kind
[28:26] of what that looks like with this ranker
[28:28] so you just have this extra step
[28:29] essentially right so we have our
[28:31] retriever then we have our ranker then
[28:33] we have our generator and our
[28:38] output no not
[28:41] necessarily um so so so for this one you
[28:44] do yeah but so for REPLUG you don't
[28:47] right yeah yeah yeah yeah yeah so
[28:52] basically yeah you need to get that, do APIs
[28:54] provide it, not all of them um some of them
[28:57] do right but but yeah there are all
[28:59] kinds of tricks you can do on top of
[29:01] that
[29:02] yeah um so
[29:04] so basically the question is how do we
[29:07] get sort of gradients flowing into this
[29:09] right so if you don't actually have
[29:10] access to the full parameters of the model
[29:13] so that you can backprop all the way
[29:14] through it then you can uh do a
[29:17] REINFORCE-style loss on on the retrieval
[29:20] and then you just pass the kind of log
[29:22] likelihood if you if you have access to
[29:23] that or some other kind of black-box
[29:26] function
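A minimal sketch of the REINFORCE-style idea just described, assuming the language model is a black box that only returns a log-likelihood for the answer given one retrieved document. The names `reinforce_grad` and `toy_log_likelihood`, the baseline value and the toy rewards are all hypothetical, chosen only for illustration; this is not the loss from any specific paper.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_grad(retriever_scores, reward_fn, rng, baseline=0.0):
    """One-sample REINFORCE estimate of d(expected reward)/d(retriever scores)."""
    probs = softmax(retriever_scores)
    # Sample which document gets handed to the black-box language model.
    idx = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    reward = reward_fn(idx)
    # d log p(idx) / d score_j = 1[j == idx] - p_j
    grad = [(reward - baseline) * ((1.0 if j == idx else 0.0) - p)
            for j, p in enumerate(probs)]
    return idx, grad

def toy_log_likelihood(doc_idx):
    # Hypothetical stand-in for the language model's log-likelihood of the
    # correct answer when conditioned on document doc_idx (document 1 helps most).
    return [-4.0, -0.5, -3.0][doc_idx]

rng = random.Random(0)
scores = [0.0, 0.0, 0.0]  # retriever scores for three candidate documents
for _ in range(500):
    _, grad = reinforce_grad(scores, toy_log_likelihood, rng, baseline=-2.5)
    # Gradient ascent on expected reward; no gradient ever flows through the LM.
    scores = [s + 0.1 * g for s, g in zip(scores, grad)]
best_doc = max(range(len(scores)), key=scores.__getitem__)
```

Only the retriever scores are updated here; the language model is queried for a number and never differentiated through.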
[29:31] all right so um the next thing you can
[29:35] do uh is to optimize both the retriever
[29:38] and the generator um and and so this
[29:41] really uh starts getting to
[29:43] the proper kind of contextualization of
[29:45] the entire architecture where you want
[29:47] everything to work together right so
[29:49] rather than having this Frozen thing
[29:50] where everything is basically not aware
[29:52] that the other part exists right it's
[29:54] like two halves of the brain they're not
[29:55] talking to each other one is your
[29:57] retriever the other is your language model
[29:58] there's no connection they're just like
[30:00] sort of like something is thrown over
[30:01] the fence and then you hope for the best
[30:03] uh so instead of that we have everything
[30:05] much closer and learning together um so
[30:09] um one of the the first um ways of doing
[30:13] this with a generator uh was RAG
[30:15] retrieval-augmented generation uh which
[30:17] we did at FAIR in 2020 um and and it's
[30:22] very similar to what we've already seen
[30:23] we basically have this retriever here
[30:25] that works over different documents you
[30:27] get some score function uh that gets
[30:29] given to this generator um that that
[30:32] generates the answer and now you want to
[30:34] backprop all the way and update your
[30:36] generator as well right so in the
[30:38] previous two architectures we saw you
[30:40] keep the generator fixed you backprop
[30:42] into your retriever but here we update
[30:45] everything well not exactly everything
[30:47] as you'll see but we'll we'll also
[30:49] update part of the retriever and
[30:52] the
[30:53] generator um so in this RAG model uh we
[30:56] actually have two different ways of
[30:58] doing this and this this is probably
[31:00] something that when we talk about this
[31:03] uh if you think about this long enough
[31:04] then you'll you'll think like okay but
[31:06] when actually do I need to retrieve like
[31:08] do I do I retrieve every time I generate
[31:11] a new token or do I just retrieve once
[31:13] and then generate an entire sequence
[31:16] right or maybe I want to retrieve every
[31:18] n uh tokens right so these are
[31:21] hyperparams or maybe I want to learn when to
[31:22] retrieve as as we'll see that's also
[31:24] something people have done um so there are
[31:27] two different ways to do it um and and
[31:30] what we do in this paper basically the whole
[31:32] point of the paper is that this Frozen
[31:34] thing doesn't really work all that well
[31:37] right so I think what people call RAG
[31:39] now usually refers to the
[31:42] Frozen thing uh but the whole paper
[31:44] basically would never have been accepted
[31:46] anywhere if we had just done the Frozen
[31:47] thing right the whole point of the paper
[31:49] is that you want to uh optimize it and
[31:52] so at my company contextual we call this
[31:55] Frozen thing Frankenstein's monster
[31:57] because it's really like you cobble
[31:58] together these different pieces right
[32:00] you sort of yeah it's it's really like
[32:02] Frankenstein you just put it together
[32:04] and then it sort of walks you know uh
[32:05] but it doesn't really have a soul it
[32:07] doesn't really actually work it's not
[32:08] the real thing um so that's great for
[32:12] for everyone here I think because there
[32:14] are so many opportunities to do better
[32:15] than what what most people are using
[32:17] right
[32:18] now um so one of the limitations of of
[32:22] the original RAG architecture is that it
[32:25] only supports a very small k but so
[32:28] if you have lots and lots of documents
[32:30] uh then the problem is that you have to
[32:32] fit all of them in the context but how
[32:34] do you really get that uh to fit right
[32:38] so one thing you can do is you you first
[32:41] encode uh things so that you get one
[32:43] single representation or only the few sort
[32:46] of top-level representations then you
[32:48] concatenate those and then you just feed
[32:50] them to the decoder so this is FiD
[32:52] Fusion-in-Decoder um and as you can see
[32:55] this scales to a much higher uh number of
[32:58] of passages uh and that uh leads to
[33:01] corresponding improvements in uh the
[33:04] scores that you care
[33:06] about uh so that's a really cool idea
[33:08] and so so we're we're slowly moving
[33:10] towards more decoder only architectures
[33:13] right so in RAG we have this BART model
[33:15] it's sort of an encoder decoder
[33:16] architecture but here you just have this
[33:18] decoder that does some fancy attention
[33:21] over stuff that you retrieved before um
[33:24] and and so another like pure decoder
[33:28] language model architecture um is this
[33:31] one
[33:32] kNN-LM which I think is is very elegant in
[33:35] its simplicity so it's basically you
[33:37] just have a normal language model but uh
[33:40] you interpolate the normal language
[33:42] model weights with uh things that you
[33:45] retrieved um so basically you have some
[33:48] sort of prompt right so like Obama's
[33:50] birthplace is you go to your big Corpus
[33:52] you find similar things you look at the
[33:55] words that come next to the similar
[33:57] things uh you uh rank that thing you
[34:00] sample your top K you renormalize that
[34:03] so now you have a bunch of scores and
[34:05] now you can just interpolate between
[34:07] your retrieved kind of non-parametric
[34:10] memory scores and your parametric
[34:12] language model scores so this is very
[34:14] late Fusion in a sense right you at the
[34:16] very end you combine these two uh and it
[34:18] allows you to re reweight the pure
[34:20] language model probabilities or
[34:22] likelihoods um so this works really well
[34:25] and it scales especially well if you
[34:27] have a huge uh retrieval Corpus so if
[34:30] you have trillions and trillions of
[34:32] tokens in there you could have a much
[34:34] smaller language model that does not
[34:36] that much heavy lifting because you can
[34:37] really rely on this big Source Corpus
[34:40] that you're working from and so that
[34:42] idea was uh exploited by this paper
[34:45] called Retro out of DeepMind where uh
[34:49] they showed that you can have a 25 times
[34:51] smaller retrieval augmented language
[34:53] model trained from scratch so really
[34:55] pre-trained uh entirely from scratch
[34:57] that outperforms this 25 times bigger uh
[35:00] language model on the same data in terms
[35:02] of perplexity which is pretty impressive
[35:05] right so this architecture is much more
[35:06] efficient than a parametric model
[35:09] because you can rely on this external
[35:11] memory so if your external memory is big
[35:13] enough uh you can get pretty huge gains
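As a minimal sketch of the kNN-LM late-fusion step described a little earlier (find similar contexts, look at the token that came next, renormalize, then interpolate with the language model), assuming the neighbours and their distances have already been retrieved. The function name, the `lam` and `temperature` parameters and the toy numbers are hypothetical illustration values.

```python
import math

def knn_lm_next_token_probs(lm_probs, neighbor_next_tokens, neighbor_distances,
                            lam=0.5, temperature=1.0):
    """Interpolate parametric LM probabilities with a kNN next-token distribution."""
    # Closer neighbours get more weight; normalize over the retrieved top-k only.
    weights = [math.exp(-d / temperature) for d in neighbor_distances]
    z = sum(weights)
    knn_probs = {}
    for token, w in zip(neighbor_next_tokens, weights):
        # Neighbours that were followed by the same token pool their mass.
        knn_probs[token] = knn_probs.get(token, 0.0) + w / z
    vocab = set(lm_probs) | set(knn_probs)
    # Late fusion: the mix happens on the final next-token distributions.
    return {t: lam * knn_probs.get(t, 0.0) + (1.0 - lam) * lm_probs.get(t, 0.0)
            for t in vocab}

# Toy example for the prompt "Obama's birthplace is" (all numbers made up).
lm_probs = {"Hawaii": 0.2, "Kenya": 0.5, "Chicago": 0.3}
neighbor_next_tokens = ["Hawaii", "Hawaii", "Chicago"]
neighbor_distances = [0.1, 0.2, 1.5]
mixed = knn_lm_next_token_probs(lm_probs, neighbor_next_tokens, neighbor_distances)
```

With these made-up numbers the retrieved neighbours pull probability mass toward "Hawaii" even though the parametric model alone prefers another token, which is the reweighting effect described above.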
[35:17] so there was a lot of excitement about
[35:19] retro when it was announced uh but it's
[35:21] a DeepMind paper so there's really no
[35:23] open source nothing really to validate
[35:26] that this actually Works um and so very
[35:29] recently there has been a bit of work
[35:31] from NVIDIA called
[35:33] Retro++ um where they have this hybrid
[35:36] between the Retro architecture and then
[35:39] they do basically RAG sort of they put
[35:41] the top one or the topk results in the
[35:44] context of the language model after all
[35:46] so it's sort of a crossover between RAG
[35:48] and retro and they show some really nice
[35:51] results here but I I think it's sort of
[35:53] pointing to this uh big flaw I think is
[35:56] that why is there still no good open
[35:58] source retro
[35:59] model that probably tells you something
[36:02] about whether it actually really works I
[36:04] I spent a lot of time in my career
[36:06] trying to reproduce DeepMind papers
[36:08] that didn't necessarily always work uh
[36:11] and so I I think the the same is true
[36:14] for retro um and that's why we need to
[36:17] do this in context rag on top of retro
[36:19] to actually get it to
[36:21] work but could it just be a true book
[36:24] thing because you're searing onook
[36:28] yeah but so
[36:31] that no so doing retrieval over
[36:34] that big corpus is not that
[36:37] difficult actually yeah um so so there are
[36:40] even like distributed FAISS packages you
[36:43] can just do everything yourself so yeah
[36:46] so in terms of compute it's it's actually
[36:48] not that hard anymore to to reproduce
[36:50] something like this uh but I've tried
[36:53] several times and it it's not really
[36:55] reproducible
[36:57] so the only way to get it to work is if
[36:58] you do this in context rag on top of the
[37:00] Retro thing and then as you can see here
[37:02] in the results then it actually gives
[37:04] you a gain over the pure GPT model right
[37:06] so it starts from a GPT and then they
[37:08] kind of retrofit as they call it the GPT
[37:12] model so in short I think there's still
[37:14] a lot of work to be done in pre-training
[37:16] these systems really from scratch uh and
[37:18] retro kind of showed that it might be
[37:20] possible but we don't necessarily know
[37:22] exactly how to do it the right way and
[37:24] this is really one of the interesting
[37:26] open
[37:27] questions um any questions on
[37:33] that
[37:38] online no okay then we'll move on um so
[37:45] um let's go all the way with the
[37:47] contextualization now right so so with
[37:50] retro and with rag what we actually did
[37:53] is we only updated the query encoder uh
[37:56] so updating the document encoder is very
[38:00] expensive so one of the first papers
[38:03] actually kind of the the OG of the the
[38:05] non-frozen dense retrieval augmented
[38:07] methods is this uh paper called REALM
[38:10] this is really like Visionary work this
[38:12] was basically the first uh uh kind of
[38:16] version that did this properly where
[38:18] they updated it all the way including
[38:20] the document encoder um so can can
[38:23] someone explain to me why it's expensive
[38:25] to update the document
[38:30] encoder so let's say we have a trillion
[38:32] tokens in our Corpus right and now so
[38:36] now we go all the way so we basically do
[38:38] a forward pass we get a gradient at the
[38:40] end now we back propagate the gradient
[38:42] through the retriever we update the
[38:44] query encoder now we have to update the
[38:46] document encoder so what do we then need
[38:48] to do after we've updated the document
[38:50] encoder we need to re-encode the entire
[38:53] internet right so basically every single
[38:56] gradient update we have to re-encode
[38:58] whatever our index is which so if this
[39:01] is like trillions of tokens it's like
[39:02] re-encoding the internet after every
[39:04] batch update so that's not very
[39:12] efficient
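The cost argument can be made concrete with a little arithmetic. This is a rough sketch under stated assumptions: it only counts document-encoder forward passes needed to keep the index in sync, and the function name, corpus size and refresh interval are hypothetical illustration values.

```python
def document_encodings(num_docs, train_steps, refresh_every=None):
    """Count document-encoder forward passes spent re-encoding the index."""
    if refresh_every is None:
        # Frozen document encoder: the index is built once and never refreshed.
        return 0
    # Every refresh re-encodes the whole corpus with the updated encoder.
    refreshes = train_steps // refresh_every
    return refreshes * num_docs

num_docs = 1_000_000   # hypothetical corpus size
train_steps = 10_000   # hypothetical number of gradient updates

every_step = document_encodings(num_docs, train_steps, refresh_every=1)
periodic = document_encodings(num_docs, train_steps, refresh_every=1_000)
query_side_only = document_encodings(num_docs, train_steps)
```

Refreshing after every update multiplies the corpus size by the number of training steps, while refreshing only every so often, or keeping the document encoder fixed, cuts that by orders of magnitude.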
[39:15] change
[39:17] Stuff AC have
[39:20] some
[39:23] predictable
[39:25] yeah
[39:27] yeah that's one one way to do it uh so
[39:29] so there there are a bunch of different
[39:30] ways to update the the document encoder
[39:33] so what they do in REALM is they
[39:35] basically do it for T batches then they
[39:39] stop they re-encode the entire internet
[39:41] and then they train again uh so it's
[39:43] sort of asynchronous updates they have
[39:45] this very fancy sort of sharding
[39:47] mechanisms where they take down uh
[39:50] certain parts of their entire index uh
[39:52] and then update them kind of on the Fly
[39:55] uh so you can do it it's just very
[39:57] expensive so one one of the things that
[39:59] a lot of people have been thinking about
[40:00] not exactly your idea but but similar
[40:02] versions of that um are around like can
[40:06] can you make it more efficient so that
[40:07] you don't have to do do this
[40:11] asynchronously um so one of the
[40:13] downsides of this REALM uh architecture
[40:16] is that it's really just a BERT model
[40:18] but then you do this retrieval
[40:19] augmentation on a BERT model with other
[40:21] BERT models so it's not really
[40:22] generative it's not really gen AI in the
[40:25] modern Paradigm but if you want to read
[40:27] like one paper uh on this topic like
[40:30] this is a very good one to
[40:31] read uh the other one that is is really
[40:34] really good to read uh is this paper
[40:37] called Atlas uh so Atlas is um uh so
[40:41] this is out of FAIR um with a bunch of
[40:44] folks the folks who did like RAG and the
[40:46] folks who did FiD and uh really a
[40:49] brilliant set of people and and this is
[40:51] really a comprehensive uh analysis of
[40:54] everything that's happening in this
[40:56] architecture so the first question they really
[40:58] look at is how do we train this
[41:00] retriever so we've seen a couple of
[41:01] versions of this um but uh which one
[41:05] actually works better they haven't
[41:06] really been compared in a head-to-head
[41:08] setting uh so one thing is we have this
[41:10] FiD-style attention distillation uh so
[41:14] that's really too complicated to go uh
[41:16] into detail here but the others are
[41:18] actually very simple um so one is this
[41:21] loss we've basically seen before right
[41:24] uh so we've seen this I think with the
[41:26] in-context RAG one right so we have a
[41:28] stop gradient on the language model and
[41:30] then we update the retriever the other
[41:32] one is what we've seen with REPLUG so
[41:35] this is basically exactly the REPLUG
[41:37] loss right so we have the KL divergence
[41:39] of the um the documents and and sort of
[41:43] the Improvement that you see when you
[41:44] give it that document uh the other thing
[41:47] they have is basically the inverse of
[41:49] that one so if I take this one document
[41:52] out how does that affect my uh my
[41:55] perplexity of the language model right
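A minimal sketch of the leave-one-out idea, assuming a hypothetical `toy_log_likelihood` that stands in for the frozen language model scoring the correct answer given a set of documents. A document whose removal hurts the likelihood most gets the largest target weight, and the retriever's distribution would then be pulled toward these targets with a KL divergence. The document strings and scores are made up, and this is not the exact formulation from the Atlas paper.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def leave_one_out_targets(docs, log_likelihood_fn):
    """Target distribution over docs: high when removing a doc hurts the answer."""
    losses = []
    for i in range(len(docs)):
        rest = docs[:i] + docs[i + 1:]
        # Negative log-likelihood of the answer with document i taken out.
        losses.append(-log_likelihood_fn(rest))
    return softmax(losses)

def kl_divergence(target, predicted):
    """KL(target || predicted), the quantity the retriever would minimize."""
    return sum(t * math.log(t / p) for t, p in zip(target, predicted) if t > 0)

def toy_log_likelihood(docs):
    # Hypothetical frozen LM: the answer is likely only if the key passage is present.
    return -0.3 if any("born in Honolulu" in d for d in docs) else -2.5

docs = [
    "Obama was born in Honolulu, Hawaii",
    "Obama served two terms as president",
    "Hawaii became a state in 1959",
]
targets = leave_one_out_targets(docs, toy_log_likelihood)
uniform_retriever = [1.0 / len(docs)] * len(docs)
loss = kl_divergence(targets, uniform_retriever)
```

Here the first document receives most of the target mass, because it is the only one whose removal changes the toy model's score.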
[41:58] um and so this one I think is actually
[42:01] quite elegant because that really gets
[42:03] to like how valuable is this one single
[42:05] document for me answering this question
[42:08] correctly um so uh they compare all of
[42:12] these different versions and uh what you
[42:14] can see is that uh the the kind of
[42:17] REPLUG-style loss and this leave-one-out
[42:19] loss they perform a lot better than all
[42:21] of these others so this fixed retriever
[42:23] or no joint pre-training these are
[42:25] really kind of the Baseline sort of
[42:27] frozen RAG models or closed-book uh and
[42:30] as you can see you can do really a lot
[42:32] better uh if you optimize things and so
[42:35] this leave-one-out thing is probably the
[42:38] best I would say um so then the other
[42:40] question is how do you actually like
[42:42] train that entire system like what data
[42:44] or what tasks do you train this on so
[42:46] they also uh experiment with a bunch of
[42:49] different versions uh so one is uh doing
[42:52] prefix LM if you're familiar with that
[42:54] uh so they basically take a chunk that
[42:57] occurs somewhere on the internet and
[42:59] then they predict the next Chunk from
[43:02] that chunk right so it's really like
[43:04] sentence to sentence so maybe like skip
[43:06] thought back in the day but now you have
[43:08] this retrieval step where you predict
[43:09] the next sentence uh then they just do
[43:13] T5-style sort of denoising so that's
[43:15] masked language modeling if you're
[43:16] familiar with T5 um and then they have
[43:19] this title to section generation piece
[43:21] so um I think the takeaway from this
[43:23] table is basically that whatever you do
[43:25] here so they're using a T5 model so
[43:28] whatever you do here needs to be the
[43:29] same that your uh language model expects
[43:32] um so for T5 that's T5 style
[43:35] loss um and then uh the the the next
[43:39] sort of final question that they look
[43:40] into going back to to what we talked
[43:42] about how exactly do we update this
[43:45] retriever uh so do we have to update the
[43:47] document encoder or do we maybe have to
[43:50] do some sort of reranking uh or do we
[43:52] maybe just update the query um and and
[43:55] quite surprisingly I think they find
[43:57] that just updating the query so like in
[43:59] the original RAG paper is actually
[44:01] already basically good enough in many
[44:04] cases so so that's nice because it's
[44:07] much more efficient if you don't have to
[44:08] update your documents all the time uh I
[44:11] think the the real question here though
[44:13] is like uh how good is your document
[44:15] representation to begin with so you need
[44:18] to have a very very high quality embedding
[44:20] model for this to work if you don't have
[44:22] that then this will not work but if you
[44:24] do have that then you get a very nice
[44:26] kind of query side fine-tuning
[44:31] thing uh so the the Atlas paper is about
[44:35] trying to do few-shot um sort of language
[44:38] modeling tasks so it's how how many
[44:40] examples are given in the
[44:45] context um yeah so so the main takeaway
[44:49] um here is that if you compare like the
[44:51] closed-book equivalent model to the
[44:53] retrieval augmented model uh you see
[44:56] very big
[44:58] improvements that's really the only
[45:00] takeaway of of this entire
[45:02] section um but I I think that that's
[45:06] really saying something uh in terms of
[45:09] what we should be thinking about um how
[45:11] how much time do I have
[45:14] in
[45:15] still okay okay all right other
[45:21] questions are the documents in the
[45:24] training step same as
[45:29] yeah so they can be different um in so
[45:33] in Atlas the Atlas paper basically tries
[45:35] everything uh so they also try to see
[45:37] what happens if I train this on
[45:39] Wikipedia but I swap in like a sort of
[45:42] Common Crawl index um and I think so
[45:45] in Atlas but also in Retro the main
[45:47] finding is just the more the better uh
[45:50] so it's really just like the bigger your
[45:52] index the more likely you are to
[45:54] find the exact right thing um and then
[45:58] make the right
[46:04] prediction any other questions on this
[46:07] oh yeah uh sorry I this is a question
[46:09] about the generator in the I guess uh
[46:12] the rag system so um recently I saw a
[46:17] paper on Mistral 7B so it introduces a
[46:20] lot of these uh new architectural
[46:22] changes like the sliding window
[46:23] attention to handle longer sequences
[46:26] at a smaller cost and the grouped-query
[46:28] attention for faster inference I'd like
[46:30] I'd like to like know your thoughts on
[46:33] designing a generator specifically for
[46:36] rag uh leveraging for example where
[46:38] Mistral 7B currently is because for
[46:41] example like the sliding window
[46:43] attention I could see how that could be
[46:44] adapted to the rag
[46:47] case yeah so so maybe your read on sort
[46:49] of what makes Mistral special is a bit
[46:52] different from mine so I I don't think
[46:53] that the sliding attention window thing
[46:55] is actually that interesting the reason
[46:57] Mistral works so well is because it's
[46:58] trained on a lot of data uh and you can
[47:01] do that more efficiently because you
[47:02] have sliding window attention so you
[47:03] don't need to attend to everything um
[47:07] but uh so to answer your question I I
[47:10] guess you're asking sort of about the
[47:11] architecture of the generator if you
[47:14] know that there's going to be a
[47:15] retriever so I I I think uh that's
[47:18] basically what retro tried to do right
[47:20] so um retro actually some of the people
[47:24] on the Retro paper are at Mistral now uh
[47:27] so they they have this uh chunked cross-
[47:30] attention idea here so you basically
[47:32] have a language model but the way it
[47:34] does the attention over the things you
[47:36] retrieve in your retro um architecture
[47:41] uh you they they kind of get integrated
[47:43] into a model not using the standard
[47:45] attention mechanism but using this
[47:48] slightly different chunked cross-
[47:50] attention oh okay so I think the the
[47:53] sliding window Attention Point I was
[47:55] trying to get get at was that uh it uses
[47:57] a fixed window so that whenever you're
[48:00] doing the query key computation in the
[48:02] attention with the query vectors and the
[48:04] key vectors you're using a fixed window
[48:07] attention so I think my idea was to
[48:10] actually one use a dynamic window
[48:13] because for example the rag case um if
[48:16] you use a fixed window when you're doing
[48:18] attention it it is possible that you
[48:21] actually are leaving you you're only
[48:23] looking at a fixed uh span of
[48:26] information so if you could maybe adapt
[48:28] Mistral so that you could make it better
[48:31] for the RAG case and and for example
[48:33] making the fixed window size a
[48:35] dynamic window uh yeah yeah I think it's
[48:39] an interesting idea so so for me uh the
[48:42] what Mistral is doing with with the
[48:44] sliding window that's basically like a
[48:46] convnet right so we had all these
[48:48] convolutional like light convnets where
[48:51] where we would have word embeddings and
[48:52] you would do convolutions over it and
[48:54] then pool uh and then you would still
[48:56] get the information out so it's not that
[48:58] the sliding window prohibits you from
[49:00] looking earlier it's just that that
[49:02] happens higher up in your Transformer
[49:04] sort of yeah
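A minimal sketch of why a sliding window does not stop information from earlier tokens reaching a later position once layers are stacked. It assumes a causal window of `window` tokens that includes the current token; the function name and the example sizes are hypothetical.

```python
def earliest_visible_position(position, window, num_layers):
    """Earliest token index that can influence `position` after stacked layers."""
    earliest = position
    for _ in range(num_layers):
        # Each layer lets a token read from the previous (window - 1) positions
        # of the layer below, so the reach grows layer by layer.
        earliest = max(0, earliest - (window - 1))
    return earliest

one_layer = earliest_visible_position(position=100, window=8, num_layers=1)
four_layers = earliest_visible_position(position=100, window=8, num_layers=4)
short_sequence = earliest_visible_position(position=10, window=8, num_layers=4)
```

A single layer only sees the last few tokens, but each extra layer extends the reach by another window, which is the convnet-like behaviour described above.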
[49:07] yeah so I think that definitely is an
[49:10] interesting direction to to think in
[49:12] yeah yeah so I think um it's like not
[49:15] too crazy to say are there any
[49:17] architectural changes that we can
[49:19] introduce into these 7 billion parameter
[49:21] models so that they could be better
[49:23] adapted to the RAG case
[49:27] yeah so there there there might be yeah
[49:30] I I think one one question is just how
[49:32] do you how do you do the attention over
[49:33] things you've retrieved which I think is
[49:35] what
[49:37] you're yeah
[49:39] thanks so just to make sure I understand
[49:42] so I mean in this retro model you're
[49:45] retrieving in each
[49:47] block and when you talk about putting
[49:50] the retrieval in the context are you
[49:53] saying that you only do it at the
[49:54] beginning you don't do it
[49:57] yeah so so in context so this is it's
[50:00] not exactly every layer sort of so it's
[50:02] every token right so every um every step
[50:05] basically not every block so does that
[50:09] make sense so it's not every layer that
[50:12] you do the retrieval yeah so every step
[50:16] right um so so this is kind of like like
[50:19] what RAG-token is so you retrieve every
[50:21] token you so you generate and then you
[50:24] can retrieve again or in the case of
[50:26] retro you can generate like a chunk and
[50:28] then you retrieve chunks again uh if you
[50:31] look at the in context case you retrieve
[50:33] once at the beginning and then you give
[50:36] it you're say that during this
[50:41] nobody yeah but so the so the in-context
[50:44] thing um so so here you don't actually
[50:48] give it as context at all like directly
[50:51] to the model right so here you get you
[50:53] let the decoder kind of attend over
[50:56] it
[51:02] yeah so I don't think cross attention
[51:05] really works yeah
[51:10] yeah other
[51:13] questions yeah we
[51:15] inside the the training of the retriever
[51:18] is not so necessary because of the
[51:21] large uh so I'm wondering what inside of
[51:24] the T like what cases are really need
[51:29] toiz update or anyway updates
[51:34] those yeah so you do want to update the
[51:36] retriever right but but only part of the
[51:38] retriever is necessary to be updated for
[51:41] a lot of these these cases um but so so
[51:46] I I think it uh so these are very
[51:48] specific data sets right Natural
[51:50] Questions Wizard of Wikipedia and FEVER
[51:52] so they're really very uh kind of
[51:54] knowledge-intensive tasks uh so in that
[51:57] case if you already have a very good
[51:59] system like DPR that is specifically
[52:01] pre-trained for those tasks then you
[52:04] only need to update the query encoder
[52:06] but so I would expect that if you move
[52:08] Beyond this to kind of General language
[52:10] modeling things like like retro then you
[52:13] probably do want to update the document
[52:15] encoder at least in a way where you can
[52:17] scale
[52:18] it so that in the this part that is very
[52:23] much in
[52:33] as long as we have a good opal knowledge
[52:36] of what of the maybe the documents by
[52:39] those uh good
[52:43] models yeah but so you need to learn how
[52:45] to kind of query into that Index right
[52:48] so if you if you don't do that uh then
[52:51] then yeah you don't get really good
[52:53] performance so that's sort of like your
[52:54] closed-book performance right if you just
[52:57] have the language model and you're just
[52:59] like what what does the parametric model
[53:01] on its own without the retrieval what
[53:03] does it actually know as you can see
[53:05] there there are pretty big gaps there
[53:11] right other questions otherwise I will
[53:14] cover other
[53:17] questions no uh hello yeah go for it a
[53:21] quick question like so uh what about
[53:24] like more hierarchical retrieval like I
[53:26] suppose there will be methods trying to
[53:28] not just retrieve a single chunk but
[53:30] some kind of like groups of chunks or
[53:31] something or summarized versions there
[53:34] there's been some interesting work on on
[53:36] doing that uh where you first try to
[53:38] find so you can have multiple indices
[53:40] and they can kind of cascade right so
[53:41] first you want to find the relevant
[53:43] document so you have some document
[53:44] representation and then within that
[53:46] document you want to find the relevant
[53:48] chunk uh so you can do it sort of that
[53:50] direction you can also do it in reverse
[53:52] I think I I have something on the slide
[53:54] there where you can find the chunk and
[53:56] then sort of expand uh the context
[53:59] around it and then give that to the
[54:00] language model um so I think yeah there
[54:04] are all kinds of interesting things you
[54:05] can do
[54:07] there cool H thanks I guess another
[54:10] thing just like can you compare RAG
[54:13] versus like long-context LLM efforts so
[54:16] there are a lot of things like around
[54:18] just having really long context and in the
[54:20] extreme it could replace RAG but I'd like
[54:22] to know your take yeah so so my my uh
[54:26] so everybody understands this question
[54:28] right so there there's there's a trend
[54:30] where we want to have very long context
[54:32] language model so that basically you can
[54:34] like take Harry Potter or something just
[54:36] put it into context and then ask a
[54:38] question like what is the name of like
[54:40] Harry Potter's owl or something right
[54:42] and then it can just attend over the
[54:43] entire thing um so attending over all of
[54:47] Harry Potter to answer that one question
[54:49] is super inefficient right uh so most of
[54:52] Harry Potter has nothing to do with the
[54:54] owl uh so but you are still kind of
[54:56] reading it if you do it with the long
[54:58] context window um so that's why I think
[55:01] doing it the RAG way where you have
[55:02] this non-parametric component is a much
[55:05] more efficient way to solve this problem
[55:07] and if you actually look at the
[55:09] literature on Long context Windows uh
[55:11] the way they they solve the problem of
[55:14] scaling the attention mechanism is by
[55:16] making it very sparse uh so they're
[55:19] basically turning it so that's a
[55:20] different kind of sparse but they're
[55:22] turning it into a non-parametric
[55:23] retrieval problem uh kind of behind the
[55:26] scenes so they're not they're not
[55:27] actually all that different if you want
[55:29] to scale long context then you're going
[55:30] to move towards a rag style
[55:34] architecture good
[55:38] thanks all right um so let's talk about
[55:41] some other interesting questions so one
[55:44] thing and I already alluded to this is
[55:47] when do we actually retrieve so
[55:49] if we want to uh like
[55:51] retrieve every token that's also very
[55:54] inefficient because I probably don't
[55:56] have to retrieve to generate
[55:58] "the" right I can probably do that on my
[56:00] own with the language model it's sort of a
[56:02] waste to go and retrieve stuff but if I
[56:05] only retrieve once at the beginning of
[56:07] the sequence that's probably also not
[56:08] great right so so what we ideally want
[56:11] to be able to do is to say okay
[56:13] sometimes I want to retrieve sometimes I
[56:15] don't want to retrieve and I'm going to
[56:16] learn when I want to kind of expend the
[56:19] the compute Budget on doing the
[56:21] retrieval um so a nice paper where they
[56:24] have a stab at this is called FLARE for
[56:26] active retrieval augmentation where they
[56:28] basically have the language model decide
[56:31] uh when it should do a search and what
[56:33] it should search for um so so I I
[56:37] think this fits in a general Trend that
[56:39] you can see in the field around kind of
[56:41] Agents right so we can talk a little bit
[56:43] more about that too um so this other uh
[56:47] question that that I think we also kind
[56:49] of covered already here is how do we
[56:51] train this at scale right so we can do
[56:52] these asynchronous updates we can do
[56:54] rerankers we can do query side only
[56:57] there's this really nice paper uh which
[56:59] is quite close I think to the idea you
[57:01] proposed uh where you first use BM25 to
[57:05] create a a batch basically where
[57:07] everything is very similar uh in terms
[57:10] of what you've retrieved and now you uh
[57:13] have this kind of in-batch update so it's
[57:16] it's sort of like a ranker where you
[57:17] encode the information that is just in
[57:19] your batch using this other model and
[57:22] now you can update this model on the fly
[57:24] so you don't have to worry too much
[57:25] about doing the full kind of documents
[57:27] side update um and again here what
[57:30] really matters is like how big is your
[57:32] index if you have an amazing index you
[57:33] can basically solve any problem just by
[57:35] looking it up right so rather than
[57:38] cramming it into your parameters you can
[57:40] just find it
[57:43] um this is a really nice paper uh called
[57:46] SILO so one of the interesting
[57:48] things I think that's going to happen in
[57:50] the next year or two around language
[57:53] models is and you've seen this
[57:54] already there's a bunch of lawsuits
[57:56] against OpenAI and other places around
[57:58] where does the data exactly come from um
[58:02] so one uh very elegant solution I think
[58:04] is to have a RAG system that you train
[58:06] on data that you know is safe so you can
[58:09] train that thing on Wikipedia but now
[58:12] during test time you can give it a data
[58:14] store that has maybe slightly riskier uh
[58:17] information in it so this massive index
[58:20] of all the stuff on the internet
[58:21] including some things that are maybe um
[58:25] risky uh you can still have them in your
[58:27] index but your language model uh your
[58:29] retrieval augmented language model I
[58:31] should say you know that that thing is
[58:33] safe because it was trained on data that
[58:34] is public domain uh so that's what they
[58:36] do in SILO and they show that that works
[58:38] really well so that's uh one possible
[58:42] solution to to a lot of the the kind of
[58:44] compliance and legal risk around
[58:45] language model
[58:48] deployments um there's a great paper and
[58:51] also from one of your colleagues um
[58:54] around uh contexts getting lost in the
[58:57] middle I think this is also kind of a
[58:58] fascinating phenomenon this is on a
[59:00] frozen RAG system um but uh language
[59:05] models are very similar to humans in
[59:07] what things they pay attention to so if
[59:09] you give them a bunch of things that you
[59:11] retrieved what what they will look at
[59:13] are like the first things you list and
[59:15] the last things you list and they will
[59:16] sort of ignore the middle um so if it
[59:19] actually respected the rank function
[59:21] then then this curve would go down all
[59:23] the way right but it sort of goes up
[59:26] um so I I I think that's a a very
[59:28] interesting observation which kind of
[59:30] shows how brittle uh these
[59:33] systems can be right so if you have a
[59:35] frozen RAG system it can be very
[59:37] brittle where like the order of the
[59:39] retrieved context matters a lot in whether
[59:41] you get the right answer or
[59:44] not
[59:48] [audience question, partly inaudible, about whether there is
[59:50] work on treating this as an RL problem and tuning for
[60:00] the particular data]
[60:04] yeah so what I just described
[60:06] someone asked like how how do you
[60:08] actually so I said there are other ways
[60:10] to do this and then the question was how
[60:12] do you do that so the way you do that is
[60:13] using REINFORCE um so yeah there has
[60:17] been work on doing that um so some of
[60:20] the older papers were playing with this
[60:21] but one of the big problems with uh
[60:25] so I think the REPLUG solution is sort of
[60:27] more elegant uh for solving that problem
[60:31] because you actually use signal from
[60:33] the language model and if you just do
[60:34] REINFORCE it's very high variance so
[60:36] uh it's going to be super
[60:38] finicky if you don't want to destroy
[60:40] your
[60:42] index but people have tried it
[60:47] though um so um uh there's some some
[60:51] really nice work from OpenAI where
[60:54] they basically show and again
[60:55] we're sort of like thinking more and
[60:57] more about agents here right uh where
[61:00] they show something very similar to the
[61:02] FLARE result from earlier with active
[61:03] retrieval that doesn't necessarily have
[61:05] to be some index that you own it can be
[61:07] just some some web search right um and
[61:10] obviously in this case you don't really
[61:12] have access to the web search
[61:13] necessarily so Bing or whatever they use
[61:15] here is not going to update its
[61:17] parameters uh but I just wanted to kind
[61:19] of put this in your mind like this is
[61:21] another thing you can do right and if we
[61:24] take this really to the general form uh
[61:27] then you can think of language models as
[61:29] just tool users um so rather than just
[61:32] retrieval augmenting language models we
[61:34] can tool augment language models and
[61:36] retrieval is just one of the many tools
[61:38] that language models have access to we
[61:40] can have uh re-rankers and things on top of
[61:43] the outputs of these tools um and so one
[61:45] of the the big questions I think uh is
[61:48] how do you actually get the system to to
[61:50] learn stuff right so we're going to need
[61:52] RL if we want this system to
[61:54] really learn how to take these
[61:55] actions uh
[61:57] properly
[61:58] um and so yeah this has been
[62:01] taken to the extreme in this uh sort
[62:04] of Self-RAG architecture where they have
[62:06] this sort of retrieval step and it's
[62:07] active and then you criticize it and
[62:09] then you uh basically do some natural
[62:11] language inference uh and all of that
[62:13] just with one language model to answer
[62:16] uh the
[62:17] questions um so the other missing piece
[62:20] so I'm just kind of going through a
[62:22] bunch of open questions uh that that
[62:24] people have looked at uh but feel free
[62:26] to interrupt me if there's anything you
[62:27] want to know um but so instruction
[62:30] tuning we established at the beginning
[62:32] of the lecture that this is pretty
[62:33] important for getting things to work so
[62:35] fixing the user interface um but the
[62:39] instruction tuning has almost always
[62:41] only happened on the language model and
[62:43] not on the entire system so I think one
[62:45] of the interesting uh things that people
[62:47] are looking at now with things like
[62:49] RA-DIT and InstructRetro is how can we
[62:51] instruction fine-tune an entire retrieval
[62:53] augmented system so all the way into the
[62:55] retrieval step can we generate data so
[62:58] that that also follows the instructions
[63:00] properly which currently doesn't happen
[63:02] in any of these model
[63:04] architectures um and then finally I I
[63:07] think I would be remiss if I if I didn't
[63:09] really talk about what people call
[63:11] advanced RAG so like the developer
[63:13] community has been really doing some
[63:15] awesome stuff uh so like frameworks like
[63:18] LlamaIndex and LangChain and there's
[63:19] all these open source vector databases
[63:21] like Chroma and Weaviate and they're all sort
[63:24] of about making RAG really easy but this
[63:26] is all frozen RAG right but even with
[63:29] frozen RAG you can really do incredible
[63:31] things um so uh we mentioned some of
[63:34] these already so child parent recursive
[63:36] retriever so you find small small parts
[63:38] and then you give the big parts around
[63:40] it to the language model you can do
[63:42] hybrid search where we use reciprocal
[63:44] rank fusion so we have like different
[63:45] search results that we then combine
[63:48] before we give the final thing to the
[63:49] language model there's zero-shot like
[63:52] large language model re-ranker so basically
[63:54] the score function doesn't come
[63:56] from your retrieval it comes directly
[63:58] from the language model um and then uh
[64:01] hypothetical document embeddings which I
[64:02] think is a really cool idea so you just
[64:05] uh basically you fix hallucination
[64:07] through hallucination uh so you get a
[64:10] question then you let the language model
[64:12] hallucinate a bunch of possible answers
[64:14] then you go and search for nearest
[64:16] neighbors to the possible answers and
[64:17] you give those as context and then it
[64:19] gives the right answer based on that
[64:21] right so it's really like hallucinating
[64:23] answers and I think it's a brilliant
[64:26] solution um so there's a lot of stuff
[64:28] happening in the kind of frozen RAG
[64:31] community uh that I think is very
[64:33] interesting to look at um so uh just to
[64:37] wrap up kind of looking at the future of
[64:40] this stuff uh there are still lots of
[64:42] very interesting open questions so if
[64:44] you're a student thinking about how to
[64:46] solve any of these I think you can have
[64:49] quite a lot of impact um so how how
[64:53] exactly do we do like pre-training of
[64:55] this architecture and do we even need to
[64:56] pre-train I think even RETRO kind of
[64:59] shows that you don't necessarily have to
[65:00] pre-train so but maybe there's something
[65:02] wrong with how we um how we do that what
[65:05] do scaling laws look like so I think
[65:07] there's a really interesting question
[65:08] here around if I have a huge index and a
[65:11] very rich encoder of all the information
[65:13] in that index maybe I can move so
[65:16] basically decouple all the memorization
[65:18] to this index so I have a language model
[65:20] that doesn't know anything it just
[65:22] speaks English it just sort of reasons on top
[65:24] but it has no knowledge because that
[65:26] always comes from this retriever if you
[65:28] can do something like that then you get
[65:29] very interesting scaling tradeoffs right
[65:31] so you can have a tiny language model
[65:33] and and do your retrieval uh to do a lot
[65:36] of the heavy lifting with your retrieval
[65:38] which is nice because that's a cached
[65:40] computation right so you
[65:42] already have the embeddings you just
[65:44] need to do the dot product so it's much
[65:46] more efficient than kind of self
[65:48] attention in the language model um can
[65:51] we move beyond bi-encoders so vector
[65:53] databases um I like people who build
[65:56] vector databases but I'm not sure how
[65:58] long we're going to keep vector
[66:00] databases um because uh I think
[66:04] re-rankers probably work just as well and
[66:06] BM25 is much more efficient than a
[66:08] vector database um so I I don't really
[66:13] see why we need dedicated vector
[66:15] databases and so what we're seeing but
[66:17] maybe this is a bit of a critique of uh
[66:20] maybe Silicon Valley investment
[66:22] strategies and things like that but a
[66:23] lot of these
[66:24] um vector database companies are
[66:27] basically becoming database companies
[66:28] now so they are adding all this sparse
[66:30] stuff because the dense thing is not
[66:32] enough um and as it turns out there are
[66:34] a lot of pretty good uh sparse databases
[66:38] out there already like Postgres and
[66:39] things like that and they're also all
[66:41] adding vectors uh to their databases so
[66:45] uh I think that's all going to kind of
[66:46] coalesce into
[66:50] databases um so I think there are also
[66:54] interesting things to look at for kind
[66:56] of the data so alluding to this
[66:57] instruction problem can we generate much
[67:00] better data for training RAG systems
[67:03] synthetically uh and then I think
[67:05] there's this massive open question
[67:06] around how we actually measure whether
[67:08] the RAG system is any good so right now
[67:10] we just look at downstream performance
[67:13] um which is sort of okay but if you
[67:15] mess up the retrieval it's very hard to
[67:17] measure um but how to how to measure
[67:20] whether your retrieval is right is also
[67:22] very difficult so there are some
[67:23] frameworks where they try to take like
[67:25] the harmonic mean of your retrieval
[67:27] accuracy and your language model
[67:29] accuracy uh but I think those are also
[67:31] very shaky because we don't really have
[67:33] very good uh data sets to measure that
[67:35] on so I think that's that's a very cool
[67:37] problem to work on as well um so the
[67:41] other problem that I personally am
[67:43] always very excited about is
[67:45] multimodality um and so why would we
[67:48] stop with RAG systems with just text
[67:51] right so you can do the same thing with
[67:53] images uh you can augment language
[67:55] models with vision so we did this work
[67:57] on LENS where we have a language model
[68:00] enhanced to see uh where you can just
[68:02] give kind of a computer vision pipeline
[68:05] just like a retrieval pipeline and give
[68:07] that to a frozen language model and pass
[68:09] it to the context and that system
[68:11] actually is an amazing visual question
[68:13] answering system it's close to
[68:15] state-of-the-art uh sort of Flamingo
[68:17] from DeepMind which is also very hard
[68:19] to reproduce because there's no open
[68:21] source version of that um
[68:24] so so we've done some early work on this
[68:26] in in 2021 uh where we have this cross
[68:29] modal retrieval and there's some uh more
[68:32] recent work out of FAIR where they also
[68:34] look at this so I think that's really
[68:36] like if you look at the trend in the
[68:37] field like multimodality with GPT-4V and
[68:40] things like that is really a hot topic
[68:41] so everything is kind of going in that
[68:43] direction uh so it's an interesting
[68:45] thing to think
[68:47] about um so overall I think um it would
[68:51] be nice if everybody sort of moves away
[68:53] from RAG 1.0 the frozen Frankenstein
[68:56] RAG and moves towards this much more
[68:58] kind of optimized version RAG 2.0 so
[69:01] it's really about systems over models
[69:03] right it's not just your language model
[69:05] and your retriever and they're kind of
[69:06] separate it's about thinking
[69:08] from a systems perspective about the
[69:10] entire thing and the problem you're
[69:11] trying to solve and so I think that
[69:14] really is the way that in deep learning
[69:16] things have always progressed where if
[69:17] you optimize the system end to end
[69:20] that's always going to win out like back
[69:21] in the day in computer vision or NLP we
[69:23] have like parsers and scam parsers and
[69:25] all this kind of stuff and all that just
[69:27] doesn't exist anymore now because we
[69:30] optimize the system end to end uh so
[69:32] that's what's going to happen here too uh
[69:35] so if we take that to the extreme like
[69:36] there's a chunker thing in your
[69:38] documents right like cutting it up
[69:39] into pieces like you could backprop into
[69:41] that like why not somebody should really
[69:44] do that um and so yeah I I think like
[69:48] trading off cost and quality uh and
[69:50] zero-shot domain generalization that's really
[69:52] like where this stuff is going to come
[69:53] in so language models right now they're
[69:55] amazing but very often they're way too
[69:57] expensive for being deployed somewhere
[69:59] where you can actually make money from
[70:01] them if you're in a company um so what
[70:03] you want to do is make it much more
[70:05] efficient and have the right cost
[70:07] quality tradeoff and the the easiest way
[70:09] I can think of is to do it through
[70:10] retrieval augmentation but obviously I'm
[70:12] I'm very biased um so uh yeah that that
[70:16] was all I had actually um so if you're
[70:18] interested in this I'm I'm at Stanford
[70:20] so I can work with you on research
[70:23] projects on these topics or if you want
[70:25] you can also join contextual because we
[70:27] work on this stuff every day thank
[70:30] you well um sorry I had a question from
[70:35] earlier yeah I think you said something
[70:37] really uh I think really super
[70:40] helpful earlier about Mistral 7B you talked
[70:42] about you compared the sliding window
[70:44] attention to convolutional neural
[70:46] networks and I do see the parallel
[70:48] because with convolutional neural
[70:49] networks you have uh several layers of
[70:51] several different layers of
[70:52] convolutional layers and the top
[70:54] convolution layers are able to see um a
[70:57] larger receptive field than the bottom
[70:58] convolution layers and um and with
[71:01] convolution layers you're able to tune
[71:03] the um filter sizes and the stride so
[71:07] you're able to see a different receptive
[71:09] field and I was wondering if you could
[71:11] see that same innovation in Mistral 7B by
[71:14] tuning um because you have different
[71:16] Transformer layers and each Transformer
[71:18] layer will have a span over a different
[71:19] set of tokens and if you can tune I
[71:21] guess the Transformer architecture the
[71:23] way you tune those convolution layers
[71:25] the filter sizes the receptive field
[71:27] perhaps we can do some optimization in
[71:29] the Transformer realm that we have
[71:31] already done in convolution layers yeah
[71:34] I I think that so that's a good idea
[71:36] there's a great paper on light
[71:38] convolutions I think from Michael Auli
[71:40] and David Grangier and a bunch of people where
[71:43] it's basically uh this this came out at
[71:46] exactly the same time as the Transformer
[71:48] and the Transformer is slightly more
[71:49] optimized for GPU computation but the
[71:52] the convolutional model was actually
[71:54] slightly better than the Transformer um
[71:57] so I it's definitely worth exploring
[72:00] okay cool
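The receptive-field analogy in the question can be made concrete with a little arithmetic. The sketch below assumes each layer lets a token attend to itself and the `window` tokens before it, so stacking layers lets information travel back by `window` positions per layer. The Mistral 7B figures in the last line (a 4096-token window and 32 layers) are quoted from memory of the model's published configuration and should be treated as assumptions.

```python
def reachable_positions(position, window, num_layers):
    """Token positions whose information can reach `position` after stacking layers."""
    reach = {position}
    for _ in range(num_layers):
        # Each layer lets every reachable token look back `window` more positions.
        reach = {j for p in reach for j in range(max(0, p - window), p + 1)}
    return reach


def theoretical_span(window, num_layers):
    """How far back information can travel: one window per layer."""
    return window * num_layers


# Small example: window of 3 tokens, 4 layers, looking from position 100.
reach = reachable_positions(position=100, window=3, num_layers=4)
earliest = min(reach)

# Assumed Mistral 7B settings: 4096-token window, 32 layers.
mistral_span = theoretical_span(window=4096, num_layers=32)
```

As with stacked convolutions, the per-layer window stays small while the effective span grows linearly with depth.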
[72:04] thanks [audience question, partly inaudible, about the
[72:07] advantages of the re-ranker]
[72:12] yeah so it depends on
[72:15] the problem I think what you
[72:17] probably want to do is sort of cast a
[72:19] wide net with BM25 and then just narrow
[72:23] it down with dense search uh so you
[72:25] often see that kind of as a two-stage
[72:27] process where the first one is kind of
[72:28] noisy you can add noise actually to your
[72:31] retrieval and then you use the dense one
[72:33] to filter it
[72:35] down [audience question, partly inaudible] yeah everyone's trying to maybe
[72:39] adapt their models to their
[72:42] own domain-specific area I think
[72:46] there are mainly two ways one way
[72:48] is to use instruction tuning
[72:52] or [inaudible] tuning-like
[72:53] methods and another way is just the main
[72:56] topic of this lecture using retrieval augmentation
[73:01] so I'm wondering besides the low-cost
[73:03] advantage of the retrieval-augmented way do you think
[73:07] the capacity or the quality of retrieval augmentation
[73:11] can be [comparable] with those
[73:13] tuning methods yeah so I think actually
[73:17] what what's going to happen is that all
[73:19] of this will come together right so so
[73:22] if you train things like end to end RAG
[73:25] 2.0 style then you can also fine-tune
[73:27] that system on some use case end to
[73:30] end right so why would you just
[73:33] take the retrieval augmented system if
[73:35] you can also fine-tune it on the thing you
[73:37] care about so I think in the end
[73:38] everybody's going to do all of those
[73:40] things and then there's questions like
[73:42] how do you do that efficiently so that's
[73:43] why you would use adapters or things like
[73:48] that I think there was another
[73:52] question [audience question, partly inaudible] I'm curious about hardware you
[73:54] say it's going to become a database kind
[73:56] of thing [inaudible] database but what about
[74:00] retrieval hardware [inaudible] because
[74:05] we've thought so much about the you know
[74:07] the [inaudible] part but what about because it's
[74:11] huge trillions [inaudible] so do you have any ideas
[74:15] or is it just a database problem so I don't know
[74:17] if I'm allowed to say this exactly
[74:19] actually but uh so one of the the
[74:23] biggest chip manufacturers that recently
[74:26] their stock has done really well they
[74:27] have some dedicated retrieval Hardware
[74:30] coming out I think soon or it might
[74:31] already be
[74:33] out um so yeah
[74:37] very efficient uh dense retrieval
[74:40] is a very big
[74:46] business
[74:51] [audience question, inaudible, apparently about whether this solves hallucination]
[74:58] um yes I I think I think so if you take
[75:01] it to the extreme so one of the big
[75:03] problems right now is that that if you
[75:05] contextualize an existing language model
[75:07] that already
[75:08] hallucinates then then it's going to be
[75:10] kind of hard to get rid of the
[75:11] hallucination right so if you do REPLUG
[75:13] on
[75:14] GPT-4 GPT-4 might still hallucinate so
[75:18] it could basically just ignore all the
[75:19] stuff you retrieved and just do whatever
[75:21] it wants anyway uh so that's one of the
[75:23] reasons why you want to train the system
[75:25] end to end and if you take that to the
[75:26] extreme where like I said right if you
[75:28] can just have the language model only
[75:31] reason and speak so it knows English and
[75:33] reasoning but it has no knowledge which
[75:35] all comes from somewhere else then then
[75:38] you can't hallucinate so it's really all
[75:40] grounded in whatever is in your
[75:47] index but so
[75:49] about hallucination I'm sort of
[75:51] frustrated that a lot of people in the
[75:53] field misunderstand what hallucination
[75:55] even means right so a lot of people are
[75:57] conflating hallucination with
[75:58] correctness or incorrectness so they're
[76:00] like oh the model made a mistake it
[76:02] hallucinated it's like no it made a
[76:04] mistake that's different from
[76:06] hallucination hallucination I think is
[76:07] a very specific kind of I retrieved
[76:10] something so I have some sort of
[76:11] counterfactual ground truth and what I'm
[76:14] saying uh does not correspond to that
[76:16] ground
[76:17] truth um and so yeah I think there's a
[76:22] bunch of folks at Stanford also
[76:23] working on better like measurements of
[76:25] hallucination and definitions and things
[76:27] like
[76:30] that [audience question, partly inaudible] if I'm understanding correctly your definition of
[76:33] hallucination only makes sense in the
[76:36] context [inaudible] yeah of some ground truth right so
[76:40] hallucination is really like there
[76:43] is something that is true right so
[76:45] so if we're talking about like
[76:47] hallucination yeah so if we're talking
[76:48] about just general parametric language
[76:50] models then sort of the ground truth is
[76:52] whatever we can consider to be true
[76:56] right but we had a word for like
[76:59] language models making mistakes before
[77:01] it was called making
[77:06] mistakes yeah
[77:08] [audience question, largely inaudible, apparently about how ground truth is defined]
[77:26] yeah so I like the sort of
[77:29] SILO mention there as well so I think
[77:32] the whole point is that you can you can
[77:35] have different indices and different
[77:36] definitions of ground truth and so um I
[77:39] think you could say I only trust the
[77:42] arXiv or I only trust like peer-reviewed
[77:44] papers and not just arXiv uh and so
[77:47] you can make decisions in your
[77:49] architecture during test time about what
[77:50] you define as ground truth
[77:53] and I also think actually that uh and
[77:57] there's a bunch of work I think
[77:58] happening on this right now you can
[77:59] control for how how grounded you want to
[78:01] be in your ground truth so uh that's
[78:05] another kind of misconception about
[78:06] hallucinations like sometimes
[78:08] hallucinations are actually good right
[78:10] if you have a creative writing assistant
[78:11] and you want it to come up with some cool
[78:13] new ideas you want the language model to
[78:15] hallucinate uh so I I think what you
[78:18] want to have is kind of a tunable knob
[78:19] where you say like now you can
[78:21] hallucinate and now maybe you should
[78:22] like really tell me the truth
[78:30] only anything
[78:38] else [audience question, partly inaudible, apparently about controlling this with temperature]
[78:41] yeah yeah so but the temperature that's
[78:44] just about how you sample right so how
[78:46] flat your your distribution is
[78:50] sample
[78:51] yeah
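The point about temperature can be illustrated with a small sketch: temperature only rescales the logits before the softmax, so it changes how flat the sampling distribution is, not what the model knows. The logits below are made-up numbers for illustration, and `softmax_with_temperature` is a hypothetical helper name.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into sampling probabilities at the given temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens

sharp = softmax_with_temperature(logits, 0.1)   # close to greedy
plain = softmax_with_temperature(logits, 1.0)   # unmodified softmax
flat = softmax_with_temperature(logits, 10.0)   # close to uniform
```

Even at a low temperature the unlikely tokens keep a non-zero probability. Sampling just becomes close to greedy, which is why temperature alone does not act as a groundedness control.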
[78:53] yes but so even if you have a low
[78:55] temperature it can still come up with
[78:57] random stuff right so it just says that
[79:00] then you're very likely to do like
[79:01] greedy sampling um so so I I think what
[79:05] you want to get at is is something more
[79:07] sophisticated than
[79:14] that lots of interesting questions yeah
[79:17] I like the questions thanks again for the
[79:19] great
[79:21] than