
Stanford CS25: V3 I Retrieval Augmented Language Models

Transcribed Jun 20, 2026
Intermediate | 16 min read | For: students, researchers, and practitioners in machine learning and NLP with a basic understanding of language models.
201.6K views | 3.9K likes | 75 comments | 16 dislikes | 2.0% (Average)

AI Summary

This lecture by Douwe Kiela, CEO of Contextual AI, provides a comprehensive overview of Retrieval-Augmented Generation (RAG). It covers the evolution from simple 'frozen' RAG systems to more sophisticated, end-to-end optimized architectures, addressing key challenges like hallucination, attribution, and customization.

[01:55]
Language Models Are Not New

The core idea of language models (predicting the next token) is decades old, not invented by OpenAI. ChatGPT's key innovation was fixing the user interface through instruction tuning and alignment.
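
The "predict the next token" idea can be written out as a short worked formula. This is a sketch: the notation (tokens $x_1, \dots, x_T$) is assumed here for illustration and is not taken from the lecture slides. A language model factorizes the probability of a sequence into per-token conditional probabilities using the chain rule:

```latex
p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})
```

Each factor on the right is one next-token prediction, which is the task the generator is trained on.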

[04:55]
Problems with Pure LMs

Pure language models suffer from hallucination, lack of attribution, staleness, inability to revise information, and difficulty in customization.

[06:08]
RAG as a Solution

RAG couples a language model with an external memory (retriever), allowing it to access and ground its generation in retrieved information, solving many of the above problems.

[11:00]
Frozen RAG (RAG 1.0)

The most common form of RAG is 'frozen RAG', where a pre-trained retriever and generator are used without any joint training. This is criticized as a 'Frankenstein's monster'.
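
A minimal sketch of the frozen pipeline described above, assuming a toy bag-of-words embedder in place of a real pre-trained embedding model. The helper names (`build_vocab`, `embed`, `retrieve`, `build_prompt`) and the example documents are hypothetical; nothing here is trained, and the language model call itself is left out.

```python
import numpy as np

def build_vocab(docs):
    """Map every word seen in the corpus to a vector index."""
    words = sorted({w for d in docs for w in d.lower().split()})
    return {w: i for i, w in enumerate(words)}

def embed(text, vocab):
    """Hypothetical stand-in for a frozen embedding model: normalized bag of words."""
    vec = np.zeros(len(vocab))
    for w in text.lower().split():
        if w in vocab:
            vec[vocab[w]] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def retrieve(query, docs, doc_vecs, vocab, k=1):
    """Return the k documents with the highest dot-product score."""
    scores = doc_vecs @ embed(query, vocab)
    top = np.argsort(-scores)[:k]
    return [docs[i] for i in top]

def build_prompt(query, passages):
    """Paste retrieved passages into the context; the generator is not updated."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

docs = [
    "FAISS is a library for similarity search",
    "BM25 is a sparse retrieval method",
    "Warsaw is the capital of Poland",
]
vocab = build_vocab(docs)
doc_vecs = np.stack([embed(d, vocab) for d in docs])  # plays the role of the vector database

query = "What is the capital of Poland"
passages = retrieve(query, docs, doc_vecs, vocab, k=1)
prompt = build_prompt(query, passages)  # would be passed to a frozen language model
print(prompt)
```

The only link between retriever and generator is the prompt string, which is why this setup depends entirely on in-context learning.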

[12:15]
Sparse vs. Dense Retrieval

Sparse retrieval (e.g., BM25) counts word overlaps, while dense retrieval (e.g., DPR) uses embeddings for semantic similarity. Hybrid search combines both for best results.
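
To make the sparse side concrete, here is a small sketch of BM25 scoring in plain Python. The whitespace tokenizer, the defaults `k1=1.5` and `b=0.75`, and the smoothed IDF variant are assumptions chosen for illustration; production implementations differ in these details.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with a basic BM25 formula."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a high IDF; terms like "the" get a low one.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # b controls how strongly long documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "retrieval augmented generation",
]
scores = bm25_scores("the dog", docs)
print(scores)
```

The second document ranks first because it contains the rarer word "dog", while the shared word "the" contributes little. A document with no overlapping words scores zero, which is the synonym blind spot that dense retrieval addresses.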

[17:20]
FAISS Powers Vector Databases

FAISS is the open-source library that underlies most modern vector databases, enabling efficient approximate nearest neighbor search.
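
As a reference point, the sketch below does exact (brute-force) maximum inner product search with NumPy. It does not use FAISS; it shows the exhaustive computation that approximate nearest neighbor libraries avoid. The function name `top_k_inner_product` and the toy vectors are hypothetical.

```python
import numpy as np

def top_k_inner_product(query, doc_matrix, k):
    """Exact MIPS: score every document, then keep the k best."""
    scores = doc_matrix @ query                # one dot product per document
    k = min(k, len(scores))
    top = np.argpartition(-scores, k - 1)[:k]  # unordered top-k candidates
    top = top[np.argsort(-scores[top])]        # order them best-first
    return top, scores[top]

doc_matrix = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [2.0, 2.0],
    [-1.0, 0.0],
])
query_vec = np.array([1.0, 1.0])
ids, vals = top_k_inner_product(query_vec, doc_matrix, k=2)
print(ids, vals)
```

This costs one dot product per stored vector for every query. Approximate methods trade a little accuracy for speed by searching compressed or clustered representations instead of the full matrix.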

[25:30]
Replug: Training the Retriever

Replug trains the retriever by minimizing the KL divergence between its document distribution and the generator's perplexity-based distribution, without updating the generator.
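
A hedged sketch of the quantity being minimized, assuming the retriever gives one similarity score per retrieved document and the frozen generator gives one log-likelihood of the correct continuation per document. The function names, the temperatures `gamma` and `beta`, and the KL direction (retriever distribution first) are assumptions for illustration, not details confirmed by this summary.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    x = np.asarray(x, dtype=float)
    z = np.exp(x - x.max())
    return z / z.sum()

def replug_style_kl(retriever_scores, lm_log_likelihoods, gamma=1.0, beta=1.0):
    """KL(P_retriever || Q_lm) over the retrieved documents.

    retriever_scores:   similarity of each document to the query (trainable side).
    lm_log_likelihoods: log-likelihood the frozen LM assigns to the correct
                        continuation when given each document as context.
    """
    p_retriever = softmax(np.asarray(retriever_scores, dtype=float) / gamma)
    q_lm = softmax(np.asarray(lm_log_likelihoods, dtype=float) / beta)
    return float(np.sum(p_retriever * (np.log(p_retriever) - np.log(q_lm))))

agree = replug_style_kl([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
disagree = replug_style_kl([3.0, 2.0, 1.0], [1.0, 2.0, 3.0])
print(agree, disagree)
```

The loss is zero when the retriever already ranks documents the way the language model finds them useful, and it grows as the two rankings diverge. In training, gradients would flow only into the retriever scores.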

[30:15]
Original RAG: End-to-End Training

The original RAG paper (2020) proposed end-to-end training of both retriever and generator, but it only supports a small number of documents in context.

[40:37]
Atlas: Query-Side Fine-Tuning is Key

The Atlas paper provides a comprehensive analysis of RAG training, finding that query-side fine-tuning (updating only the query encoder) is often sufficient if the document encoder is high-quality.

[58:50]
Lost in the Middle Problem

The 'lost in the middle' problem shows that models tend to ignore information in the middle of a long context, making RAG systems brittle.

[63:10]
Advanced RAG Techniques

Advanced RAG techniques include hybrid search, hypothetical document embeddings (HyDE), and active retrieval (e.g., FLARE) where the model decides when to search.

[68:50]
Future: RAG 2.0 and Systems Thinking

The future of RAG involves moving to 'RAG 2.0' with end-to-end optimization, treating the entire system (including chunking) as differentiable, and decoupling knowledge from reasoning.

Clickbait Check

95% Legit

"The title is accurate; the lecture is a deep dive into retrieval-augmented language models, exactly as promised."

Study Flashcards (12)

What was the main innovation of ChatGPT according to the lecture?

Difficulty: medium

To fix the user interface to the language model, making it easier for humans to interact with it.

03:25

List three problems with pure language models that RAG aims to solve.

Difficulty: medium

Hallucination, attribution (not knowing why the model says what it says), staleness (going out of date), inability to revise information, and difficulty in customization.

04:55

How does RAG solve the problem of model staleness?

Difficulty: hard

It adds a non-parametric retrieval component, allowing the model to access external information without memorizing it all.

07:55

What does BM25 stand for and why is it called that?

Difficulty: hard

BM25 stands for 'Best Match 25', named because it was the 25th attempt that worked.

13:25

What is the main difference between sparse and dense retrieval?

Difficulty: medium

Sparse retrieval (like BM25) counts word overlaps and is good for exact matches. Dense retrieval uses embeddings for semantic similarity, handling synonyms better.

12:15

What open-source library underlies most modern vector databases?

Difficulty: easy

FAISS (Facebook AI Similarity Search).

17:20

How does ColBERT differ from a standard bi-encoder dense retriever?

Difficulty: hard

It is a late interaction model that computes a more complex score by aggregating maximum similarity scores between words, rather than a simple dot product.

19:10
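
The contrast in this answer can be sketched in a few lines of NumPy. The token vectors are made up, and the mean-pooled `bi_encoder_score` is a simplified stand-in for a single-vector encoder; only the max-then-sum aggregation in `maxsim_score` reflects the late interaction described above.

```python
import numpy as np

def bi_encoder_score(query_vecs, doc_vecs):
    """One vector per side (mean pooling here), then a single dot product."""
    return float(query_vecs.mean(axis=0) @ doc_vecs.mean(axis=0))

def maxsim_score(query_vecs, doc_vecs):
    """Late interaction: for each query token take its best-matching
    document token, then sum those maxima."""
    sim = query_vecs @ doc_vecs.T          # shape: (query tokens, doc tokens)
    return float(sim.max(axis=1).sum())

# Toy token embeddings: 2 query tokens and 2 document tokens in 2 dimensions.
query_vecs = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_vecs = np.array([[1.0, 0.0], [0.5, 0.5]])

single = bi_encoder_score(query_vecs, doc_vecs)
late = maxsim_score(query_vecs, doc_vecs)
print(single, late)
```

Late interaction keeps one vector per token, so it needs more storage and compute than a single dot product. That cost is the efficiency trade-off the lecture raises when comparing it with vector databases.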

Explain the core idea behind the Replug paper for training a retriever.

Difficulty: hard

Replug minimizes the KL divergence between the retriever's document distribution and the generator's perplexity-based distribution, training the retriever to find documents that reduce the generator's perplexity.

25:30

Why is updating the document encoder in a RAG system expensive?

Difficulty: medium

It is very expensive because after every gradient update to the document encoder, you would need to re-encode the entire corpus (e.g., trillions of tokens).

38:00

What was a key finding of the Atlas paper regarding retriever training?

Difficulty: hard

The Atlas paper found that just updating the query encoder (query-side fine-tuning) is often good enough, provided the document encoder is already high-quality.

43:55

What is the 'lost in the middle' problem in the context of RAG?

Difficulty: medium

Models tend to ignore information placed in the middle of a long context and pay more attention to the first and last items, so the order of retrieved documents affects the answer.

58:50

How does the lecturer define 'hallucination' in the context of RAG?

Difficulty: hard

Hallucination is when a model's output does not correspond to a known ground truth, specifically when it has retrieved information but fails to use it correctly. It is different from a general mistake.

75:55

💡 Key Takeaways

ChatGPT's Real Innovation

Clarifies that ChatGPT's breakthrough was not inventing language models but fixing the user interface through instruction tuning and alignment.

03:25

Open-Book vs. Closed-Book Analogy

Provides an intuitive framework for understanding RAG: parametric models are like closed-book exams, while RAG is an open-book exam.

06:50

Origin of BM25 Name

Reveals the quirky history of a fundamental algorithm, showing that progress often comes from iterative experimentation.

13:25

Frankenstein's Monster Critique

Powerfully criticizes the common 'frozen RAG' approach as a cobbled-together system that lacks true integration and optimization.

31:55

Lost in the Middle Problem

Highlights a critical failure mode of RAG systems where the order of retrieved documents significantly impacts performance.

58:50

Hypothetical Document Embeddings (HyDE)

Describes a clever technique to fix hallucination through hallucination, by generating a hypothetical answer and then searching for its nearest neighbors.

64:00

Nuanced Definition of Hallucination

Provides a precise definition of hallucination in RAG, distinguishing it from general model mistakes and noting that it can sometimes be desirable (e.g., for creativity).

75:55

✂️ Creator Tools: Viral Hooks

AI-generated clip ideas for Shorts based on the transcript

OpenAI didn't invent language models

45s

Corrects a common misconception that OpenAI invented language models, sparking debate and engagement.


Why language models are still broken

56s

Lists critical problems like hallucination, attribution, and outdated knowledge that everyone using LLMs faces.


Open book vs closed book exams for AI

47s

Uses a relatable exam analogy to explain the core idea of RAG, making complex tech easy to understand.


Frankenstein's monster: Why most RAG systems fail

43s

Controversial take that popular frozen RAG systems are cobbled together and ineffective, challenging current practices.


25x smaller model outperforms giant AI

35s

Surprising efficiency claim that a much smaller model can beat a huge one due to retrieval augmentation.


[00:05] hey guys welcome to our last lecture

[00:08] um of this quarter and we're very happy

[00:12] to have Douwe here he's the CEO of

[00:16] Contextual AI um the enterprise LLM

[00:19] company as well as an adjunct professor

[00:22] in Symbolic Systems here at Stanford and

[00:25] previously he was the head of research

[00:26] at Hugging Face and before that a

[00:28] research scientist at Facebook AI Research

[00:32] uh he received his PhD and masters from

[00:34] the University of Cambridge um as well

[00:36] as a master in logic from the University

[00:38] of Amsterdam and studied philosophy and

[00:40] cognitive AI in undergrad um and his

[00:43] work focuses on machine learning as well

[00:45] as NLP specifically on developing better

[00:48] models for language understanding and

[00:50] generation and better tools for

[00:52] evaluation and benchmarking yeah give it up for

[00:56] Douwe

[00:58] right thank you so I guess I have to

[01:00] sort of stand here in the corner so

[01:02] people can see me on the zoom as well um

[01:06] yeah thanks so much for having me here

[01:09] um so I asked Steph what I should talk

[01:11] about there were a couple of things I

[01:13] could talk about multimodality or

[01:15] evaluation uh and this was the preferred

[01:18] topic I guess uh because the others were

[01:20] already covered um so yeah I'm I'm very

[01:24] happy to talk to you about everything

[01:25] retrieval augmentation um I think this

[01:28] is really one of the topics right now in

[01:31] our field um so I I'll just give you an

[01:34] overview of what's been happening and

[01:36] what I think are the interesting

[01:37] questions to think about um so first of

[01:41] all obviously in case you've missed it

[01:43] we are in the age of language models um

[01:46] and I just wanted to do a quick poll

[01:48] here in this not not super big audience

[01:51] I guess there's more people on the zoom

[01:52] but uh who invented language

[01:55] models if you thought OpenAI then

[01:58] I'm angry with you right so so actually

[02:01] uh this is a very very old idea so the

[02:04] idea is just you you take a sequence and

[02:06] you factorize out the token

[02:07] probabilities right and um so it wasn't

[02:11] invented by OpenAI it's not like a few

[02:14] years old it's actually several decades

[02:16] old uh so I'm bringing this up because I

[02:19] was talking to someone and they were

[02:20] like OpenAI invented language models

[02:22] and I was like you're kidding me right

[02:24] um so um I I I went back to the

[02:27] literature and this is the oldest one I

[02:29] could find actually 1991 first neural

[02:31] language model um there's a very nice

[02:34] paper from 2003 from

[02:36] Bengio where they actually have like

[02:38] word embeddings and everything already

[02:40] in there uh so obviously these are LMs

[02:43] not LLMs um and as it turns out if you

[02:46] make them really big and you

[02:48] parameterize them with these massive uh

[02:50] neural nets then you get something

[02:52] really powerful that really shows

[02:53] emergent uh properties right and that's

[02:55] why we're all excited about this stuff um

[02:59] so if we think about this from like a

[03:01] classic CS perspective there's input

[03:03] output right there's this kind of thing

[03:05] in the middle it's the generator so uh

[03:07] we take a sequence the input sequence

[03:10] and then the the task of the model is to

[03:12] predict the next token very very simple

[03:15] model um and and so you know that's why

[03:18] it was so easy to come up with this in

[03:20] 1991 already because it's like the idea

[03:22] is very intuitive but for a long time

[03:25] what was really broken with this was the

[03:27] user interface um and and this I think a

[03:31] lot of people kind of misunderstand what

[03:33] ChatGPT was about that's really what

[03:35] ChatGPT fixed right so

[03:38] initially you had to come up with these

[03:39] very weird prompts in order to get your

[03:41] language model to do what you wanted it

[03:43] to do uh and humans are terrible at this

[03:46] right so so we're much better at sort of

[03:48] telling people or things around us what

[03:50] we want right so if we have a dog we say

[03:52] sit we don't prompt it in a very weird

[03:54] way so that it sits right and it's the

[03:57] same with the language model if you

[03:58] want it to generate some rap lyrics in the

[04:01] style of a pirate or Shakespeare or

[04:03] something then you tell it generate some

[04:05] rap lyrics in the style of a pirate right

[04:07] so that kind of instruction data

[04:10] actually turns out to be super super

[04:12] rare in just web data so what you need

[04:14] to do is you need to fix the user

[04:16] interface to the language model and the

[04:18] the classic recipe for doing that is the

[04:21] the sequence basically that ChatGPT

[04:22] used right so you prompt the model in a

[04:24] specific way you instruction fine-tune the

[04:26] model and you do some alignment RLHF

[04:29] uh whatever you do on top of that so

[04:31] that's the first thing so now you have a

[04:33] working language model with a working

[04:36] user interface so are we done then um

[04:40] obviously we're not right so so right

[04:42] now language models are are kind of

[04:43] taking the world by storm but if you

[04:45] talk to anyone especially in an

[04:47] enterprise for example where they have

[04:48] very strict uh accuracy requirements

[04:51] they will tell you that they can't

[04:53] really productionize this yet um and the

[04:55] reason is because there are all these

[04:57] familiar problems probably a bunch of

[04:58] you are working on these problems right

[05:00] now uh around

[05:02] hallucination um so these models they

[05:04] kind of make up stuff very often with

[05:05] very high confidence which is uh even

[05:08] more scary in a way attribution so we

[05:11] don't really know why these models are

[05:12] saying what they're saying staleness

[05:15] they go out of date and so this was a

[05:17] big problem with sort of ChatGPT not

[05:19] knowing anything that happened after a

[05:20] certain cutoff date and they keep

[05:22] updating it every once in a while but

[05:24] you want to have a system that's always

[05:25] completely up to date that never goes

[05:27] stale um you want to be able to

[05:29] revise the information in the system so

[05:32] uh if you're uh a European organization

[05:34] you have to worry about GDPR uh which

[05:36] means that you need to be able to remove

[05:38] information from the language model or

[05:40] maybe revise facts uh which we don't

[05:42] really know how to do right so again

[05:44] this is a very interesting uh area of

[05:46] study for a lot of folks model editing

[05:49] um but so this is something that we

[05:51] really want to be able to fix and then

[05:53] there's this big question of how do you

[05:55] customize these models uh so different

[05:58] people have different use cases you have

[06:00] different data if you're a company or if

[06:01] you want to have a language model on

[06:03] your own data how do you make it work on

[06:05] your own data so one of the solutions uh

[06:08] that everybody has started using right

[06:11] now is to couple it to an external

[06:12] memory so that's really just RAG right

[06:15] the uh we we can this whole lecture is

[06:17] basically about RAG uh but the way to

[06:20] understand uh what is going on here is

[06:23] uh we have this generator just like

[06:25] before we have the input and a prompt just

[06:27] like before but now uh instead of just

[06:29] those two things we give this additional

[06:32] context so we contextualize the language

[06:34] model using things we retrieve and and

[06:37] the retriever uh is is very often pretty

[06:40] simple it's just a query and a document

[06:42] encoder um and then you get a bunch of

[06:44] documents you give them as context

[06:46] to the model so super simple

[06:49] architecture um and I think it's useful

[06:53] to think about it from the perspective

[06:54] of of these two separate paradigms uh so

[06:57] if you've ever taken an exam I'm sure

[06:59] you have right uh you can have a closed

[07:01] book exam where you have to memorize all

[07:03] of this so you have to cram all the

[07:04] knowledge into your parameters your

[07:07] neurons uh or you have an open book exam

[07:09] where you have all of this information

[07:11] in the book that you can access when you

[07:12] do the exam uh so it's a very similar

[07:15] thing with RAG right you can just make

[07:17] it an open book setting where you give

[07:18] it access to this external information

[07:21] Wikipedia or something else or basically

[07:23] the entire internet uh and then have the

[07:26] language model do its job without having

[07:27] to memorize all of it in its

[07:30] parameters um so the other I think

[07:32] useful distinction here is that uh

[07:35] cramming everything into your parameters

[07:36] that's the parametric approach right so

[07:39] uh what we're doing with RAG is we're

[07:40] adding this non-parametric retrieval

[07:43] component um so uh you might call this

[07:45] semi-parametric um if you want to give

[07:48] this a

[07:49] name all right so why why does that

[07:52] actually solve these issues and so the

[07:55] answer is basically that if you have

[07:57] this separate Index right this separate

[07:59] retriever you can swap it in you can

[08:01] swap it out you can replace it with a

[08:03] new index so you can really customize it

[08:06] and so you can customize your language

[08:07] model system for what the user really

[08:10] wants to see um and then obviously you

[08:13] can update this index so um it doesn't

[08:15] really go stale and you can revise it if

[08:17] everything goes wrong if anything goes

[08:20] wrong uh the other thing you get is

[08:22] grounding right so that that's initially

[08:24] why I became interested in this kind of

[08:26] architecture because I was thinking a

[08:27] lot about grounding and multimodal and

[08:29] things like that and actually one really

[08:31] nice way to ground things is to find

[08:33] some other information that you can

[08:35] ground your generation in so you really

[08:37] want the language model to only say

[08:39] things that it has evidence for in this

[08:41] other piece of text or even multimodal

[08:44] data that it retrieves separately so if

[08:46] you do that then you get less

[08:47] hallucination because you can always

[08:49] point back to your Source it's always

[08:50] grounded in your Source um and you get

[08:53] attribution because you know why the

[08:54] model is saying what it's saying it's

[08:56] because it found this thing here is

[08:59] that

[09:02] all right so um for the rest of this

[09:05] lecture we're going to talk about this

[09:06] this basic architecture um and so it

[09:10] kind of looks like a pretty simple thing

[09:12] right uh but there are actually lots and

[09:13] lots of questions you can ask about what

[09:16] what this system should really look like

[09:18] um and like this this doesn't even cover

[09:20] like half the questions you can ask so

[09:23] it really is about how how do we

[09:25] optimize this entire system right so we

[09:28] have the separate components the

[09:29] retriever the generator and then um

[09:32] there are things like this query encoder

[09:34] how do we encode queries how do we uh do

[09:37] the retrieval do we update the document

[09:39] encoder how do we actually uh Define a

[09:42] document right is it like a full

[09:44] document or is it a paragraph or a chunk

[09:46] or a sentence or a couple of words um so

[09:48] there are lots of questions to ask and

[09:51] and uh as you'll see there are lots of

[09:53] possible answers to these questions as

[09:55] well um so this is what we'll we'll

[09:58] cover

[10:00] um so there are lots of

[10:03] architectures um going into these

[10:05] questions and I think as we go through

[10:07] them it's useful for you to think about

[10:10] what happens during training time and

[10:12] what happens during test time right so

[10:14] during training time is really uh okay

[10:16] we have this language model we have this

[10:18] retriever um which one do we update how

[10:21] do we update them how do we train this

[10:23] entire system do we maybe not train it

[10:25] at all uh do we pre-train it from

[10:28] scratch do we initialize it with uh

[10:30] components that were already separately

[10:32] trained these are the kinds of questions

[10:34] that that you have to answer if you want

[10:35] to design a system like this and then

[10:38] during test time uh you have this entire

[10:41] system right so actually multiple models

[10:43] in a way uh that are working together um

[10:47] so so there's also different things you

[10:48] can do there right so give it different

[10:50] indices during test time or uh

[10:52] manipulate kind of how you're sampling

[10:54] things like

[10:55] that so um the starting point for all of

[10:59] this stuff I think if you ask someone

[11:00] now like what is RAG they will think of

[11:02] this thing um so this is frozen RAG

[11:06] basically uh there's no training here at

[11:09] all so going back to this question of

[11:10] train time test time there's only test

[11:12] time here train time happens separately

[11:14] with these kind of black-box models that

[11:16] we don't necessarily have control over

[11:18] right so there's this document embedding

[11:20] model uh whatever is currently at the

[11:23] top of some open source uh leaderboard

[11:26] uh you use that to oop sorry uh to get

[11:29] some vectors that you then use to create

[11:32] this Vector database and then the vector

[11:34] database just does search and it gives

[11:36] the information from the search to the

[11:38] language model and it just passes it as

[11:41] uh as the context right so this is this

[11:44] only works because of in-context

[11:46] learning um and you know I think as a as

[11:50] a machine learner myself this feels very

[11:52] inelegant um so what what this lecture

[11:55] is about is can we do better than than

[11:57] this Frozen

[11:59] thing um so let's let's start from the

[12:03] the left side of this like okay if we

[12:05] want to outperform this Frozen thing

[12:07] itself with just the vector database

[12:09] like what would that look like from a

[12:11] retrieval

[12:12] perspective um and the starting point

[12:15] for everything retrieval is is TF-IDF

[12:17] does everybody know what TF-IDF is no

[12:22] okay so so TF-IDF is basically a sparse

[12:25] retrieval method where you have a score

[12:27] function uh that that looks at documents

[12:30] and queries so D and Q and then there

[12:33] are basically two terms that matter one

[12:35] is the TF the term frequency and the

[12:37] other is the IDF the inverse document

[12:39] frequency so this inverse document

[12:41] frequency is actually a really nice idea

[12:43] from Karen Spärck Jones really underrated

[12:45] researcher she's done some amazing work

[12:48] um but the basic idea is that you want

[12:51] to look at the words that are very

[12:53] special so that don't occur in lots of

[12:54] different documents and so the overlap

[12:56] between the word "the" doesn't really

[12:58] matter right like "the" occurs

[13:00] everywhere so you want to have sort of

[13:02] the special words so that's what what

[13:04] TF-IDF does in a nutshell it gives you a

[13:06] score for document query overlap and

[13:10] then you can do all kinds of things here

[13:12] with how how you weigh it so there's all

[13:14] these weird different parameters like

[13:15] this B and things like that that allow

[13:18] you to make it better than just having

[13:20] the TF-IDF score so there's a couple

[13:22] of tweaks you can do there so BM25

[13:25] actually in case you're wondering stands

[13:27] for best match 25

[13:29] so I tried to discover like where does

[13:31] the 25 actually come from uh that's

[13:34] because the preceding 24

[13:37] experiments failed right so it's

[13:39] literally the 25th one that seemed to

[13:41] work and that's why it's called

[13:42] BM25 it's bizarre right but um so

[13:46] this is sparse retrieval it's just

[13:48] counting words right so you have this

[13:50] massive massive Vector of all these word

[13:53] occurrences it's sparse because most

[13:55] words never occur right so it's sort of

[13:57] like a vector of uh vocabulary size

[14:01] dimensions so most of that is obviously

[14:03] zero um but so that's actually kind of a

[14:06] nice property if you want to do fast

[14:08] search on a CPU right because on a CPU

[14:10] sparse uh dot product is very easy to

[14:13] compute so um this is used in the

[14:16] system called uh DrQA which is really

[14:19] one of the first neural instances of

[14:22] this open domain sort of open book

[14:24] question answering Paradigm um so you

[14:27] have a question like how many of

[14:29] Warsaw's inhabitants blah blah uh so you

[14:32] want to ask basically Wikipedia what the

[14:34] answer is for this so then you have this

[14:36] document retriever based on the sparse

[14:38] so BM25 I think in this case uh

[14:41] retrieval methods you pass that to um

[14:44] I think this was still a BiLSTM at

[14:47] the time um a document reader model and

[14:50] then that model gives you the answer um

[14:54] so this I think is really the first

[14:56] instance of having sort of this

[14:57] separation between a retriever and a

[14:59] generator system that you use for

[15:02] answering complicated questions based on

[15:03] sort of open domain

[15:05] knowledge um so after the sparse stuff um

[15:10] there was a bunch of work on dense

[15:11] retrieval and and so the advantage of

[15:14] dense retrieval so this is just like

[15:16] word embeddings basically vectors right

[15:18] they're they're dense now no longer

[15:19] sparse so they're much uh smaller in

[15:22] terms of dimensionality and the nice

[15:24] advantage of of dense retrieval is that

[15:27] it's not really about specific words

[15:28] right so uh if there are synonyms you can

[15:31] still um find the relevant document uh

[15:35] which you couldn't really do with a

[15:36] sparse representation right so that's

[15:38] really the advantage of dense is that you

[15:40] get like semantic

[15:41] similarity um so you can do this over

[15:45] word embeddings that doesn't really work

[15:46] all that well but uh at the time that

[15:49] people started thinking about this BERT

[15:50] was already out there and BERT is really

[15:52] great for giving you a vector

[15:53] representation for an entire sequence of

[15:55] words right so a sentence representation

[15:57] or a passage representation

[15:59] so there are all these cool systems like

[16:01] ORQA and uh DPR the dense passage

[16:04] retriever where um they essentially use

[16:08] the retrieval as a kind of latent

[16:09] variable in the system U and and the way

[16:12] to get the latent variable to to work to

[16:14] be good enough essentially to train the

[16:16] entire system is to pre-train the

[16:19] retriever on uh relevant information so

[16:21] for ORQA they do something called inverse

[16:24] cloze uh so they do kind of a cloze task

[16:27] where you want to find

[16:29] um passages that are sort of relevant to

[16:31] the preceding passage and in DPR they

[16:34] just train it on on a supervised thing

[16:36] but really the core idea here is that uh

[16:38] as you can see in this graph here you

[16:40] can do better than BM25 if you add lots

[16:43] of documents and the way you compute

[16:45] this score function is much simpler it's

[16:47] just a dot

[16:48] product right um so the nice thing about

[16:52] dot products is that you can do them very

[16:55] very efficiently on the GPU as well um

[16:58] if you uh know what you're doing so what

[17:01] you really want to get at is maximum inner

[17:04] product search MIPS right this is one of

[17:05] the kind of core ideas of a lot of this

[17:07] stuff um and you can do MIPS with ANN

[17:12] approximate nearest neighbor search um and

[17:14] so there's this really uh brilliant

[17:17] piece of work out of FAIR from my

[17:19] colleagues at the time uh called FAISS

[17:22] which really underlies all of these uh

[17:24] modern Vector databases right so like

[17:27] all the popular ones they sort of

[17:28] re-implementations of this FAISS idea one

[17:30] is in like Rust one is in Go but it's

[17:32] all basically the same idea it's just

[17:34] FAISS um and so FAISS really powers a

[17:37] lot of this stuff um and whenever

[17:40] somebody tells you something about a

[17:41] vector database just think about FAISS

[17:44] very fast dot

[17:46] product um so obviously you can go

[17:49] beyond dot product yes what is it what is

[17:53] FAISS um so it's an open source

[17:56] library Facebook AI Similarity

[18:02] Search no so it's just basic off the

[18:04] shelf ANN

[18:09] algorithms yeah so so there are all

[18:12] kinds of different I don't know if you

[18:13] do you know what like product

[18:14] quantization is and things like that so

[18:17] there they're basically so you have a

[18:18] bunch of vectors uh and you can just

[18:21] compute the full dot product which is

[18:23] sort of inefficient right so what you

[18:25] can do is try to compress uh subspaces

[18:28] of the vector and then just look at the

[18:30] kind of

[18:31] centroids um so this so you can quantize

[18:34] sub vectors of the full vector and then

[18:36] do much faster search over just the

[18:41] centroids it's good question any other

[18:46] questions um all right so so about this

[18:49] dot product idea right so so what we

[18:52] have here is some people call this a

[18:54] Siamese Network I guess it is right so

[18:56] you have two different BERT models uh or

[18:59] whatever your encoder is here and then

[19:00] at the end you get these two vectors and

[19:02] then you just do dot product so you get

[19:04] one single score but you can do all

[19:06] kinds of much fancier things if you if

[19:08] you're willing to give up on this bi-encoder

[19:10] uh approach right um so really

[19:13] nice example from from one of your

[19:15] colleagues here at Stanford uh is

[19:17] ColBERT um so what this does is is late

[19:21] interaction uh so so instead of just

[19:24] having this dot product here you have a

[19:26] kind of more complicated uh

[19:28] version of computing a score where you

[19:30] aggregate over sort of Maximum

[19:32] similarity scores between different

[19:34] words so I only recently actually

[19:36] discovered that this is called ColBERT

[19:38] because of the late night show Colbert

[19:40] so it's sort of Omar's joke actually

[19:43] this name but just just so you know if

[19:45] you run into it um so um but but I think
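
To make the contrast between a single dot product and late interaction concrete, here is a minimal sketch. The token vectors are made-up 2-dimensional toy values and the function names are hypothetical; the scoring rule (for each query token take the maximum similarity over document tokens, then sum) follows the description above.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def bi_encoder_score(query_vec, doc_vec):
    """One pooled vector per side, one dot product, one score."""
    return dot(query_vec, doc_vec)


def late_interaction_score(query_tokens, doc_tokens):
    """For each query token, keep its best-matching document token, then sum."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy per-token embeddings (assumed values, not from a real encoder).
query_tokens = [(1.0, 0.0), (0.0, 1.0)]
doc_a_tokens = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5)]  # matches both query tokens
doc_b_tokens = [(1.0, 0.0), (0.2, 0.0)]              # matches only the first

score_a = late_interaction_score(query_tokens, doc_a_tokens)
score_b = late_interaction_score(query_tokens, doc_b_tokens)
```

Because every document token vector has to be kept around, this costs more storage and compute than the bi-encoder score, which is the efficiency trade-off discussed next.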

[19:51] if if we look at kind of where the

[19:52] state-of-the-art has has been going now

[19:55] one of the nice things about these

[19:56] Vector databases is that they're super

[19:58] efficient right so dot product is much

[20:00] more efficient than this late

[20:01] interaction stuff especially if you do

[20:03] the approximate nearest neighbor search

[20:05] um but there's been some really cool

[20:07] work so things like SPLADE uh they

[20:11] basically have have sparse meets dense in

[20:14] a way so one of the big problems as I

[20:15] said with sparse is that you can't really

[20:17] handle synonyms and things like that but

[20:19] what you could do is take a dense model

[20:22] like a BERT model look at kind of this

[20:24] this one word in your sequence try to

[20:27] see which other words fit in the same slot

[20:29] so that gives you the synonyms uh so now

[20:32] you can give all these synonyms to a

[20:34] sparse uh vector and then you can just

[20:36] do sparse dot product and so have a much

[20:39] much more efficient way to do search uh

[20:42] without sort of giving up on all the the

[20:44] cool stuff that you get from a dense

[20:46] representation um so that's one thing

[20:49] and this other idea I really like uh is

[20:51] called DRAGON um so this I think is

[20:54] really the the the best generalized

[20:57] dense retriever so if you want to take

[20:58] something off the shelf right now and

[20:59] just go to Hugging Face or something

[21:01] then this DRAGON or DRAGON+ is

[21:03] probably the thing you want to use for a

[21:05] dense retriever and the way they train

[21:07] this is is through this progressive data

[21:10] augmentation strategy to make the

[21:12] model better and better over time by

[21:13] sampling very difficult negatives um and

[21:16] that gives you very good uh

[21:19] representations um and and so the other

[21:21] thing about this I think this is the

[21:23] only only sort of final point about uh

[21:26] retrieval in general is that is that

[21:27] what we see happening right now if you

[21:29] look at sort of the developer Community

[21:31] around RAG is that they're all doing

[21:32] hybrid search right now uh so you can

[21:35] actually just combine the search results

[21:37] from your sparse BM25 or whatever thing

[21:40] or SPLADE and you can combine them with

[21:42] your DRAGON uh and then you get uh this

[21:45] ranking that works even better uh so

[21:47] then you kind of get Best of Both Worlds

[21:48] but then you get all these questions

[21:50] about how do you combine the

[21:52] results um any any questions on on this

[21:56] part oh can you hear me
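
On the open question of how to combine sparse and dense results: one common option is reciprocal rank fusion. The lecture does not name a specific method, so treating RRF as the combiner is an assumption of this sketch, as are the document ids and the conventional constant `k=60`.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids; higher fused score ranks first.

    Each list contributes 1 / (k + rank) for every document it contains,
    so documents ranked well by both retrievers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score, breaking ties by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


sparse_ranking = ["d3", "d1", "d2"]  # hypothetical BM25 output
dense_ranking = ["d1", "d4", "d3"]   # hypothetical dense retriever output
fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
```

RRF only uses ranks, which sidesteps the problem that BM25 scores and dot-product scores live on different scales.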

[21:59] yes oh sorry um on the earlier slide uh

[22:02] was there has there been any work on um

[22:04] benchmarking how much less hallucination

[22:07] RAG incurs over closed-book question

[22:10] answering for example directly asking

[22:12] the large language model the question

[22:14] has there been any benchmarking studies

[22:16] in this yeah so there there's a great

[22:18] paper if I can say so myself on the fact

[22:21] that retrieval augmentation reduces

[22:23] hallucination uh it's from 2021 I think

[22:26] um so so yeah you can just find if you

[22:29] literally look for retrieval

[22:30] augmentation reduces hallucination then

[22:32] you'll find the paper uh thank

[22:43] you yeah so so uh very often you want to

[22:47] have um a very precise word overlap for

[22:51] things where you don't want to have the

[22:53] synonyms or the kind of nearest

[22:54] neighbors right so um if there's like a

[22:57] brand name or or something like

[22:59] that then like let's say the brand is

[23:01] Apple right you don't want to find stuff

[23:03] about pears right so that's what you

[23:05] would do with a dense retriever um so so

[23:08] it really kind of depends on what you

[23:11] want to use it for that's why hybrid is

[23:13] probably the way to

[23:14] go it's a good

[23:17] question with the

[23:19] dense it's

[23:21] um it's contextualized that but

[23:24] shouldn't it realize Apple the company

[23:26] would be different from no so so if they

[23:29] were actually contextualized then yes

[23:31] but but very often it's a a frozen

[23:33] retrieval system right that's one of the

[23:35] problems with all the frozen RAG

[23:41] stuff I might be missing very

[23:44] basic referring to the vectors that

[23:48] you're vectors that you're using is

[23:52] or uh no so so the the the the sort of

[23:58] document and the query that they're the

[24:00] same right so they're either sparse or

[24:02] they're dense but so if they're sparse

[24:04] the components of the vector are are

[24:06] literally the other

[24:09] work you just Oneal when

[24:12] you're the thing that

[24:16] creates uh how are you getting so it's

[24:20] literally counts right so so basically

[24:23] it's one big matrix of documents as

[24:26] rows and the columns are the words in

[24:28] the documents and then you just count

[24:30] how often a word occurs in a document

[24:33] right so that's as

[24:35] far also

[24:39] referring yeah and so so in the field we

[24:42] call them sparse sparse embeddings or

[24:45] sparse retrieval because most of that

[24:47] vector is zero right because most words

[24:50] don't occur in that

[24:53] document does that make sense

[24:56] yeah
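
The document-by-word count matrix just described is small enough to write out. This is a sketch with three made-up documents; it uses raw counts only, whereas BM25 adds term weighting and length normalization on top.

```python
docs = ["apple pie recipe", "apple stock price", "pear pie"]

# Columns: every distinct word in the corpus. Rows: one per document.
vocab = sorted({word for doc in docs for word in doc.split()})
matrix = [[doc.split().count(word) for word in vocab] for doc in docs]


def sparse_score(query, row):
    """Sum the counts of the query words in one document row (exact word overlap)."""
    return sum(row[vocab.index(word)] for word in query.split() if word in vocab)


scores = [sparse_score("apple price", row) for row in matrix]

# Most entries are zero, which is why these are called sparse vectors.
zeros = sum(row.count(0) for row in matrix)
total = len(docs) * len(vocab)
```

A query for "apple" scores zero against the "pear pie" document here, which is the exact-match behavior you want for brand names.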

[24:58] cool um so um let's talk about uh doing

[25:04] slightly better so so going back to

[25:05] Stephen's question about okay we we have

[25:07] this kind of retrieval thing but like

[25:09] how do we actually make this retriever

[25:11] good for the context that is going to be

[25:13] used in right so can we contextualize

[25:15] the retriever for the generator uh even

[25:18] if it's it's a generator where we might

[25:20] not have access to the weights so it

[25:22] could be a GPT-4 model we just send it to

[25:24] some API we get some stuff back um

[25:28] and so uh one paper I really like is

[25:30] called REPLUG um so just just to kind of

[25:33] explain what this looks like so you have

[25:35] this context you have a retriever that

[25:37] we do the the standard retrieval step

[25:39] with this is a dense retriever um and

[25:42] now sorry um and now you uh compute the

[25:45] the likelihood so basically just

[25:47] normalize the scores that you get for

[25:49] for the top-k documents to get a

[25:52] distribution here and then uh you give

[25:54] each one of the retrieved documents

[25:57] separately to this generator to your

[25:59] language model so you can look at the

[26:02] perplexity of the correct answer for

[26:04] that language model right so now we have

[26:06] these two probability distributions or

[26:08] two likelihoods essentially and we can

[26:10] minimize the KL Divergence to make sure

[26:13] that we can actually uh retrieve the

[26:15] documents that lead to the lowest

[26:17] perplexity on the right answer for the

[26:19] language model um so super simple idea

[26:23] uh works really really well uh and the

[26:26] nice thing about this is it's completely

[26:28] uh agnostic of what happens Upstream

[26:30] right so this will work for any sort of

[26:32] encoder decoder for any language model

[26:35] um what what you need is a perplexity

[26:38] score uh but for most language models

[26:40] you can get that not necessarily all of

[26:42] them so that's one thing and then

[26:44] there's this other really nice approach
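
The REPLUG objective described above compares two distributions over the top-k documents. Below is a minimal numeric sketch: the scores and log-likelihoods are invented, temperatures are fixed at 1, and taking the softmax over log-likelihoods and the KL direction shown are assumptions of this sketch rather than a statement of the paper's exact formulation.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same documents."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Retriever similarity scores for the top-3 documents (hypothetical).
retriever_scores = [2.0, 1.0, 0.0]
# Log-likelihood the frozen LM assigns to the correct answer when given each
# document separately (hypothetical). Document 1 helps the LM the most.
answer_log_likelihoods = [-3.0, -0.5, -4.0]

p_retriever = softmax(retriever_scores)
q_lm = softmax(answer_log_likelihoods)
loss = kl_divergence(p_retriever, q_lm)

# If the retriever ranked documents exactly as the LM prefers, the loss is zero.
aligned_loss = kl_divergence(softmax(answer_log_likelihoods), q_lm)
```

Only the retriever's scores would receive gradients from this loss; the language model is queried for likelihoods and otherwise left untouched.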

[26:47] um what you what parameters are you

[26:50] changing so so in the retriever you're

[26:53] you're literally updating the uh the the

[26:56] dense representations

[26:58] right so your encoder basically for your

[27:00] dense representation that's good

[27:01] question we'll get more um so there's

[27:05] this other paper uh on in-context

[27:07] retrieval augmented language models

[27:09] where the whole paper is basically about

[27:12] just doing BM25 and just giving stuff

[27:15] directly to the context of the language

[27:16] model and things kind of work so it's

[27:18] it's sort of frozen RAG but even even

[27:21] more primitive in a way where the the

[27:23] retriever is uh this very old sparse

[27:26] algorithm but it works really really

[27:27] well um but then they have this really

[27:30] awesome section where they they show

[27:32] that you can just have this uh ranker on

[27:35] top of the BM25 results um and you can

[27:38] backprop into this ranker so now you

[27:40] still keep the language model completely

[27:42] fixed uh so that's sort of this part of

[27:45] the the loss here uh so you have kind of

[27:47] a stop gradient on the parameters theta

[27:49] that's just your language model but now

[27:51] you have this uh this kind of rank

[27:54] function here that you can backprop

[27:56] into right so that's your ranker is

[27:58] basically can be a BERT model or anything

[28:00] like that that works on top of the

[28:01] things you initially retrieve from your

[28:03] BM25 and now you have this BERT

[28:05] reranker that you can backprop into um so

[28:09] this also works really really nicely so

[28:11] we're slowly progressing towards having

[28:13] a system that is much more optimized for

[28:16] being properly uh retrieval augmented in

[28:19] a way where it's useful and and

[28:20] contextualized for what you want to use

[28:22] it

[28:23] for um so uh yeah just to point out kind

[28:26] of what that looks like with this ranker

[28:28] so you just have this extra step

[28:29] essentially right so we have our

[28:31] retriever then we have our ranker then

[28:33] we have our generator and our

[28:38] output no not

[28:41] necessarily um so so so for this one you

[28:44] do yeah but so for REPLUG you don't

[28:47] right yeah yeah yeah yeah yeah so

[28:52] basically yeah you need to get do APIs

[28:54] provide not all of them um some of them

[28:57] do right but but yeah there are all

[28:59] kinds of tricks you can do on top of

[29:01] that

[29:02] yeah um so

[29:04] so basically the question is how do we

[29:07] get sort of gradients flowing into this

[29:09] right so if you don't actually have

[29:10] access to the full parameters of the model

[29:13] so that you can backprop all the way

[29:14] through it then you can uh do a

[29:17] REINFORCE-style loss on on the retrieval

[29:20] and then you just pass the kind of log

[29:22] likelihood if you if you have access to

[29:23] that or some other kind of black-box

[29:26] function
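
A REINFORCE-style update needs only a scalar reward per sampled document, which is what makes it usable with a black-box generator. The sketch below estimates the gradient of the expected reward with respect to the retriever scores by sampling, then compares it with the exact value. The log-likelihood rewards are invented, no variance-reducing baseline is used, and `reinforce_gradient` is a hypothetical helper name.

```python
import math
import random


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_gradient(scores, reward_fn, n_samples, rng):
    """Monte Carlo estimate of d E[reward] / d scores.

    Uses the score-function identity: reward * d log p(i) / d scores,
    where d log p(i) / d score_j = 1[j == i] - p_j for a softmax policy.
    """
    probs = softmax(scores)
    indices = range(len(scores))
    grad = [0.0] * len(scores)
    for _ in range(n_samples):
        i = rng.choices(indices, weights=probs)[0]  # sample one document
        reward = reward_fn(i)                        # black-box call, no gradients
        for j in indices:
            grad[j] += reward * ((1.0 if j == i else 0.0) - probs[j])
    return [g / n_samples for g in grad]


# Hypothetical log-likelihoods returned by an API for each retrieved document.
log_likelihoods = [-1.0, -3.0, -2.0]
scores = [0.0, 0.0, 0.0]
estimate = reinforce_gradient(scores, lambda i: log_likelihoods[i], 20000, random.Random(0))

# Exact gradient for comparison: p_j * (r_j - E[r]).
probs = softmax(scores)
expected_reward = sum(p * r for p, r in zip(probs, log_likelihoods))
exact = [p * (r - expected_reward) for p, r in zip(probs, log_likelihoods)]
```

The estimate pushes up the score of the document with the highest reward and pushes down the one with the lowest, without ever differentiating through the generator.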

[29:31] all right so um the next thing you can

[29:35] do uh is to optimize both the retriever

[29:38] and the generator um and and so this

[29:41] really uh starts getting to the

[29:43] the proper kind of contextualization of

[29:45] the entire architecture where you want

[29:47] everything to work together right so

[29:49] rather than having this Frozen thing

[29:50] where everything is basically not aware

[29:52] that the other part exists right it's

[29:54] like two halves of the brain they're not

[29:55] talking to each other one is your

[29:57] retriever the other is your language model

[29:58] there's no connection they're just like

[30:00] sort of like something is thrown over

[30:01] the fence and then you hope for the best

[30:03] uh so instead of that we have everything

[30:05] much closer and learning together um so

[30:09] um one of the the first um ways of doing

[30:13] this with a generator uh was RAG

[30:15] retrieval augmented generation uh which

[30:17] we did at FAIR in 2020 um and and it's

[30:22] very similar to what we've already seen

[30:23] we basically have this retriever here

[30:25] that works over different documents you

[30:27] get some score function uh that gets

[30:29] given to this generator um that that

[30:32] generates an answer and now you want to

[30:34] backprop all the way and update your

[30:36] generator as well right so in the

[30:38] previous two architectures we saw you

[30:40] keep the generator fixed you backprop

[30:42] into your retriever but here we update

[30:45] everything well not exactly everything

[30:47] as you'll see but we'll we'll also

[30:49] update the the part of the Retriever and

[30:52] the

[30:53] generator um so in this rag model uh we

[30:56] actually have two different ways of

[30:58] doing this and this this is probably

[31:00] something that when we talk about this

[31:03] uh if you think about this long enough

[31:04] then you'll you'll think like okay but

[31:06] when actually do I need to retrieve like

[31:08] do I do I retrieve every time I generate

[31:11] a new token or do I just retrieve once

[31:13] and then generate an entire sequence

[31:16] right or maybe I want to retrieve every

[31:18] n uh tokens right so these are hyperparams

[31:21] or maybe I want to learn when to

[31:22] retrieve as as we'll see that's also

[31:24] something people have done um so there are

[31:27] two different ways to do it um and and

[31:30] what we do in this paper basically the whole

[31:32] point of the paper is that this frozen

[31:34] thing doesn't really work all that well

[31:37] right so I think what people call RAG

[31:39] now usually refers to the

[31:42] Frozen thing uh but the whole paper

[31:44] basically would never have been accepted

[31:46] anywhere if we had just done the Frozen

[31:47] thing right the whole point of the paper

[31:49] is that you want to uh optimize it and

[31:52] so at my company Contextual we call this

[31:55] frozen thing Frankenstein's monster

[31:57] because it's really like you cobble

[31:58] together these different pieces right

[32:00] you sort of yeah it's it's really like

[32:02] Frankenstein you just put it together

[32:04] and then it sort of walks you know uh

[32:05] but it doesn't really have a soul it

[32:07] doesn't really actually work it's not

[32:08] the real thing um so that's great for

[32:12] for everyone here I think because there

[32:14] are so many opportunities to do better

[32:15] than what what most people are using

[32:17] right

[32:18] now um so one of the limitations of of

[32:22] the original RAG architecture is that it

[32:25] only supports a very small k but so

[32:28] if you have lots and lots of documents

[32:30] uh then the problem is that you have to

[32:32] fit all of them in the context but how

[32:34] do you really get that uh to fit right

[32:38] so one thing you can do is you you first

[32:41] encode uh things so that you get one

[32:43] single representation or only the few sort

[32:46] of top-level representations then you

[32:48] concatenate those and then you just feed

[32:50] them to the decoder so this is FiD

[32:52] Fusion-in-Decoder um and as you can see

[32:55] this scales to a much higher uh number of

[32:58] of passages uh and that uh leads to

[33:01] corresponding improvements in uh the

[33:04] scores that you care

[33:06] about uh so that's a really cool idea

[33:08] and so so we're we're slowly moving

[33:10] towards more decoder only architectures

[33:13] right so in RAG we have this BART model

[33:15] it's sort of an encoder decoder

[33:16] architecture but here you just have this

[33:18] decoder that does some fancy attention

[33:21] over stuff that you retrieved before um

[33:24] and and so another like pure decoder

[33:28] language model architecture um is this

[33:31] one

[33:32] kNN-LM which I think is is very elegant in

[33:35] its simplicity so it's basically you

[33:37] just have a normal language model but uh

[33:40] you interpolate the normal language

[33:42] model weights with uh things that you

[33:45] retrieved um so basically you have some

[33:48] sort of prompt right so like Obama's

[33:50] birthplace is you go to your big Corpus

[33:52] you find similar things you look at the

[33:55] words that come next to the similar

[33:57] things uh you uh rank that thing you

[34:00] sample your top K you renormalize that

[34:03] so now you have a bunch of scores and

[34:05] now you can just interpolate between

[34:07] your retrieved kind of non-parametric

[34:10] memory scores and your parametric

[34:12] language model scores so this is very

[34:14] late Fusion in a sense right you at the

[34:16] very end you combine these two uh and it

[34:18] allows you to re reweight the pure

[34:20] language model probabilities or

[34:22] likelihoods um so this works really well
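
The late-fusion step in kNN-LM is a weighted average of two next-token distributions. The sketch below uses invented probabilities and neighbor distances for the "Obama's birthplace is" prompt; the interpolation weight `lam` and the exp(-distance) weighting of neighbors are assumptions of this illustration.

```python
import math


def knn_distribution(neighbors):
    """Turn retrieved (distance, next_token) pairs into a distribution.

    Closer neighbors get more weight; weights for the same token are summed.
    """
    weights = [math.exp(-distance) for distance, _ in neighbors]
    z = sum(weights)
    dist = {}
    for weight, (_, token) in zip(weights, neighbors):
        dist[token] = dist.get(token, 0.0) + weight / z
    return dist


def interpolate(p_lm, p_knn, lam):
    """Final probability: lam * retrieved memory + (1 - lam) * parametric LM."""
    vocab = set(p_lm) | set(p_knn)
    return {t: lam * p_knn.get(t, 0.0) + (1 - lam) * p_lm.get(t, 0.0) for t in vocab}


# Hypothetical parametric LM distribution over the next token.
p_lm = {"Hawaii": 0.2, "Illinois": 0.5, "Kenya": 0.3}
# Hypothetical nearest neighbors from the corpus: (distance, token that followed).
neighbors = [(1.0, "Hawaii"), (1.5, "Hawaii"), (3.0, "Illinois")]

p_knn = knn_distribution(neighbors)
p_final = interpolate(p_lm, p_knn, lam=0.5)
best_token = max(p_final, key=p_final.get)
```

In this toy case the language model alone prefers "Illinois", and the retrieved neighbors shift the final prediction to "Hawaii".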

[34:25] and it scales especially well if you

[34:27] have a huge uh retrieval Corpus so if

[34:30] you have trillions and trillions of

[34:32] tokens in there you could have a much

[34:34] smaller language model that does not

[34:36] that much heavy lifting because you can

[34:37] really rely on this big Source Corpus

[34:40] that you're working from and so that

[34:42] idea was uh exploited by this paper

[34:45] called RETRO out of DeepMind where uh

[34:49] they showed that you can have a 25 times

[34:51] smaller retrieval augmented language

[34:53] model trained from scratch so really

[34:55] pre-trained uh entirely from scratch

[34:57] that outperforms this 25 times bigger uh

[35:00] language model on the same data in terms

[35:02] of perplexity which is pretty impressive

[35:05] right so this architecture is much more

[35:06] efficient than a parametric model

[35:09] because you can rely on this external

[35:11] memory so if your external memory is big

[35:13] enough uh you can get pretty huge gains

[35:17] so there was a lot of excitement about

[35:19] RETRO when it was announced uh but it's

[35:21] a DeepMind paper so there's really no

[35:23] open source nothing really to validate

[35:26] that this actually works um and so very

[35:29] recently there has been a bit of work

[35:31] from Nvidia called Retro++

[35:33] um where they have this hybrid

[35:36] between the RETRO architecture and then

[35:39] they do basically RAG sort of they put

[35:41] the top one or the top-k results in the

[35:44] context of the language model after all

[35:46] so it's sort of a crossover between RAG

[35:48] and RETRO and they show some really nice

[35:51] results here but I I think it's sort of

[35:53] pointing to this uh big flaw I think is

[35:56] that why is there still no good open

[35:58] source retro

[35:59] model that probably tells you something

[36:02] about whether it actually really works I

[36:04] I spent a lot of time in my career

[36:06] trying to reproduce DeepMind papers

[36:08] that didn't necessarily always work uh

[36:11] and so I I think the the same is true

[36:14] for RETRO um and that's why we need to

[36:17] do this in-context RAG on top of RETRO

[36:19] to actually get it to

[36:21] work but could it just be a true book

[36:24] thing because you're searing onook

[36:28] yeah but so

[36:31] that no so the the doing retrieval over

[36:34] that to over that big Corpus is not that

[36:37] difficult actually yeah um so so there are

[36:40] even like distributed FAISS packages you

[36:43] can just do everything yourself so yeah

[36:46] so in terms of compute it's it's actually

[36:48] not that hard anymore to to reproduce

[36:50] something like this uh but I've tried

[36:53] several times and it it's not really

[36:55] reproducible

[36:57] so the only way to get it to work is if

[36:58] you do this in-context RAG on top of the

[37:00] Retro thing and then as you can see here

[37:02] in the results then it actually gives

[37:04] you a gain over the pure GPT model right

[37:06] so it starts from a GPT and then they

[37:08] kind of retrofit as they call it the GPT

[37:12] model so in short I think there's still

[37:14] a lot of work to be done in pre-training

[37:16] these systems really from scratch uh and

[37:18] retro kind of showed that it might be

[37:20] possible but we don't necessarily know

[37:22] exactly how to do it the right way and

[37:24] this is really one of the interesting

[37:26] open

[37:27] questions um any questions on

[37:33] that

[37:38] online no okay then we'll move on um so

[37:45] um let's go all the way with the

[37:47] contextualization now right so so with

[37:50] RETRO and with RAG what we actually did

[37:53] is we only updated the query encoder uh

[37:56] so updating the document encoder is very

[38:00] expensive so one of the first papers

[38:03] actually kind of the the OG of the the

[38:05] non-frozen dense retrieval augmented

[38:07] methods is this uh paper called REALM

[38:10] this is really like visionary work this

[38:12] was basically the first uh uh kind of

[38:16] version that did this properly where

[38:18] they updated it all the way including

[38:20] the document encoder um so can can

[38:23] someone explain to me why it's expensive

[38:25] to update the document encoder

[38:30] so let's say we have a trillion

[38:32] tokens in our Corpus right and now so

[38:36] now we go all the way so we basically do

[38:38] a forward pass we get a gradient at the

[38:40] end now we back propagate the gradient

[38:42] through the retriever we update the

[38:44] query encoder now we have to update the

[38:46] document encoder so what do we then need

[38:48] to do after we've updated the document

[38:50] encoder we need to re-encode the entire

[38:53] internet right so basically every single

[38:56] gradient update we have to re-encode

[38:58] whatever our index is which so if this

[39:01] is like trillions of tokens it's like

[39:02] re-encoding the internet after every

[39:04] batch update so that's not very

[39:12] efficient
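
Some back-of-the-envelope arithmetic shows why re-encoding after every update is impractical and how much a less frequent refresh saves. Only the trillion-token corpus comes from the lecture; the passage length, number of training steps, and refresh interval below are hypothetical.

```python
def index_refreshes(train_steps, refresh_every):
    """How many times the whole index is re-encoded during training."""
    return train_steps // refresh_every


corpus_tokens = 10**12        # "a trillion tokens in our corpus"
tokens_per_passage = 100      # hypothetical passage length
passages = corpus_tokens // tokens_per_passage

train_steps = 10_000          # hypothetical number of gradient updates

# Re-encode everything after every single update of the document encoder.
encodes_every_step = index_refreshes(train_steps, 1) * passages

# Re-encode only once every 500 updates, accepting a slightly stale index.
encodes_every_500 = index_refreshes(train_steps, 500) * passages

savings_factor = encodes_every_step // encodes_every_500
```

Under these assumptions the periodic refresh still costs hundreds of billions of passage encodings, which is why keeping the document encoder fixed is so attractive.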

[39:15] change

[39:17] Stuff AC have

[39:20] some

[39:23] predictable

[39:25] yeah

[39:27] yeah that's one one way to do it uh so

[39:29] so there there are a bunch of different

[39:30] ways to update the the document encoder

[39:33] so what they do in REALM is they

[39:35] basically do it for T batches then they

[39:39] stop they re-encode the entire internet

[39:41] and then they train again uh so it's

[39:43] sort of asynchronous updates they have

[39:45] this very fancy sort of sharding

[39:47] mechanisms where they take down uh

[39:50] certain parts of their entire index uh

[39:52] and then update them kind of on the Fly

[39:55] uh so you can do it is just very

[39:57] expensive so one one of the things that

[39:59] a lot of people have been thinking about

[40:00] not exactly the LoRA idea but but similar

[40:02] versions of that um are around like can

[40:06] can you make it more efficient so that

[40:07] you don't have to do do this

[40:11] asynchronously um so one of the

[40:13] downsides of this REALM uh architecture

[40:16] is that it's really just a BERT model

[40:18] but then you do this retrieval

[40:19] augmentation on a BERT model with other

[40:21] BERT models so it's not really

[40:22] generative it's not really gen AI in the

[40:25] modern Paradigm but if you want to read

[40:27] like one paper uh on this topic like

[40:30] this is a very good one to

[40:31] read uh the other one that is is really

[40:34] really good to read uh is this paper

[40:37] called Atlas uh so Atlas is um uh so

[40:41] this is out of FAIR um with a bunch of

[40:44] folks the folks who did like RAG and the

[40:46] folks who did FiD and uh really a

[40:49] brilliant set of people and and this is

[40:51] really a comprehensive uh analysis of

[40:54] everything that's happening in this architecture

[40:56] so the first question they really

[40:58] look at is how do we train this

[41:00] retriever so we've seen a couple of

[41:01] versions of this um but uh which one

[41:05] actually works better they haven't

[41:06] really been compared in a head-to-head

[41:08] setting uh so one thing is we have this

[41:10] FiD-style attention distillation uh so

[41:14] that's really too complicated to go uh

[41:16] into detail here but the others are

[41:18] actually very simple um so one is this

[41:21] loss we've basically seen before right

[41:24] uh so we've seen this I think with the

[41:26] in-context RAG one right so we have a

[41:28] stop gradient on the language model and

[41:30] then we update the retriever the other

[41:32] one is what we've seen with REPLUG so

[41:35] this is basically exactly the REPLUG

[41:37] loss right so we have the KL divergence

[41:39] of the um the documents and and sort of

[41:43] the Improvement that you see when you

[41:44] give it that document uh the other thing

[41:47] they have is basically the inverse of

[41:49] that one so if I take this one document

[41:52] out how does that affect my uh my

[41:55] perplexity of the language model right

[41:58] um and so this one I think is actually

[42:01] quite elegant because that really gets

[42:03] to like how valuable is this one single

[42:05] document for me answering this question

[42:08] correctly um so uh they compare all of

[42:12] these different versions and uh what you

[42:14] can see is that uh the the kind of

[42:17] REPLUG-style loss and this leave-one-out

[42:19] loss they perform a lot better than all

[42:21] of these others so this fixed retriever

[42:23] or no joint pre-training these are

[42:25] really kind of the baseline sort of

[42:27] frozen RAG models or closed-book uh and

[42:30] as you can see you can do really a lot

[42:32] better uh if you optimize things and so

[42:35] this leave-one-out thing is probably the

[42:38] best I would say um so then the other

[42:40] question is how do you actually like

[42:42] train that entire system like what data

[42:44] or what tasks do you train this on so

[42:46] they also uh experiment with a bunch of

[42:49] different versions uh so one is uh doing

[42:52] prefix LM if you're familiar with that

[42:54] uh so they basically take a chunk that

[42:57] occurs somewhere on the internet and

[42:59] then they predict the next Chunk from

[43:02] that chunk right so it's really like

[43:04] sentence to sentence so maybe like skip

[43:06] thought back in the day but now you have

[43:08] this retrieval step where you predict

[43:09] the next sentence uh then they just do

[43:13] T5-style sort of denoising so that's

[43:15] masked language modeling if you're

[43:16] familiar with T5 um and then they have

[43:19] this title to section generation piece

[43:21] so um I think the takeaway from this

[43:23] table is basically that whatever you do

[43:25] here so they're using T5 model so

[43:28] whatever you do here needs to be the

[43:29] same that your uh language model expects

[43:32] um so for T5 that's T5 style

[43:35] loss um and then uh the the the next

[43:39] sort of final question that they look

[43:40] into going back to to what we talked

[43:42] about how exactly do we update this

[43:45] retriever uh so do we have to update the

[43:47] document encoder or do we maybe have to

[43:50] do some sort of reranking uh or do we

[43:52] maybe just update the query um and and

[43:55] quite surprisingly I think they find

[43:57] that just updating the query so like in

[43:59] the original RAG paper is actually

[44:01] already basically good enough in many

[44:04] cases so so that's nice because it's

[44:07] much more efficient if you don't have to

[44:08] update your documents all the time uh I

[44:11] think the the real question here though

[44:13] is like uh how good is your document

[44:15] representation to begin with so you need

[44:18] to have very very high quality embedding

[44:20] model for this to work if you don't have

[44:22] that then this will not work but if you

[44:24] do have that then you get a very nice

[44:26] kind of query side fine-tuning

[44:31] thing um so the the Atlas paper is about

[44:35] trying to do few-shot um sort of language

[44:38] modeling tasks so it's how how many

[44:40] examples are given in the

[44:45] context um yeah so so the main takeaway

[44:49] um here is that if you compare like the

[44:51] closed-book equivalent model to the

[44:53] retrieval augmented model uh you see

[44:56] very big

[44:58] improvements that's really the only

[45:00] takeaway of of this entire

[45:02] section um but I I think that that's

[45:06] really saying something uh in terms of

[45:09] what we should be thinking about um how

[45:11] how much time do I have

[45:14] in

[45:15] still okay okay all right other

[45:21] questions are the documents in the

[45:24] training step same as

[45:29] yeah so they can be different um in so

[45:33] in Atlas uh Atlas basically tries

[45:35] everything uh so they also try to see

[45:37] what happens if I train this on

[45:39] Wikipedia but I swap in like a sort of

[45:42] Common Crawl index um and I think so

[45:45] in Atlas but also in RETRO the main

[45:47] finding is just the more the better uh

[45:50] so it's really just like the bigger your

[45:52] index the more likely you are to

[45:54] find the exact right thing um and then

[45:58] make the right

[46:04] prediction any other questions on this

[46:07] oh yeah uh sorry I this is a question

[46:09] about the generator in the I guess uh

[46:12] the rag system so um recently I saw a

[46:17] paper on Mistral 7B so it introduces a

[46:20] lot of these uh new architectural

[46:22] changes like the sliding window

[46:23] attention to handle longer sequences

[46:26] at a smaller cost and the grouped-query

[46:28] attention for faster inference I'd like

[46:30] I'd like to like know your thoughts on

[46:33] designing a generator specifically for

[46:36] rag uh leveraging for example where

[46:38] Mistral 7B currently is because for

[46:41] example like the sliding window

[46:43] attention I could see how that could be

[46:44] adapted to the rag

[46:47] case yeah so so maybe your read on sort

[46:49] of what makes Mistral special is a bit

[46:52] different from mine so I I don't think

[46:53] that the sliding window attention thing

[46:55] is actually that interesting the reason

[46:57] Mistral works so well is because it's

[46:58] trained on a lot of data uh and you can

[47:01] do that more efficiently because you

[47:02] have sliding window attention so you

[47:03] don't need to attend to everything um

[47:07] but uh so to answer your question I I

[47:10] guess you're asking sort of about the

[47:11] architecture of the generator if you

[47:14] know that there's going to be a

[47:15] retriever so I I I think uh that's

[47:18] basically what RETRO tried to do right

[47:20] so um RETRO actually some of the people

[47:24] on the RETRO paper are at Mistral now uh

[47:27] so they they have this uh chunked cross

[47:30] attention idea here so you basically

[47:32] have a language model but the way it

[47:34] does the attention over the things you

[47:36] retrieve in your RETRO um architecture

[47:41] uh you they they kind of get integrated

[47:43] into a model not using the standard

[47:45] attention mechanism but using this

[47:48] slightly different chunked cross

[47:50] attention oh okay so I think the the

[47:53] sliding window Attention Point I was

[47:55] trying to get get at was that uh it uses

[47:57] a fixed window so that whenever you're

[48:00] doing the query key computation in the

[48:02] attention with the query vectors and the

[48:04] key vectors you're using a fixed window

[48:07] attention so I think my idea was to

[48:10] actually one use a dynamic window

[48:13] because for example the rag case um if

[48:16] you use a fixed window when you're doing

[48:18] attenion it it is possible that you

[48:21] actually are leaving you you're only

[48:23] looking at a fixed uh span of

[48:26] information so if you could maybe adapt

[48:28] Mistral so that you could make it better

[48:31] for the rag case and and for example

[48:33] the making the fixed window size the

[48:35] dynamic window uh yeah yeah I think it's

[48:39] an interesting idea so so for me uh the

[48:42] the what Mistral is doing with with the

[48:44] sliding window that's basically like a

[48:46] convnet right so we had all these

[48:48] convolutional like light convnets where

[48:51] where we would have word embeddings and

[48:52] you would do convolutions over it and

[48:54] then pool uh and then you would still

[48:56] get the information out so it's not that

[48:58] the sliding window prohibits you from

[49:00] looking earlier it's just that that

[49:02] happens higher up in your Transformer

[49:04] sort of yeah
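
The point about information from earlier tokens arriving "higher up" can be sketched in a few lines. This is a minimal illustration, assuming a causal sliding-window mask where each token attends to itself and the previous `window - 1` tokens; the function names and toy sizes are made up for the example and are not taken from Mistral's implementation. Stacking layers composes the masks, so the span the last token can draw on grows with depth.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when token i may attend to token j (causal, fixed window)."""
    return [[0 <= i - j < window for j in range(seq_len)] for i in range(seq_len)]


def compose(reach, mask):
    """One more layer: i reaches j if i attends to some k that already reaches j."""
    n = len(mask)
    return [[any(mask[i][k] and reach[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def receptive_field(seq_len, window, layers):
    """Number of positions the last token can draw information from after `layers` layers."""
    mask = sliding_window_mask(seq_len, window)
    reach = mask
    for _ in range(layers - 1):
        reach = compose(reach, mask)
    return sum(reach[-1])


if __name__ == "__main__":
    for layers in (1, 2, 4):
        print(layers, receptive_field(seq_len=32, window=4, layers=layers))
```

Under these assumptions the reach grows roughly as `layers * (window - 1) + 1` until it covers the whole sequence, which is the convnet-style receptive field argument.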

[49:07] yeah so I think that definitely is an

[49:10] interesting direction to to think in

[49:12] yeah yeah so I think um it's like not

[49:15] too crazy to say are there any

[49:17] architectural changes that we can

[49:19] introduce into these 7 billion parameter

[49:21] models so that they could be better

[49:23] adapted to the rag case

[49:27] yeah so there there there might be yeah

[49:30] I I think one one question is just how

[49:32] do you how do you do the attention over

[49:33] things you've retrieved which I think is

[49:35] what

[49:37] you're yeah

[49:39] thanks so just to make sure I understand

[49:42] so I mean in this retro model you're

[49:45] retrieving in each

[49:47] block and when you talk about putting

[49:50] the retrieval in the context are you

[49:53] saying that you only do it at the

[49:54] beginning you don't do it

[49:57] yeah so so in context so this is it's

[50:00] not exactly every layer sort of so it's

[50:02] every token right so every um every step

[50:05] basically not every block so does that

[50:09] make sense so it's not every layer that

[50:12] you do the retrieval yeah so every step

[50:16] right um so so this is kind of like like

[50:19] what rag token is so you retrieve every

[50:21] token you so you generate and then you

[50:24] can retrieve again or in the case of

[50:26] retro you can generate like a chunk and

[50:28] then you retrieve chunks again uh if you

[50:31] look at the in context case you retrieve

[50:33] once at the beginning and then you give

[50:36] it you're say that during this

[50:41] nobody yeah but so the so the in-context

[50:44] thing um so so here you don't actually

[50:48] give it as context at all like directly

[50:51] to the model right so here you get you

[50:53] let the decoder kind of attend over

[50:56] it

[51:02] yeah so I don't think cross attention

[51:05] really works yeah

[51:10] yeah other

[51:13] questions yeah we

[51:15] inside the the training of the retriever

[51:18] is not so necessary because of the

[51:21] large uh so I'm wondering what inside of

[51:24] the T like what cases are really need

[51:29] toiz update or anyway updates

[51:34] those yeah so you do want to update the

[51:36] retriever right but but only part of the

[51:38] retriever is necessary to be updated for

[51:41] a lot of these these cases um but so so

[51:46] I I think it uh so these are very

[51:48] specific data sets right natural

[51:50] questions wizard of Wikipedia and fever

[51:52] so they're really very uh kind of

[51:54] knowledge intensive tasks uh so in that

[51:57] case if you already have a very good

[51:59] system like DPR that is specifically

[52:01] pre-trained for those tasks then you

[52:04] only need to update the query encoder

[52:06] but so I would expect that if you move

[52:08] Beyond this to kind of General language

[52:10] modeling things like like retro then you

[52:13] probably do want to update the document

[52:15] encoder at least in a way where you can

[52:17] scale

[52:18] it so that in the this part that is very

[52:23] much in

[52:33] as long as we have a good opal knowledge

[52:36] of what of the maybe the documents by

[52:39] those uh good

[52:43] models yeah but so you need to learn how

[52:45] to kind of query into that Index right

[52:48] so if you if you don't do that uh then

[52:51] then yeah you don't get really good

[52:53] performance so that's sort of like your

[52:54] closed book performance right if you just

[52:57] have the language model and you're just

[52:59] like what what does the parametric model

[53:01] on its own without the retrieval what

[53:03] does it actually know as you can see

[53:05] there there are pretty big gaps there

[53:11] right other questions otherwise I will

[53:14] cover other

[53:17] questions no uh hello yeah go for it a

[53:21] quick question like so uh what about

[53:24] like more hierarchical retrieval like I

[53:26] suppose there will be methods trying to

[53:28] not just retrieve a single chunk but

[53:30] some kind of like groups of chunks or

[53:31] something or summarized versions there

[53:34] there's been some interesting work on on

[53:36] doing that uh where you first tried to

[53:38] find so you can have multiple indices

[53:40] and they can kind of cascade right so

[53:41] first you want to find the relevant

[53:43] document so you have some document

[53:44] representation and then within that

[53:46] document you want to find the relevant

[53:48] chunk uh so you can do it sort of that

[53:50] direction you can also do it in reverse

[53:52] I think I I have something on the slide

[53:54] there where you can find the chunk and

[53:56] then sort of expand uh the context

[53:59] around it and then give that to the

[54:00] language model um so I think yeah there

[54:04] are all kinds of interesting things you

[54:05] can do
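
The "find the chunk, then expand the context around it" direction can be sketched as follows. This is a toy illustration only: the word-overlap `score` function stands in for whatever retriever is actually used, and `retrieve_with_context` and `radius` are hypothetical names chosen for the example.

```python
def score(query, chunk):
    """Toy relevance score: number of distinct query words that appear in the chunk."""
    chunk_words = set(chunk.lower().split())
    return sum(1 for w in set(query.lower().split()) if w in chunk_words)


def retrieve_with_context(query, chunks, radius=1):
    """Find the best-matching small chunk, then return it together with its neighbors."""
    best = max(range(len(chunks)), key=lambda i: score(query, chunks[i]))
    start = max(0, best - radius)
    end = min(len(chunks), best + radius + 1)
    return " ".join(chunks[start:end])
```

Matching on small chunks keeps the search precise, while handing the surrounding chunks to the language model gives it enough context to answer.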

[54:07] there cool thanks I guess another

[54:10] thing just like do can you compare rag

[54:13] versus like long context LLM efforts so

[54:16] there are lot of things like on around

[54:18] just having really long context and

[54:20] extreme it could replace rag but I know

[54:22] like if your takes yeah so so my my uh

[54:26] so everybody understands this question

[54:28] right so there there's there's a trend

[54:30] where we want to have very long context

[54:32] language model so that basically you can

[54:34] like take Harry Potter or something just

[54:36] put it into context and then ask a

[54:38] question like what is the name of like

[54:40] Harry Potter's owl or something right

[54:42] and then it can just attend over the

[54:43] entire thing um so attending over all of

[54:47] Harry Potter to answer that one question

[54:49] is super inefficient right uh so most of

[54:52] Harry Potter has nothing to do with the

[54:54] owl uh so but you are still kind of

[54:56] reading it if you do it with the long

[54:58] context window um so that's why I think

[55:01] the doing it the rag way where you have

[55:02] this non-parametric component is a much

[55:05] more efficient way to solve this problem

[55:07] and if you actually look at the

[55:09] literature on Long context Windows uh

[55:11] the way they they solve the problem of

[55:14] scaling the attention mechanism is by

[55:16] making it very sparse uh so they're

[55:19] basically turning it so that's a

[55:20] different kind of sparse but they're

[55:22] turning it into a non-parametric

[55:23] retrieval problem uh kind of behind the

[55:26] scenes so they're not they're not

[55:27] actually all that different if you want

[55:29] to scale long context then you're going

[55:30] to move towards a rag style

[55:34] architecture good
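
A rough back-of-the-envelope version of the efficiency argument, assuming full (dense) self-attention that scores every query-key pair. The token counts below are invented for illustration, and the cost of the retrieval step itself is ignored.

```python
def attention_pairs(num_tokens):
    """Query-key pairs scored by full (dense) self-attention over a sequence."""
    return num_tokens * num_tokens


def rag_context_tokens(question_tokens, top_k, chunk_tokens):
    """Tokens the generator reads when only the top-k retrieved chunks are in context."""
    return question_tokens + top_k * chunk_tokens


if __name__ == "__main__":
    book_tokens = 1_000_000  # assumed size of a long book series, illustration only
    long_ctx = attention_pairs(book_tokens + 20)
    rag_ctx = attention_pairs(rag_context_tokens(20, top_k=5, chunk_tokens=200))
    print(long_ctx // rag_ctx)  # rough ratio of attention work per layer
```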

[55:38] thanks all right um so let's talk about

[55:41] some other interesting questions so one

[55:44] thing and I already alluded to this is

[55:47] when do we actually retrieve so very if

[55:49] we're doing like if we want to uh like

[55:51] retrieve every token that's also very

[55:54] inefficient because I probably don't

[55:56] have to retrieve to generate

[55:58] "the" right I can probably do that on my

[56:00] own with the language model it's sort of a

[56:02] waste to go and retrieve stuff but if I

[56:05] only retrieve once at the beginning of

[56:07] the sequence that's probably also not

[56:08] great right so so what we ideally want

[56:11] to be able to do is to say okay

[56:13] sometimes I want to retrieve sometimes I

[56:15] don't want to retrieve and I'm going to

[56:16] learn when I want to kind of expend the

[56:19] the compute Budget on doing the

[56:21] retrieval um so a nice paper where they

[56:24] have a stab at this is called flare for

[56:26] active retrieval augmentation where they

[56:28] basically have the language model decide

[56:31] uh when it should do a search and what

[56:33] it should do to search for um so so I I
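
The idea of letting the model decide when to search can be sketched as a confidence-triggered loop. This is a simplified illustration loosely in the spirit of active retrieval, not the procedure from the FLARE paper: `generate_step` and `search` are hypothetical callables supplied by the caller, and the `threshold` value is an arbitrary assumption.

```python
def generate_with_active_retrieval(question, generate_step, search,
                                   threshold=0.5, max_steps=10):
    """Generate sentence by sentence, retrieving only when the model looks unsure.

    generate_step(question, context, answer_so_far) -> (sentence, confidence), or None when done
    search(query) -> list of passages
    Both callables are hypothetical stand-ins, not a real library API.
    """
    context, answer, searches = [], [], 0
    for _ in range(max_steps):
        step = generate_step(question, context, answer)
        if step is None:
            break
        sentence, confidence = step
        if confidence < threshold:
            # Low confidence: use the tentative sentence as the search query,
            # add the results to the context, and regenerate the sentence.
            context.extend(search(sentence))
            searches += 1
            step = generate_step(question, context, answer)
            if step is None:
                break
            sentence, confidence = step
        answer.append(sentence)
    return " ".join(answer), searches
```

Confident sentences cost no retrieval at all, so the compute budget is spent only where the model appears to need outside information.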

[56:37] think this fits in a general Trend that

[56:39] you can see in the field around kind of

[56:41] Agents right so we can talk a little bit

[56:43] more about that too um so this other uh

[56:47] question that that I think we also kind

[56:49] of covered already here is how do we

[56:51] train this at scale right so we can do

[56:52] these asynchronous updates we can do

[56:54] rerankers we can do query side only

[56:57] there's this really nice paper uh which

[56:59] is quite close I think to the idea you

[57:01] proposed uh where you first use bm25 to

[57:05] create a a batch basically where

[57:07] everything is very similar uh in terms

[57:10] of what you've retrieved and now you uh

[57:13] have this kind of in-batch update so it's

[57:16] it's sort of like a ranker where you

[57:17] encode the information that is just in

[57:19] your batch using this other model and

[57:22] now you can update this model on the fly

[57:24] so you don't have to worry too much

[57:25] about doing the full kind of documents

[57:27] side update um and again here what

[57:30] really matters is like how big is your

[57:32] index if you have an amazing index you

[57:33] can basically solve any problem just by

[57:35] looking it up right so rather than

[57:38] cramming it into your parameters you can

[57:40] just find it

[57:43] um this is a really nice paper uh called

[57:46] Silo so one one of the interesting

[57:48] things I think that's going to happen in

[57:50] the next year or two around language

[57:53] models is there and you've seen this

[57:54] already there's a bunch of like lawsuits

[57:56] against OpenAI and other places around

[57:58] where does the data exactly come from um

[58:02] so one uh very elegant solution I think

[58:04] is to have a rag system that you train

[58:06] on data that you know is safe so you can

[58:09] train that thing on Wikipedia But now

[58:12] during test time you can give it a data

[58:14] store that has maybe slightly riskier uh

[58:17] information in it so this massive index

[58:20] of all the stuff on the internet

[58:21] including some things that are maybe um

[58:25] risky uh you can still have them in your

[58:27] index but your language model uh your

[58:29] retrieval augmented language model I

[58:31] should say you know that that thing is

[58:33] safe because it was trained on data that

[58:34] is public domain uh so that's what they

[58:36] do in Silo and they show that that works

[58:38] really well so that's uh one possible

[58:42] solution to to a lot of the the kind of

[58:44] compliance and legal risk around

[58:45] language model

[58:48] deployments um there's a great paper and

[58:51] also from one of your colleagues um

[58:54] around uh contexts getting lost in the

[58:57] middle I think this is also kind of a

[58:58] fascinating phenomenon this is on a

[59:00] frozen rag system um but uh language

[59:05] models are very similar to humans in

[59:07] what things they pay attention to so if

[59:09] you give them a bunch of things that you

[59:11] retrieved what what they will look at

[59:13] are like the first things you list and

[59:15] the last things you list and they will

[59:16] sort of ignore the middle um so if it

[59:19] actually respected the rank function

[59:21] then then this curve would go down all

[59:23] the way right but it sort of go goes up

[59:26] um so I I I think that's a a very

[59:28] interesting observation which kind of

[59:30] shows that how brittle uh these these

[59:33] systems can be right so if you have a

[59:35] frozen rag system it can be very very

[59:37] brittle where like the order of the

[59:39] retrieved context matters a lot in whether

[59:41] you get the right answer or

[59:44] not work on treating this as an RL problem

[59:48] sense

[59:50] ofor like specifically going for

[59:53] interpration out VOR that's going to

[59:56] inter prodct with just the right maybe

[1:00:00] you can tune for the particular

[1:00:04] dat yeah so what what I just described

[1:00:06] someone asked like how how do you

[1:00:08] actually so I said there are other ways

[1:00:10] to do this and then the question was how

[1:00:12] do you do that so the way you do that is

[1:00:13] using REINFORCE um so yeah there has

[1:00:17] been work on doing that um so some of

[1:00:20] the older papers were playing with this

[1:00:21] but one one of the big problems with uh

[1:00:25] so I think the replug solution is sort of

[1:00:27] more elegant uh for solving that problem

[1:00:31] because you actually use signal from

[1:00:33] the language model and if you just do

[1:00:34] REINFORCE it's very high variance so

[1:00:36] you're uh it's it's going to be super

[1:00:38] finicky if you don't want to destroy

[1:00:40] your

[1:00:42] index but people have tried it

[1:00:47] though um so um uh there's some some

[1:00:51] really nice work from OpenAI where they

[1:00:54] they basically show and again

[1:00:55] we're sort of like thinking more and

[1:00:57] more about agents here right uh where

[1:01:00] they show something very similar to the

[1:01:02] flare result from earlier with active

[1:01:03] retrieval that doesn't necessarily have

[1:01:05] to be some index that you own it can be

[1:01:07] just some some web search right um and

[1:01:10] obviously in this case you don't really

[1:01:12] have access to the web search

[1:01:13] necessarily so Bing or whatever they use

[1:01:15] here is not going to update its

[1:01:17] parameters uh but I just wanted to kind

[1:01:19] of put this in your mind like this is

[1:01:21] another thing you can do right and if we

[1:01:24] take this really to the general form uh

[1:01:27] then you can think of language models as

[1:01:29] just tool users um so rather than just

[1:01:32] retrieval augmenting language models we

[1:01:34] can tool augment language models and

[1:01:36] retrieval is just one of the many tools

[1:01:38] that language models have access to we

[1:01:40] can have uh rankers and things on top of

[1:01:43] the outputs of these tools um and so one

[1:01:45] of the the big questions I think uh is

[1:01:48] how do you actually get the system to to

[1:01:50] learn stuff right so we're going to need

[1:01:52] our help if we want this system to

[1:01:54] really learn how to take these

[1:01:55] actions uh

[1:01:57] properly

[1:01:58] um um and and so yeah this has been

[1:02:01] taken to to the extreme in this uh sort

[1:02:04] of self rag architecture where they have

[1:02:06] this sort of retrieval step and it's

[1:02:07] active and then you criticize it and

[1:02:09] then you uh basically do some natural

[1:02:11] language inference uh and all of that

[1:02:13] just with one language model to answer

[1:02:16] uh the

[1:02:17] questions um so the other missing piece

[1:02:20] so I'm just kind of going through a

[1:02:22] bunch of open questions uh that that

[1:02:24] people have looked at uh but feel free

[1:02:26] to interrupt me if there's anything you

[1:02:27] want to know um but so instruction

[1:02:30] tuning we established at the beginning

[1:02:32] of the lecture that this is pretty

[1:02:33] important for getting things to work so

[1:02:35] fixing the user interface um but the

[1:02:39] instruction tuning has almost always

[1:02:41] only happened on the language model and

[1:02:43] not on the entire system so I think one

[1:02:45] of the interesting uh things that people

[1:02:47] are looking at now with with things like

[1:02:49] RA-DIT and InstructRetro is how can we

[1:02:51] instruction fine-tune an entire retrieval

[1:02:53] augmented system so all the way into the

[1:02:55] retrieval step can we generate data so

[1:02:58] that that also follows the instructions

[1:03:00] properly which currently doesn't happen

[1:03:02] in any of these model

[1:03:04] architectures um and then finally I I

[1:03:07] think I would be remiss if I if I didn't

[1:03:09] really talk about what people call

[1:03:11] Advanced rag so so like the developer

[1:03:13] Community has been really doing some

[1:03:15] awesome stuff uh so like Frameworks like

[1:03:18] LlamaIndex and LangChain and there's

[1:03:19] all these open source Vector databases

[1:03:21] like Chroma and Weaviate and they're all sort

[1:03:24] of about making rag really easy but this

[1:03:26] is all Frozen rag right but even with

[1:03:29] frozen rag you can really do incredible

[1:03:31] things um so uh we mentioned some of

[1:03:34] these already so child parent recursive

[1:03:36] retriever so you find small small parts

[1:03:38] and then you give the big parts around

[1:03:40] it to the language model you can do

[1:03:42] hybrid search where we use reciprocal

[1:03:44] rank Fusion so we have like different

[1:03:45] search results that we then combine

[1:03:48] before we give the final thing to the

[1:03:49] language model there's zero shot like

[1:03:52] large language model ranker so basically

[1:03:54] the score function is not doesn't come

[1:03:56] from your retrieval it comes directly

[1:03:58] from the language model um and then uh

[1:04:01] hypothetical document embeddings which I

[1:04:02] think is a really cool idea so you just

[1:04:05] uh basically you fix hallucination

[1:04:07] through hallucination uh so you get a

[1:04:10] question then you let the language model

[1:04:12] hallucinate a bunch of possible answers

[1:04:14] then you go and search for nearest

[1:04:16] neighbors to the possible answers and

[1:04:17] you give those as context and then it

[1:04:19] gives the right answer based on that

[1:04:21] right so it's really like hallucinating

[1:04:23] answers and I think it's a brilliant

[1:04:26] solution um so there's a lot of stuff

[1:04:28] happening in in the kind of frozen rag

[1:04:31] Community uh to that I think is very

[1:04:33] interesting to look at um so uh just to
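
Of the frozen rag techniques listed above, reciprocal rank fusion is the easiest to show in code. A minimal sketch, assuming each search backend returns a ranked list of document ids; the constant `k=60` is a commonly used default rather than something stated in the lecture, and the document ids are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one list.

    Each document scores sum(1 / (k + rank)) over the lists that contain it,
    with rank starting at 1. Ties are broken by document id for determinism.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


sparse_hits = ["d1", "d2", "d3"]  # e.g. from BM25
dense_hits = ["d3", "d1", "d4"]   # e.g. from an embedding index
fused = reciprocal_rank_fusion([sparse_hits, dense_hits])
```

Only ranks are combined, so the sparse and dense scores never need to be put on a common scale.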

[1:04:37] wrap up kind of looking at the future of

[1:04:40] this stuff uh there are still lots of

[1:04:42] very interesting open questions so if

[1:04:44] you're a student thinking about how to

[1:04:46] solve any of these I think you can have

[1:04:49] quite a lot of impact um so how how

[1:04:53] exactly do we do like pre-training of

[1:04:55] this architecture and do we even need to

[1:04:56] pre-train I think even retro kind of

[1:04:59] shows that you don't necessarily have to

[1:05:00] pre-train so but maybe there's something

[1:05:02] wrong with how we um how we do that what

[1:05:05] do scaling laws look like so I think

[1:05:07] there's a really interesting question

[1:05:08] here around if I have a huge index and a

[1:05:11] very rich encoder of all the information

[1:05:13] in that index maybe I can move so

[1:05:16] basically decouple all the memorization

[1:05:18] to this index so I have a language model

[1:05:20] that doesn't know anything it just

[1:05:22] speaks English it just sort of reasons on top

[1:05:24] but it has no knowledge because that

[1:05:26] always comes from this retriever if you

[1:05:28] can do something like that then you get

[1:05:29] very interesting scaling tradeoffs right

[1:05:31] so you can have a tiny language model

[1:05:33] and and do your retrieval uh to do a lot

[1:05:36] of the heavy lifting with your retrieval

[1:05:38] which is nice because that's a cached

[1:05:40] computation right so you can just you

[1:05:42] already have the the embeddings you just

[1:05:44] need to do the dot product so it's much

[1:05:46] more efficient than kind of self

[1:05:48] attention in the language model um can
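
The "cached computation" point can be made concrete: document embeddings are computed once and stored, so query time only needs dot products. A toy sketch with hand-written 3-dimensional vectors; in practice both sides would come from trained encoders, and the names here are illustrative.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k(query_vec, doc_vecs, k=2):
    """Rank cached document embeddings by dot product with the query embedding."""
    scored = sorted(((dot(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()),
                    reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Document embeddings are computed once, offline, and stored (toy vectors here).
doc_vecs = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.1, 0.8, 0.1],
    "doc_c": [0.0, 0.2, 0.9],
}
query_vec = [0.2, 0.9, 0.0]  # would come from a query encoder at request time
print(top_k(query_vec, doc_vecs))
```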

[1:05:51] we move Beyond bi-encoder so Vector

[1:05:53] databases um I I like people who build

[1:05:56] Vector databases but I'm not sure how

[1:05:58] long we're going to keep Vector

[1:06:00] databases um because uh I think

[1:06:04] rerankers probably work just as well and

[1:06:06] bm25 is much more efficient than a

[1:06:08] vector database um so I I don't really

[1:06:13] see why we need dedicated Vector

[1:06:15] databases and so what we're seeing but

[1:06:17] maybe this is a bit of a critique of uh

[1:06:20] maybe Silicon Valley investment

[1:06:22] strategies and things like that but a

[1:06:23] lot of these

[1:06:24] um um Vector database companies are

[1:06:27] basically becoming database companies

[1:06:28] now so they are adding all this sparse

[1:06:30] stuff because the the dense thing is not

[1:06:32] enough um and as it turns out there are

[1:06:34] a lot of pretty good uh sparse databases

[1:06:38] out there already like Postgres and

[1:06:39] things like that and they're also all

[1:06:41] adding vectors uh to their databases so

[1:06:45] uh I think that's all going to kind of

[1:06:46] coalesce into

[1:06:50] databases um so um I think there are some

[1:06:54] interesting things to look at for kind

[1:06:56] of the data so alluding to this

[1:06:57] instruction problem can we generate much

[1:07:00] better data for training rag systems

[1:07:03] synthetically uh and then I think

[1:07:05] there's this massive open question

[1:07:06] around how we actually measure whether

[1:07:08] the rag system is any good so right now

[1:07:10] we just look at Downstream performance

[1:07:13] um um which is sort of okay but if you

[1:07:15] mess up the retrieval it's very hard to

[1:07:17] measure um but how to how to measure

[1:07:20] whether your retrieval is right is also

[1:07:22] very difficult so there are some

[1:07:23] Frameworks where they try to take like

[1:07:25] the harmonic mean of your retrieval

[1:07:27] accuracy and your language model

[1:07:29] accuracy uh but I think those are also

[1:07:31] very shaky because we don't really have

[1:07:33] very good uh data sets to measure that

[1:07:35] on so I think that's that's a very cool

[1:07:37] problem to work on as well um so the
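
The harmonic-mean style of metric mentioned above is simple to write down. A minimal sketch, assuming both component scores are accuracies in the range 0 to 1; the function name is illustrative and not taken from any particular evaluation framework.

```python
def harmonic_mean(retrieval_accuracy, generation_accuracy):
    """Harmonic mean of two scores in [0, 1]; low if either component is low."""
    if retrieval_accuracy <= 0 or generation_accuracy <= 0:
        return 0.0
    return (2 * retrieval_accuracy * generation_accuracy
            / (retrieval_accuracy + generation_accuracy))
```

Unlike an arithmetic mean, this punishes a system whose retrieval is poor even when the generator scores well, but it still depends on having datasets that label both components.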

[1:07:41] other problem that I personally am

[1:07:43] always very excited about is

[1:07:45] multimodality um and so why would we

[1:07:48] stop with rag systems with just text

[1:07:51] right so you can do the same thing with

[1:07:53] images uh you can augment language

[1:07:55] models with vision so we did this work

[1:07:57] on lens where we have a language model

[1:08:00] enhanced to see uh where you can just

[1:08:02] give kind of a computer vision pipeline

[1:08:05] just like a retrieval Pipeline and give

[1:08:07] that to a frozen language model and pass

[1:08:09] it to the context and that system

[1:08:11] actually is an amazing visual question

[1:08:13] answering system it's close to

[1:08:15] state-of-the-art uh sort of flamingo

[1:08:17] from Deep Mind which is also very hard

[1:08:19] to reproduce because there's no open

[1:08:21] source version of that um

[1:08:24] so so we've done some early work on this

[1:08:26] in in 2021 uh where we have this cross

[1:08:29] modal retrieval and there's some uh more

[1:08:32] recent work out of FAIR where they also

[1:08:34] look at this so I think that's really

[1:08:36] like if you look at the trend in the

[1:08:37] field like multimodality with GPT-4V and

[1:08:40] things like that is really a Hot Topic

[1:08:41] so everything is kind of going in that

[1:08:43] direction uh so it's an interesting

[1:08:45] thing to think

[1:08:47] about um so overall I think um it would

[1:08:51] be nice if everybody sort of moves away

[1:08:53] from from rag 1.0 the Frozen Frankenstein

[1:08:56] Rag and moves towards this much more

[1:08:58] kind of optimized version rag 2.0 so

[1:09:01] it's really about systems over models

[1:09:03] right it's not just your language model

[1:09:05] and your Retriever and they're kind of

[1:09:06] separate it's about thinking from the

[1:09:08] from a systems perspective about the

[1:09:10] entire thing and the problem you're

[1:09:11] trying to solve and so I think that

[1:09:14] really is the way that in deep learning

[1:09:16] things have always progressed where if

[1:09:17] you optimize the system end to end

[1:09:20] that's always going to win out like back

[1:09:21] in the day in computer vision or NLP we

[1:09:23] have like parsers and scam parsers and

[1:09:25] all this kind of stuff and all that just

[1:09:27] doesn't exist anymore now because we

[1:09:30] optimize the system end to end uh so

[1:09:32] that's what's going to happen here too uh

[1:09:35] so if we take that to the extreme like

[1:09:36] there's a chunker thing in your

[1:09:38] documents right like cutting it up

[1:09:39] into pieces like you could backprop into

[1:09:41] that like why not somebody should really

[1:09:44] do that um and so yeah I I think like

[1:09:48] trading off cost and quality uh and zero

[1:09:50] shot domain generalization that's really

[1:09:52] like where this stuff is going to come

[1:09:53] in so language models right now they're

[1:09:55] amazing but very often they're way too

[1:09:57] expensive for being deployed somewhere

[1:09:59] where you can actually make money from

[1:10:01] them if you're in a company um so what

[1:10:03] you want to do is make it much more

[1:10:05] efficient and have the right cost

[1:10:07] quality tradeoff and the the easiest way

[1:10:09] I can think of is to do it through

[1:10:10] retrieval augmentation but obviously I'm

[1:10:12] I'm very biased um so uh yeah that that

[1:10:16] was all I had actually um so if you're

[1:10:18] interested in this I'm I'm at Stanford

[1:10:20] so I can work with you on research

[1:10:23] projects on these topics or if you want

[1:10:25] you can also join contextual because we

[1:10:27] work on this stuff every day thank

[1:10:30] you well um sorry I had a question from

[1:10:35] earlier yeah I think you said something

[1:10:37] really uh really I think really super

[1:10:40] helpful earlier about Mistral 7B you talked

[1:10:42] about you compared the sliding window

[1:10:44] attention to convolutional neural

[1:10:46] networks and I do see the parallel

[1:10:48] because with convolutional neural

[1:10:49] networks you have uh several layers of

[1:10:51] several different layers of

[1:10:52] convolutional layers and the top

[1:10:54] convolution layers are able to see um a

[1:10:57] larger receptive field than the bottom

[1:10:58] convolution layers and um and with

[1:11:01] convolution layers you're able to tune

[1:11:03] the um filter sizes and the stride so

[1:11:07] you're able to see a different receptive

[1:11:09] field and I was wondering if you could

[1:11:11] see that same innovation in Mistral 7B by

[1:11:14] tuning um because you have different

[1:11:16] Transformer layers and each Transformer

[1:11:18] layer will have a span over a different

[1:11:19] set of tokens and if you can tune I

[1:11:21] guess the Transformer architecture the

[1:11:23] way you tune those convolution layers

[1:11:25] the filter sizes the receptive field

[1:11:27] perhaps we can do some optimization in

[1:11:29] the Transformer realm that we have

[1:11:31] already done in convolution layers yeah

[1:11:34] I I think that so that's a good idea

[1:11:36] there's there's a great paper on light

[1:11:38] convolutions I think from Michael Auli

[1:11:40] and David Grangier and a bunch of people where

[1:11:43] it's basically uh this this came out at

[1:11:46] exactly the same time as the Transformer

[1:11:48] and the Transformer is slightly more

[1:11:49] optimized for GPU computation but the

[1:11:52] the convolutional model was actually

[1:11:54] slightly better than the Transformer um

[1:11:57] so I it's definitely worth exploring

[1:12:00] okay cool

[1:12:04] thanks advant the re ranker

[1:12:07] with that does that

[1:12:12] advantages TR that yeah so it depends on

[1:12:15] the problem I I I think what you

[1:12:17] probably want to do is is sort of cast a

[1:12:19] wide net with bm25 and then just narrow

[1:12:23] it down with dense search uh so you you

[1:12:25] often see that kind of as a two-stage

[1:12:27] process where the first one is kind of

[1:12:28] noisy you can add noise actually to your

[1:12:31] retrieval and then you use the dense one

[1:12:33] to filter it
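
The two-stage process described here, a wide BM25 pass followed by dense filtering, can be sketched as below. The BM25 scoring follows the standard Okapi formula with common default parameters; `embed` is a hypothetical callable standing in for a real dense encoder, and `two_stage_search`, `first_k`, and `final_k` are names invented for the example.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of every document for a whitespace-tokenized query."""
    tokenized = [d.lower().split() for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    df = Counter(term for t in tokenized for term in set(t))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        s = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((len(docs) - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            s += idf * tf[term] * (k1 + 1) / norm
        scores.append(s)
    return scores


def cosine(u, v):
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0


def two_stage_search(query, docs, embed, first_k=3, final_k=1):
    """Stage 1: wide, cheap BM25 candidate set. Stage 2: narrow it with dense similarity."""
    sparse = bm25_scores(query, docs)
    candidates = sorted(range(len(docs)), key=lambda i: -sparse[i])[:first_k]
    q_vec = embed(query)
    reranked = sorted(candidates, key=lambda i: -cosine(q_vec, embed(docs[i])))
    return [docs[i] for i in reranked[:final_k]]
```

The expensive dense comparison only runs on the `first_k` candidates, which is what keeps the overall cost close to that of the sparse pass.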

[1:12:35] down yeah everyone's trying to maybe

[1:12:39] adapt their models to

[1:12:42] own domain specific area like I think

[1:12:46] there are many two ways project one way

[1:12:48] is to use instruction tuning in learning way

[1:12:52] or B tuning like

[1:12:53] meth and another way is just the main

[1:12:56] topic of this lecture is using retrieval or

[1:13:01] so I'm wondering besides the low cost

[1:13:03] advantage of the retrieval augmented way do you think

[1:13:07] the capacity or the quality of augmented

[1:13:11] can be with those

[1:13:13] T learning yeah so I I think actually

[1:13:17] what what's going to happen is that all

[1:13:19] of this will come together right so so

[1:13:22] if you train things like end to end rag

[1:13:25] 2.0 style then you can also fine-tune

[1:13:27] that system on some use case end to

[1:13:30] end right so what why would you just

[1:13:33] take the retrieval augmented system if

[1:13:35] you can also fine-tune it on the thing you

[1:13:37] care about so I think in the end

[1:13:38] everybody's going to do all of those

[1:13:40] things and then there's questions like

[1:13:42] how do you do that efficiently so that's

[1:13:43] why you would use adapters or things like

[1:13:48] that think there was another

[1:13:52] question I'm curious about Hardware you

[1:13:54] say it's going to become database kind

[1:13:56] of thing respons database but what about

[1:14:00] retrieval hardware and you SM because

[1:14:05] we've thought so much of the you know

[1:14:07] the Le part but what about because it's

[1:14:11] hug trillions said so you have any ideas

[1:14:15] just a database problem so I don't know

[1:14:17] if I'm allowed to say this exactly

[1:14:19] actually but uh so one of the the

[1:14:23] biggest chip manufacturers that recently

[1:14:26] their stock has done really well they

[1:14:27] have some dedicated retrieval Hardware

[1:14:30] coming out I think soon or it might

[1:14:31] already be

[1:14:33] out um so yeah so yeah that

[1:14:37] like very efficient uh dense retrieval

[1:14:40] is a very big

[1:14:46] business are

[1:14:51] questions Sol

[1:14:58] um yes I I think I think so if you take

[1:15:01] it to the extreme so one of the big

[1:15:03] problems right now is that that if you

[1:15:05] contextualize an existing language model

[1:15:07] that already

[1:15:08] hallucinates then then it's going to be

[1:15:10] kind of hard to get rid of the

[1:15:11] hallucination right so if you do replug

[1:15:13] on

[1:15:14] GPT-4 GPT-4 might still hallucinate so you

[1:15:18] it could basically just ignore all the

[1:15:19] stuff you retrieved and just do whatever

[1:15:21] it wants anyway uh so that's one of the

[1:15:23] reasons why you want to train the system

[1:15:25] end to end and if you take that to the

[1:15:26] extreme where like I said right if you

[1:15:28] can just have the language model only

[1:15:31] reason and speak so it knows English and

[1:15:33] reasoning but it has no knowledge which

[1:15:35] all comes from somewhere else then then

[1:15:38] you can't hallucinate so it's really all

[1:15:40] grounded in whatever is in your

[1:15:47] index but they're so they're they're

[1:15:49] about hallucination I I'm sort of

[1:15:51] frustrated that a lot of people in the

[1:15:53] field misunderstand what hallucination

[1:15:55] even means right so a lot of people are

[1:15:57] conflating hallucination with

[1:15:58] correctness or incorrectness so they're

[1:16:00] like oh the model made a mistake it

[1:16:02] hallucinated it's like no it made a

[1:16:04] mistake that's different from

[1:16:06] hallucination hallucination I think is

[1:16:07] very specific kind of I retrieved

[1:16:10] something so I have some sort of

[1:16:11] counterfactual ground truth and what I'm

[1:16:14] saying uh does not correspond to that

[1:16:16] ground

[1:16:17] truth um and so yeah I think there's a

[1:16:22] bunch of folks at Stanford also

[1:16:23] working on better like measurements of

[1:16:25] hallucination and definitions and things

[1:16:27] like

[1:16:30] that understanding correctly your definition of

[1:16:33] hallucination only makes sense in

[1:16:36] context yeah of some ground truth right so

[1:16:40] so hallucination is really like there

[1:16:43] there is something that is true right so

[1:16:45] so if we're talking about like

[1:16:47] hallucination yeah so if we're talking

[1:16:48] about just general parametric language

[1:16:50] models then sort of the ground truth is

[1:16:52] whatever we can consider to be true

[1:16:56] right but we had a word for like

[1:16:59] language models making mistakes before

[1:17:01] it was called making

[1:17:06] mistakes yeah

[1:17:08] ground I guess you're solving the house

[1:17:12] question on that path are you working on

[1:17:15] on

[1:17:17] ground you

[1:17:19] know never been president everything

[1:17:26] this yeah so so I I like the sort of

[1:17:29] SILO mention there as well so I I think

[1:17:32] the whole point is that you can you can

[1:17:35] have different indices and different

[1:17:36] definitions of ground truth and so um I

[1:17:39] think you could say I only trust the

[1:17:42] arXiv or I only trust like peer-reviewed

[1:17:44] papers and not just arXiv uh and so

[1:17:47] you can make decisions in your

[1:17:49] architecture during test time about what

[1:17:50] you'd define as ground truth
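
Editor's sketch: the idea of choosing at test time which index counts as ground truth can be illustrated with a toy retriever that filters passages by a source label before ranking. Everything here is an assumption for illustration and not something shown in the lecture: the `Passage` class, the `"arxiv"` and `"peer_reviewed"` labels, the word-overlap score, and the three example passages are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    text: str
    source: str  # hypothetical label, e.g. "arxiv" or "peer_reviewed"


def overlap_score(query: str, passage: Passage) -> int:
    """Toy relevance score: count of distinct query words found in the passage."""
    query_words = set(query.lower().split())
    passage_words = set(passage.text.lower().split())
    return len(query_words & passage_words)


def retrieve(query, index, trusted_sources, k=2):
    """Rank only the passages whose source is trusted for this request."""
    allowed = [p for p in index if p.source in trusted_sources]
    ranked = sorted(allowed, key=lambda p: overlap_score(query, p), reverse=True)
    return ranked[:k]


# Hypothetical mixed index: the same store holds both kinds of documents.
INDEX = [
    Passage("retrieval augmentation reduces hallucination in language models", "peer_reviewed"),
    Passage("a preprint claims retrieval augmentation always removes hallucination", "arxiv"),
    Passage("dense retrieval uses embeddings for semantic similarity", "peer_reviewed"),
]

query = "does retrieval augmentation reduce hallucination"

# The definition of ground truth is a per-request argument, not a training-time choice.
strict = retrieve(query, INDEX, trusted_sources={"peer_reviewed"})
broad = retrieve(query, INDEX, trusted_sources={"peer_reviewed", "arxiv"})

for label, results in (("strict", strict), ("broad", broad)):
    print(label, [p.source for p in results])
```

The design point is only that the trusted set is passed in with each query, so the same generator can be grounded in different indices without retraining. A real system would use an actual sparse or dense retriever in place of the word-overlap score.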

[1:17:53] and I also think actually that uh and

[1:17:57] there's a bunch of work I think

[1:17:58] happening on this right now you can

[1:17:59] control for how how grounded you want to

[1:18:01] be in your ground truth so uh that's

[1:18:05] another kind of misconception about

[1:18:06] hallucinations like sometimes

[1:18:08] hallucinations are actually good right

[1:18:10] if you have a creative writing assistant

[1:18:11] and you want it to come up with some cool

[1:18:13] new ideas you want the language model to

[1:18:15] hallucinate uh so I I think what you

[1:18:18] want to have is kind of a tunable knob

[1:18:19] where you say like now you can

[1:18:21] hallucinate and now maybe you should

[1:18:22] like really tell me the truth

[1:18:30] only anything

[1:18:38] else control

[1:18:41] yeah yeah so but the temperature that's

[1:18:44] just about how you sample right so how

[1:18:46] flat your your distribution is

[1:18:50] sample

[1:18:51] yeah

[1:18:53] yes but so even if you have a low

[1:18:55] temperature it can still come up with

[1:18:57] random stuff right so it just says that

[1:19:00] then you're very likely to do like

[1:19:01] greedy sampling um so so I I think what

[1:19:05] you want to get at is is something more

[1:19:07] sophisticated than
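
Editor's sketch: the point that temperature only controls how flat the sampling distribution is can be shown with a temperature-scaled softmax. The logits below are made-up numbers and `softmax_with_temperature` is a hypothetical helper written for this illustration, not code from the lecture.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max before exponentiating for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical next-token scores for three candidate tokens.
logits = [2.0, 1.0, 0.5]

hot = softmax_with_temperature(logits, temperature=2.0)   # flatter distribution
cold = softmax_with_temperature(logits, temperature=0.1)  # sharply peaked distribution

print("T=2.0:", [round(p, 3) for p in hot])
print("T=0.1:", [round(p, 3) for p in cold])
```

At the low temperature almost all probability mass sits on the highest-scoring token, which is close to greedy decoding. The ranking of tokens is unchanged, though, so if the model's top-scoring continuation is not supported by the retrieved documents, lowering the temperature does not fix that.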

[1:19:14] that lots of interesting questions yeah

[1:19:17] I like the questions thanks again for the

[1:19:19] great

[1:19:21] than
