What is RAG? The skill every AI engineer needs
This video explains Retrieval Augmented Generation (RAG), a common skill in Gen AI job postings. It covers what RAG is, its types (vector, vectorless, hybrid, graph, SQL, and reasoning-based), and demonstrates a customer care chatbot project. The video also provides resources and interview questions.
RAG is like a smart student (LLM) with a book (external knowledge) for an open-book exam. The LLM uses its language skills to find answers in the provided document.
Step 1: Indexing – chunk documents, convert to vector embeddings, store in vector DB. Step 2: Retrieval – embed user query, find relevant chunks via semantic search, and feed them to LLM with the question.
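The two steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the code from the video's project: `toy_embed` is a bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector database, and the policy sentences are made up for the example.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, size: int = 80, overlap: int = 20) -> list[str]:
    """Split text into fixed-size character chunks that overlap slightly."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


def toy_embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    """Step 1 (indexing): store every chunk next to its vector."""
    return [(chunk, toy_embed(chunk)) for chunk in chunks]


def retrieve(index: list[tuple[str, Counter]], question: str, top_k: int = 2) -> list[str]:
    """Step 2 (retrieval): embed the question and return the top-k closest chunks."""
    q = toy_embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]


def build_prompt(question: str, chunks: list[str]) -> str:
    """Place only the retrieved chunks in the prompt, followed by the question."""
    context = "\n---\n".join(chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Made-up policy text, already split into chunks for readability.
chunks = [
    "The company contributes 5 percent of salary to the employee retirement fund.",
    "The company was founded by two engineers who value an open culture.",
    "Employees receive 20 days of paid leave every year.",
]
index = build_index(chunks)
question = "How much does the company contribute to the retirement fund?"
print(build_prompt(question, retrieve(index, question, top_k=1)))
```

With a real embedding model the match is based on meaning rather than shared words, so the toy vectors here understate what semantic search can do.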
RAG provides accurate answers and reduces hallucinations by grounding responses in source knowledge. It is cost-effective because only relevant chunks are sent to the LLM, reducing token usage.
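A rough back-of-the-envelope calculation shows why the token saving matters. Every number below is a hypothetical assumption chosen for illustration (tokens per page, tokens per chunk, and the price), not a figure from the video or from any provider's price list.

```python
def prompt_cost(num_tokens: int, price_per_million: float) -> float:
    """Cost in dollars of sending num_tokens input tokens."""
    return num_tokens / 1_000_000 * price_per_million


# Hypothetical numbers, for illustration only.
TOKENS_PER_PAGE = 500       # assumed average for a text-heavy page
TOKENS_PER_CHUNK = 150      # assumed size of one retrieved chunk
PRICE_PER_MILLION = 1.00    # assumed price in dollars per million input tokens

full_document_tokens = 3000 * TOKENS_PER_PAGE   # sending a whole 3,000-page PDF
rag_tokens = 5 * TOKENS_PER_CHUNK               # sending only the top 5 chunks

full_cost = prompt_cost(full_document_tokens, PRICE_PER_MILLION)
rag_cost = prompt_cost(rag_tokens, PRICE_PER_MILLION)
print(f"full document: {full_document_tokens} tokens, ${full_cost:.5f} per question")
print(f"top-5 chunks:  {rag_tokens} tokens, ${rag_cost:.5f} per question")
```

Under these assumptions the full document is 1,500,000 tokens per question against 750 tokens for the retrieved chunks, and that gap repeats on every question asked.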
The demo is a customer care assistant RAG project built with LangChain, Chroma DB, and Hugging Face embeddings. It ingests a PDF, a CSV of FAQs, and a SQLite database of tickets to answer user queries.
Naive RAG (vector) retrieves top-K chunks from vector DB. Hybrid RAG combines vector and keyword search for better results in production.
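One common way to merge a vector ranking and a keyword ranking is reciprocal rank fusion (RRF). The video only says the two result lists are merged, so RRF is an assumption here, as are the chunk IDs and the constant `k = 60`. A minimal sketch:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists; each hit adds 1 / (k + rank) to its chunk's score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


# Hypothetical results from the two searches, best match first.
vector_hits = ["chunk_7", "chunk_2", "chunk_9"]
keyword_hits = ["chunk_2", "chunk_4", "chunk_7"]

merged = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(merged)
```

Chunks that appear in both lists rise to the top, which is the behaviour hybrid search is after: a chunk that matches both by meaning and by exact keyword is usually the safest one to send to the LLM.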
Keyword RAG (BM25, TF-IDF) works for exact keyword matches. Graph RAG uses knowledge graphs for multi-hop reasoning. SQL RAG converts natural language to SQL queries. Reasoning-based RAG (Page Index) uses document structure and LLM reasoning without vectors.
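The SQL RAG flow mentioned above (question to SQL, run the query, phrase the answer) can be sketched with Python's built-in `sqlite3` module. `fake_llm_to_sql` is a hypothetical stand-in for a real LLM call, and the `sales` table and its rows are invented for the example. The sketch also skips the answer-writing LLM call and any date filter.

```python
import sqlite3


def fake_llm_to_sql(question: str) -> str:
    """Hypothetical stand-in for an LLM that writes SQL for the question."""
    return (
        "SELECT product, SUM(quantity) AS total "
        "FROM sales GROUP BY product ORDER BY total DESC LIMIT 1"
    )


def answer_with_sql(conn: sqlite3.Connection, question: str) -> str:
    """Generate SQL for the question, execute it, and phrase the result."""
    sql = fake_llm_to_sql(question)
    rows = conn.execute(sql).fetchall()
    # A real system would hand the rows back to the LLM to write the answer.
    product, total = rows[0]
    return f"{product} sold the most ({total} units)."


# Invented sales data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, quantity INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("router", 3), ("sim card", 10), ("router", 4)],
)

answer = answer_with_sql(conn, "Which product sold the most?")
print(answer)
```

Because the SQL comes from a model, a production setup would typically run it on a read-only connection and validate it before execution.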
RAG is a powerful technique for grounding LLMs in external knowledge, improving accuracy and cost-efficiency. Understanding different RAG types helps choose the right approach for specific use cases.
What does RAG stand for?
Retrieval Augmented Generation
00:04
What are the two main steps in the RAG process?
Indexing and Retrieval
02:01
What is an embedding?
A process of converting text into a vector that represents its meaning.
03:53
Name two vector databases mentioned in the video.
Milvus, Qdrant, Chroma DB (any two)
04:44
What is the benefit of using RAG for cost?
It reduces token usage by sending only relevant chunks to the LLM, saving API costs.
06:05
What is hybrid RAG?
Combining vector search and keyword search in parallel and merging results.
09:28
What is SQL RAG?
Converting natural language questions to SQL queries using an LLM, executing them, and generating answers.
11:08
What is the key difference between vector RAG and vectorless RAG?
Vector RAG uses embeddings and vector databases; vectorless RAG uses keyword matching or reasoning without vectors.
08:18
What is the Page Index method?
A reasoning-based RAG that generates a tree structure from documents and uses LLM reasoning to find relevant chunks without vectors.
11:47
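The tree traversal behind the Page Index idea can be sketched as follows. This is not the Page Index project's API: the `Node` class, the sample contract outline, and the page ranges are hypothetical, and `pick_child` uses simple word overlap as a stand-in for the LLM reasoning step.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    title: str
    summary: str
    pages: tuple[int, int]
    children: list["Node"] = field(default_factory=list)


def words(text: str) -> set[str]:
    return set(re.findall(r"[a-z]+", text.lower()))


def pick_child(question: str, children: list[Node]) -> Node:
    """Stand-in for LLM reasoning: pick the child sharing the most words with the question."""
    q = words(question)
    return max(children, key=lambda c: len(q & words(c.title + " " + c.summary)))


def locate(question: str, root: Node) -> tuple[Node, list[str]]:
    """Walk from the root to a leaf, choosing one child at every level."""
    node, path = root, [root.title]
    while node.children:
        node = pick_child(question, node.children)
        path.append(node.title)
    return node, path


# Hypothetical table-of-contents tree with made-up page ranges.
root = Node("Contract", "whole document", (1, 90), [
    Node("Formation of contracts", "offer, acceptance and consideration", (1, 40)),
    Node("Performance of contracts", "obligations, breach and compensation", (41, 90), [
        Node("Time and place of performance", "when and where promises must be performed", (41, 69)),
        Node("Compensation for breach", "compensation for loss or damage caused by breach of contract", (70, 80)),
    ]),
])

leaf, path = locate("What does the contract say about compensation for breach of contract?", root)
print(" > ".join(path), "| pages", leaf.pages)
```

The leaf only gives a pointer (here a page range); the actual text still has to be read from the original document at that location before the LLM answers.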
RAG Analogy
Provides an intuitive understanding of RAG using a student and open-book exam analogy.
00:42
Two-Step RAG Process
Clearly explains the core mechanism of indexing and retrieval.
02:01
Benefits of RAG
Highlights accuracy and cost-effectiveness as key advantages.
05:45
RAG Categories Overview
Provides a comprehensive taxonomy of RAG types, including vector, vectorless, hybrid, graph, SQL, and reasoning-based.
08:18
Page Index Method
Introduces a novel reasoning-based RAG approach without vectors.
11:47
[00:00] In almost all Gen AI engineer job
[00:02] postings, you will find one common
[00:04] skill, retrieval augmented generation,
[00:06] also known as RAG. In my company,
[00:09] Atliq, when we build AI projects for our
[00:11] clients, more than 40% of these projects
[00:14] have a RAG component in them. So, what
[00:16] exactly is RAG? What are the different
[00:18] types? Is naive RAG dead due to
[00:20] vectorless RAG? We are going to cover
[00:22] all these RAG basics in very simple
[00:25] and intuitive language. We will not just
[00:27] talk theory. I will show you a RAG
[00:30] project, which is a customer care chatbot
[00:32] in the telecom domain. In the end, I will
[00:35] share some useful resources, including
[00:38] RAG interview questions. All right,
[00:40] let's get started. Let's understand RAG
[00:42] using a simple example. When you ask
[00:44] ChatGPT a policy question for some
[00:47] private company, it won't be able to
[00:49] answer it because ChatGPT is trained on
[00:53] general internet knowledge. It doesn't
[00:55] know the HR policy details of any
[00:58] private company. But, if you give HR
[01:02] policy document to this LLM, which acts
[01:04] like a brain, it should be able to read
[01:07] a relevant section and provide you with
[01:09] the answer. This is similar to having a
[01:13] very smart student Mira, who is a
[01:15] computer science student, and you are
[01:17] asking her to appear in a microbiology
[01:20] exam, which is an open book exam. Now,
[01:22] Mira is generally good in terms of
[01:24] reading, writing, comprehension,
[01:26] understanding, etc., but she doesn't
[01:28] know anything about microbiology. But,
[01:30] in this exam, she has been given a book
[01:33] on microbiology from which they are
[01:35] going to ask the questions. Now, Mira
[01:37] can use her reading, writing,
[01:40] comprehension skills to look at the
[01:42] book and she can write answers in the
[01:45] exam. So, here Mira's brain is like an LLM,
[01:48] which has a good understanding of
[01:49] language, it has reasoning capabilities,
[01:52] and the book is like an HR policy
[01:54] document. It is external knowledge
[01:56] that the LLM can look into to pull the
[01:59] answers. Let's now understand the
[02:01] two-step process of how RAG works
[02:04] underneath. So, here I have HR policy
[02:06] document from Atliq, which will have a
[02:08] section on retirement benefits, okay?
[02:12] So, now if I go to ChatGPT, and if I
[02:14] copy this particular section, okay, and
[02:17] I ask my question related to
[02:20] contribution to the employees' retirement
[02:22] fund, in that case, ChatGPT will be able
[02:25] to answer that question. Because here
[02:27] you are asking a question and providing
[02:31] knowledge as a reference in the context,
[02:33] and it can pull the answer. But, what if
[02:36] your HR policy document is a 3,000-page
[02:38] PDF, okay? What if that knowledge is
[02:41] very big? What's going to happen in that
[02:44] case is you will run out of your context
[02:46] window limit. And even if you have a
[02:48] huge context window,
[02:50] you should still not feed the entire
[02:53] knowledge because it will be too many
[02:56] tokens, it will be costly. So, what
[02:58] people do is they will chunk this
[03:01] document. So, they will create, let's
[03:02] say, as a basic strategy, fixed-size
[03:06] chunks. And then, for a given question,
[03:09] you can pull the relevant chunks. So,
[03:11] for this particular question, let's say
[03:13] my first chunk has a 70% probability
[03:17] that it will contain the answer.
[03:19] The second chunk is a 60% match. And you can
[03:21] have, by the way, I'm showing just
[03:22] three, but you can have 1,000 chunks,
[03:24] and some of the chunks might have 5% or
[03:27] even 10% possibility that they may contain
[03:30] the answer. Let's say a chunk contains
[03:33] information on when Atliq was
[03:36] founded, its culture, founders, etc.; then
[03:39] that doesn't have anything to do with the
[03:41] retirement question that you are asking,
[03:43] okay? So, the relevance of that
[03:46] chunk will be very, very low. Now, how
[03:48] do you exactly find this kind of
[03:50] similarity? So, there is this concept of
[03:53] embeddings, okay? So, embedding is a
[03:55] process of converting text into a vector
[03:59] such that it can represent its meaning,
[04:02] okay? So, all the chunks, you will
[04:04] convert them into vector embeddings, and
[04:08] then you will store them into a vector
[04:10] database. This is different than your
[04:12] regular database. Your regular database
[04:15] can search using exact values, whereas
[04:17] vector database will be able to search
[04:20] using the meaning. So, when you search
[04:22] for, let's say, a company that is a
[04:25] leader in electric vehicles, it will
[04:29] return Tesla from the database. So,
[04:31] it is searching based on the meaning,
[04:33] not based on the exact word. To generate
[04:36] embeddings, you can use a variety of
[04:38] models: Sentence Transformers,
[04:40] text-embedding-3-small, and so on. And
[04:42] there are many vector database choices
[04:44] that you have in the market: Milvus,
[04:46] Qdrant, Chroma DB, and so on. This
[04:48] step is called indexing. This is the
[04:51] first step in the RAG process, where you are
[04:54] indexing all these vectors of chunks
[04:57] into a vector database. The second step
[04:59] is retrieval, where for a given
[05:01] question, you will generate an embedding
[05:03] using the same embedding model. Then,
[05:05] you will try to find the relevant chunks
[05:08] in a vector database. So, here it is
[05:10] doing the semantic search, giving you the
[05:13] relevant vectors. You can specify a top-K
[05:16] value, let's say I need two chunks or
[05:18] five chunks, and so on. And then, you
[05:20] will generate the
[05:22] actual text out of those chunks, and you
[05:25] will put it in your prompt along with
[05:28] the question. And when the question is
[05:30] given to the LLM, it will give you the
[05:32] answer. So, here, below the question,
[05:35] what you are doing is you are providing
[05:37] only the relevant chunks. So, this way
[05:39] the LLM will not hallucinate, and it will
[05:42] give you an accurate answer. That takes us
[05:45] into our next segment, which is the two
[05:47] major benefits of RAG. The first one is
[05:50] the answers that you get will be highly
[05:53] accurate, and the chances of
[05:56] hallucination will reduce because you
[05:59] are grounding your responses in the
[06:01] knowledge, in the source of truth.
[06:03] Second, it is very cost-effective
[06:05] because if you pass the entire context,
[06:07] then you are sending too many tokens to
[06:10] the LLM, and these LLM APIs charge you
[06:13] per token. So, if you send fewer
[06:15] tokens, only the relevant knowledge,
[06:17] then you will save money on your API
[06:19] bill. Here is a hands-on customer care
[06:22] assistant RAG project. I have given the
[06:24] code in the video description below. You
[06:26] can ask different questions, for
[06:27] example, why is my mobile internet slow?
[06:30] And it will find the answer based on the
[06:33] knowledge that it has. So, the knowledge
[06:35] is stored in the form of a
[06:37] troubleshooting PDF file. So, here is
[06:39] the PDF file, and let's say you have
[06:42] this question on how to
[06:44] enable LTE, then it is pulling that
[06:47] answer from this particular PDF file.
[06:50] The other source is the CSV file
[06:53] containing all the FAQs. And the third
[06:56] source is a SQLite database containing
[06:59] all the past tickets. Here we are using
[07:01] Chroma DB as the vector database. So, we are
[07:03] ingesting FAQs, then PDF, and tickets
[07:07] into Chroma DB, okay? So, these are the
[07:10] three files which are ingesting into a
[07:12] vector database. If you look at ingest
[07:14] PDF, here we are using the chunk size of
[07:17] 600, overlap of 100. We are using the
[07:19] recursive character text splitter
[07:22] strategy. And for embedding, we are
[07:24] using this particular embedding model
[07:26] from Hugging Face. As a framework, we
[07:28] have used LangChain. Now, the retriever
[07:31] will try to find relevant chunks from
[07:34] FAQ, tickets, or guides. In terms of
[07:37] LLM, we are using Qwen from ChatGroq.
[07:40] Please download the project on your
[07:41] computer, try to run it to enhance your
[07:44] understanding of RAG concepts. The telecom
[07:47] support chatbot that we just saw is an
[07:49] example of an enterprise QA chatbot. There
[07:52] are many other industry use cases for
[07:55] RAG. For example, you can build a medical
[07:57] knowledge assistant, which can look at
[08:00] the vast amount of medical knowledge and
[08:02] pull a relevant answer for your query.
[08:05] The other one is legal and compliance
[08:07] tools. Once again, here the knowledge
[08:09] will be your legal documents, and you
[08:12] want to pull the most relevant and
[08:14] accurate answer. HR chatbot is another
[08:16] example. Let's now look at RAG
[08:18] categories. The first one is vector
[08:20] RAG, and naive RAG is the example that
[08:23] we just saw, where you pull the top K
[08:27] relevant chunks from a vector database
[08:29] and answer the user's question. The second
[08:32] category is vectorless RAG, in which you
[08:35] can perform keyword RAG. So, here you
[08:38] are not generating any vector
[08:40] embeddings. You don't have a vector
[08:42] database, but you are using keyword
[08:45] matching techniques like BM25, TF-IDF,
[08:49] etc., to query the document
[08:52] using the exact keywords. This method
[08:55] will work when you have a lot of codes,
[08:58] jargon, IDs, citations. Let's say you
[09:01] are doing research, and you are always
[09:03] searching using some particular ID or a
[09:07] particular keyword, then this will work.
[09:10] This is weak for semantic understanding.
[09:12] When you are not doing exact keyword
[09:14] matching and searching using meaning,
[09:16] this is not a good choice. And the key
[09:19] tools that use keyword RAG concepts
[09:21] like BM25 are Elasticsearch and Apache
[09:25] Solr. The next category in vector RAG is
[09:28] hybrid RAG, where you're combining
[09:31] vector search and keyword search, okay?
[09:34] You do both of these in parallel and
[09:37] merge the results. This is best for most
[09:39] of the production systems. The key tools
[09:42] here are Elasticsearch plus any vector
[09:45] DB. Now, at Atliq, we worked on one RAG
[09:49] project for our client, where we
[09:51] developed our own custom hybrid method
[09:55] for doing RAG, and we have given details
[09:57] of this approach in a different video.
[10:00] You can check it out if you are
[10:01] interested. Also, if you want to learn
[10:03] AI engineering by building production
[10:05] grade AI systems similar to the projects
[10:07] that I just mentioned, then check our AI
[10:10] engineering cohort where we have live
[10:13] sessions on weekends and we will teach
[10:16] you all the concepts plus you will build
[10:18] eight plus production grade projects.
[10:20] The next category in vectorless RAG is
[10:22] graph RAG. It is also known as KG RAG.
[10:27] So here you will generate a knowledge
[10:29] graph. So let's say your knowledge is
[10:31] Elon Musk and all the companies he has
[10:33] founded. So in that case you will build
[10:35] this kind of knowledge graph where you
[10:37] will say Elon Musk founded Tesla,
[10:39] SpaceX, Neuralink, OpenAI and so on.
[10:42] And then these companies will be
[10:43] operating in these different domains. So
[10:45] these are all the entities and they are
[10:47] connected through some kind of
[10:49] relationship. Now when you ask a
[10:51] question, which companies are founded by
[10:53] Elon Musk which are working in AI, you
[10:56] will traverse this particular path,
[10:58] okay? So you will look at all the
[11:00] companies and then you will do a
[11:02] breadth-first traversal and you will find that
[11:05] OpenAI is working in AI. The next one is
[11:08] SQL RAG. This is also known as text-to-SQL.
[11:11] This method is very simple. Let's
[11:13] say you have a sales database which
[11:15] contains the sales of
[11:18] products. Now you are asking this
[11:19] question, which product sold the most
[11:21] last month? Using an LLM you can first
[11:24] generate a query for that database. You
[11:26] will execute the query, get the results
[11:29] and then give them back to the LLM to generate
[11:31] a comprehensive answer. Very simple
[11:33] technique. You are taking a sentence in
[11:36] natural language, converting it to SQL
[11:38] using an LLM and running the query on your
[11:41] database to get the results. And the
[11:43] last method, which is relatively new, is
[11:47] called Page Index. It is reasoning-based
[11:50] RAG. So here let's say you have a
[11:53] 3,000-page PDF document. First you will
[11:56] generate the table of contents, okay? The
[11:59] table of contents or your information
[12:02] structure. This is like you are having a
[12:04] book and you are having all the chapter
[12:06] and topic layout. Now when somebody asks
[12:08] this question, what does the contract
[12:09] say about compensation for breach of
[12:11] contract, the LLM will use its reasoning
[12:15] capability and this particular table of
[12:18] contents to traverse this particular
[12:22] graph and locate the thing that it is
[12:25] looking for. So for example, in this
[12:28] case it will first find out that this is
[12:30] related to performance of contracts
[12:32] because the contract is already
[12:33] executed. So it has to be related to
[12:35] this and then it finds compensation of
[12:38] breach. So it goes from here to here and
[12:40] then
[12:41] you are discussing loss. So due to that,
[12:44] out of all these nodes, it will
[12:46] go to this particular node and it will
[12:49] pull the relevant document. Now this
[12:53] might give you an index and using index
[12:55] you might have to refer back to the
[12:57] original knowledge. So here is the
[12:59] GitHub for Page Index. It is known as
[13:02] vectorless RAG, but it is one of the
[13:04] categories of vectorless RAG, okay? The
[13:07] right term is reasoning-based RAG. So
[13:10] here you can see from document you are
[13:11] generating a tree, which is your
[13:13] knowledge tree structure index of
[13:16] documents and then LLM will do its
[13:18] reasoning to find the relevant chunk.
[13:21] Here you are not using any vectors. You
[13:22] are not using any
[13:24] embeddings. No vector DB. Just by
[13:26] looking at the structure, you know,
[13:30] the table layout, which looks something
[13:32] like this,
[13:33] you will try to find a given node, okay?
[13:36] And see here there is a summary. So
[13:38] using the summary, LLM can reason and it
[13:41] can say, "Okay, maybe the answer is in
[13:44] this particular node." And then it will
[13:46] go to that node, refer to the original
[13:48] document and pull the answer. I have
[13:50] attached this PDF in the video
[13:51] description below where you have
[13:53] categories of RAG. You also have a table
[13:56] comparing when to use what. It is not
[13:59] that
[14:00] reasoning RAG is here so you should not
[14:02] use naive RAG. You should use it when
[14:05] you have general text Q&A bots, etc. And
[14:08] the complexity here is low. The
[14:10] complexity in the case of Page Index is
[14:12] high. You should use it when you have
[14:14] hierarchical tree index LLM traversal.
[14:18] You know, these are the use cases. So
[14:19] you can use this table to determine when
[14:21] to use what kind of RAG. And at the end
[14:24] we have RAG interview questions. All
[14:26] right, folks. So please check it out. If
[14:28] you have any questions, post them in the
[14:29] comment box below.
[14:33] >> [music]