[0:00] For a long time after the initial AI boom, 
most chatbots were great for general questions,  
[0:05] but always fell short on current 
events or developments. However,  
[0:10] that has changed completely 
with the introduction of RAG.
[0:14] And here’s the kicker: if you’ve 
used ChatGPT, Gemini, Claude,  
[0:18] or Perplexity recently, you were probably 
already using it without even knowing.
[0:23] I’m Mantvydas from Oxylabs, and in 
this video, I’ll explain what RAG is,  
[0:27] how it works, and why it makes today’s AI 
systems much more reliable in real-world use.
[0:35] Before diving into definitions, we 
need to start with a simple problem.
[0:39] On their own, large language models don’t actually 
“look things up”. They generate answers based on  
[0:44] patterns learned during training. This creates 
three big limitations: their knowledge can be  
[0:51] outdated, they can hallucinate when unsure, 
and they can’t see any proprietary data.
[0:56] So if you ask a base model a question it 
can’t answer, it often won’t say “I don’t know”.  
[1:01] Instead, it will try to guess. RAG exists to fix 
this shortcoming, not by changing the model,  
[1:07] but by changing what the 
model sees before it answers.
[1:11] So what exactly is RAG? This acronym 
stands for Retrieval-Augmented Generation. 
[1:18] Retrieval is the keyword here. Instead of 
asking a language model to answer only from  
[1:22] its training data, you first “retrieve” 
relevant information – such as documents,  
[1:28] files, or web data – and then let the 
model “generate” an answer with all that  
[1:33] context in front of it. In other words, 
RAG means “Find first, then generate”.
[1:40] Now, to get a better feel for how that works, 
let’s look at a basic flow of a RAG system.  
[1:45] This is the core idea, and it stays pretty 
much the same in almost every implementation.
[1:51] First, the user asks a question. Next, the system 
searches for relevant information in its retrieval  
[1:57] layer. Then, the retrieved information is 
assembled into context – basically added to  
[2:02] the prompt. Only after that does the language 
model itself get involved. And finally,  
[2:08] the model generates an answer 
grounded in what it was given.
[2:12] This is the important bit: in this basic flow, the 
language model never goes out to  
[2:16] search on its own. All the “looking things 
up” happens before the model is even called.
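To make that flow concrete, here is a minimal Python sketch of the five steps. It is only an illustration under stated assumptions: the names DOCUMENTS, retrieve, build_prompt, and fake_llm are made up for this example, the retriever is a toy word-overlap ranker, and fake_llm is a stand-in that never calls a real model.

```python
# Toy RAG flow: question -> retrieve -> assemble context -> generate.
# Every name here is hypothetical; fake_llm stands in for a real model call.

DOCUMENTS = [
    "The return window for online orders is 30 days.",
    "Standard shipping takes 3 to 5 business days.",
    "Support is available by email and live chat.",
]

def retrieve(question, documents, top_k=1):
    """Rank documents by how many words they share with the question."""
    q_words = set(question.lower().replace("?", "").split())
    scored = []
    for doc in documents:
        d_words = set(doc.lower().replace(".", "").split())
        scored.append((len(q_words & d_words), doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Drop documents that share no words with the question.
    return [doc for score, doc in scored[:top_k] if score > 0]

def build_prompt(question, context_docs):
    """Add the retrieved text to the prompt as context."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

def fake_llm(prompt):
    """Placeholder model: returns the first context line, if there is one."""
    for line in prompt.splitlines():
        if line.startswith("- "):
            return line[2:]
    return "I don't know."

def answer(question):
    docs = retrieve(question, DOCUMENTS)      # all retrieval happens here
    prompt = build_prompt(question, docs)     # context is assembled
    return fake_llm(prompt)                   # only now is the model called

print(answer("How long does shipping take?"))
```

Notice that the model function only ever receives a finished prompt. In a real system you would swap fake_llm for an actual model call and the word-overlap ranker for a proper retrieval layer.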
[2:22] Now let’s zoom into the most important part 
of RAG – the Retrieval Layer. This is where  
[2:27] most of the flexibility and the power comes from.
[2:31] Most RAG systems start by searching 
internal sources. That could be PDFs,  
[2:36] knowledge bases, company documents, or 
structured datasets. A RAG system typically uses  
[2:41] vector embeddings to find the most relevant pieces 
of text based on meaning, not just keywords.
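Here is a rough sketch of what searching by meaning can look like, using cosine similarity. The three-number vectors below are hand-made for illustration; in practice they would come from an embedding model, and the titles, numbers, and function names are all assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made "embeddings" (assumption). Think of the three axes as
# roughly: refunds, shipping, accounts.
INDEX = {
    "How to get your money back": [0.9, 0.1, 0.0],
    "Delivery times by region": [0.1, 0.9, 0.0],
    "Resetting your password": [0.0, 0.1, 0.9],
}

def search(query_vector, index, top_k=1):
    """Return the top_k texts whose vectors are closest to the query."""
    ranked = sorted(
        index.items(),
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _vector in ranked[:top_k]]

# Pretend this is the embedding of the query "refund policy".
query = [0.8, 0.2, 0.1]
print(search(query, INDEX))
```

The query "refund policy" shares no words with "How to get your money back", yet that entry ranks first because the vectors point in a similar direction. That is the difference between matching on meaning and matching on keywords.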
[2:47] However, the retrieval can go even further. 
Instead of stopping at internal data,  
[2:51] some systems also pull in external sources, 
like public databases or even live web data.  
[2:57] For example, something like the Oxylabs Web 
Scraper API can be plugged into the Retrieval  
[3:02] Layer to fetch real-time information from the 
web, such as e-commerce data or SERP results.
[3:09] Whichever path your flow takes, that data doesn’t 
magically update the model. It’s simply retrieved,  
[3:14] cleaned, and then injected as context before 
generation. So the model still does what it’s  
[3:19] best at – reasoning and language – but now 
it’s working with up-to-date information.
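The "cleaned" step is easy to overlook, so here is a small sketch of it using only Python's standard library. It assumes the retrieval layer has already returned raw HTML; the sample page, the TextExtractor class, and clean_html are hypothetical names for this example, and no real scraping API is called.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text and skips script and style contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def clean_html(raw_html):
    """Turn raw HTML into plain text that can be injected as context."""
    parser = TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return " ".join(parser.parts)

# Made-up page standing in for whatever the retrieval layer fetched.
raw = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<h1>Price update</h1><p>Widget now costs $19.</p>"
    "<script>track()</script></body></html>"
)
print(clean_html(raw))
```

The cleaned text is what goes into the prompt. The model itself is untouched; it simply gets fresher input.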
[3:25] Just keep in mind: all this 
retrieval happens outside the model,  
[3:28] and whatever comes back is 
passed into the main pipeline.
[3:31] Once that’s done, the system combines the 
retrieved context with the user query and the system’s  
[3:36] instructions into a single upgraded 
input prompt. This is why RAG works:  
[3:42] not because models are smarter, but 
because their inputs are better.
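As a sketch of that assembly step, the function below joins system instructions, retrieved chunks, and the user query into one prompt string. The function name, the numbered-source layout, and the character budget are assumptions made for illustration, not a fixed standard.

```python
def assemble_prompt(system_instructions, retrieved_chunks, user_query,
                    max_context_chars=500):
    """Build one prompt from instructions, retrieved context, and the query."""
    context_lines = []
    used = 0
    for number, chunk in enumerate(retrieved_chunks, start=1):
        line = f"[{number}] {chunk.strip()}"
        # Stop adding chunks once the context budget would be exceeded.
        if used + len(line) > max_context_chars:
            break
        context_lines.append(line)
        used += len(line)
    return "\n\n".join([
        system_instructions.strip(),
        "Context:\n" + "\n".join(context_lines),
        "Question: " + user_query.strip(),
    ])

prompt = assemble_prompt(
    "Answer only from the context.",
    ["  Chunk A. ", "Chunk B."],
    "What is A?",
)
print(prompt)
```

Numbering the chunks makes it possible to ask the model to cite its sources, and the budget is a crude way to keep the prompt from growing without limit.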
[3:46] And the best part: this same 
process can be used for many  
[3:50] applications, such as customer support bots, 
product assistants, research tools,  
[3:54] or internal search systems. The logic stays 
the same. The only thing that changes is the data.
[4:00] So why does all of this matter? RAG gives 
you up-to-date answers, fewer hallucinations,  
[4:05] access to private or proprietary data, and 
more control over where information comes from.
[4:11] And maybe most importantly, you don’t need to 
retrain the model. If your data changes, you just  
[4:17] update the source or change what gets retrieved. 
That makes RAG much faster, far cheaper,  
[4:25] and easier to maintain than most alternatives, 
such as LLM fine-tuning or full model retraining.
[4:31] So if you take one thing away from this video, 
let it be this: RAG isn’t a futuristic technique.  
[4:37] It’s how the best AI systems already work 
today. Retrieval first, generation second.
[4:43] Remember – when ChatGPT searches the 
web, when you upload a PDF to Claude,  
[4:47] or when Perplexity shows citations – 
that’s Retrieval-Augmented Generation  
[4:52] in action. Once you see that, a lot of 
modern AI suddenly makes much more sense.
[4:58] Wanna learn how to build your own 
RAG chatbot with OpenAI and web  
[5:02] scraping? We have a whole step-by-step 
guide linked in the description below.
[5:06] Or if you have any questions on how 
data can improve your AI workflows,  
[5:11] email us at support@oxylabs.io or write 
to us via the live chat on our homepage.
[5:17] Also, don’t forget to subscribe to our 
channel for more videos like this one  
[5:22] and consider joining our Discord community, 
where we discuss all things web scraping.
[5:27] Thank you for watching, and 
see you in the next one.