[0:00] What if your AI could go beyond its training data and actually learn from your documents? No guessing, no hallucination, just real answers pulled directly from your own files. That's RAG.

[0:11] Regular AI doesn't know your company docs, your research papers, or your private notes. It can search the web, but it can't search your files. That's exactly what RAG, retrieval-augmented generation, solves.

[0:21] But here is what most people miss: RAG is a system, but learning how to calibrate it is a skill. How do you chunk your data? What similarity threshold do you set? These decisions determine whether RAG works well or fails.

[0:32] So, today I will show you exactly how a RAG system works end to end. We're going to break it down into three simple steps.

[0:40] First, we'll build a complete local RAG system from scratch using a tool called AnythingLLM. We'll start from the very beginning: installing Ollama, uploading the documents, and getting everything running locally.

[0:51] Second, we'll look under the hood. As we go, I'll explain concepts like chunking, embeddings, and vector databases in a practical way.

[0:59] Finally, I'll walk you through the settings that control RAG performance, things like chunk size, similarity threshold, and temperature, so you know how to tune it for your own use cases.

[1:08] By the end of this video, you won't just have a working RAG setup, you'll actually understand how the end-to-end system works behind the scenes. If that sounds interesting, let's jump straight in.

[1:20] All right. Before we start installing anything, let me quickly explain what we're building here.
[1:22] We need two things to make RAG work locally. First, Ollama. This runs the AI model on your machine. Think of it as the brain that generates the responses.

[1:33] Second, AnythingLLM. This is the RAG platform. It handles everything: splitting your documents into chunks, converting them into searchable embeddings, storing them in a vector database, and retrieving the relevant pieces when you ask a question.

[1:46] In a nutshell, Ollama provides the intelligence and AnythingLLM provides the memory. Together, you get an AI that actually reads your own documents and answers from them. Let's start with Ollama.

[1:56] Great. Let's check our system first. I'm on Windows 11 with 32 GB of RAM. You don't need this much; 16 GB works too. If you're on Linux or Mac, it's the same process, nothing changes.

[2:05] The first thing we need is Ollama. This is where we run the AI models locally on your machine. Let's head over to ollama.com, go to the download page, and grab the installer for Windows. It's one command in PowerShell: paste it and let it download.

[2:22] Once Ollama is installed, let's verify the Ollama version, and we're good.

[2:26] Now we need a model, so let's pull Llama 3. This is one of Meta's Llama models, around 4.7 GB for the 8-billion-parameter version. Good balance of speed and quality. Let it download.

[2:39] We can verify using the ollama list command, and there it is: Llama 3 is ready to go.

[2:44] Quick test: ollama run, then the model name, then the message. And you can see a response coming from our local LLM.

[2:51] Perfect. Now, if you check the API part, it is running on port 11434. This is what AnythingLLM will connect to. So, Ollama is ready. Let's move on.

[3:00] Great.
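To make the "API on port 11434" remark concrete, here is a minimal Python sketch of talking to the local Ollama server over HTTP. It assumes the default address `http://localhost:11434`, the model tag `llama3`, and Ollama's `/api/generate` endpoint with streaming turned off; the helper names `build_generate_request` and `ask_ollama` are made up for this illustration and are not part of Ollama or AnythingLLM.

```python
import json
import urllib.request

# Assumption: Ollama is listening on its default local address.
OLLAMA_URL = "http://localhost:11434"


def build_generate_request(prompt, model="llama3"):
    """Build (but do not send) a POST request for Ollama's /api/generate."""
    payload = {
        "model": model,    # model tag previously pulled with Ollama
        "prompt": prompt,  # the message to send
        "stream": False,   # ask for one JSON object instead of a token stream
    }
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_ollama(prompt, model="llama3"):
    """Send the prompt to the local server and return the generated text."""
    request = build_generate_request(prompt, model)
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read())["response"]
```

With the server running, `ask_ollama("Say hello in one sentence.")` should return the model's reply as a string. A RAG platform pointed at the same port is presumably doing a richer version of this call.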
Next, we'll see how to set up AnythingLLM. This is the RAG platform that handles everything.

[3:05] Let's head to anythingllm.com and download the desktop app. Click download and run the installer.

[3:12] You'll notice it is downloading the CUDA libraries. That's Nvidia's GPU acceleration. If you have an Nvidia graphics card, it will use it for faster processing. If you don't, no worries, it will automatically fall back to CPU.

[3:23] Now it is asking to download the meeting assistant model. We don't need this because we are using Ollama for the AI models. Click no and proceed. The installation is now complete.

[3:31] Let's launch AnythingLLM. It opens with a beautiful setup dashboard, where we need to configure a couple of settings. Click on Get Started and follow the wizard.

[3:42] Next, you will see it recommends downloading its own model, which is currently Qwen3. Since we're going with a different LLM, let's choose the manual setup.

[3:50] And here is the interesting part. Let's choose Ollama, and it will auto-detect localhost 11434 and find our local model. Right? Perfect.

[3:59] Now, here is a quick summary, which is important to understand. The LLM provider is Ollama, which runs locally on our machine. Then the embedder, which is the AnythingLLM Embedder, also runs locally. And finally, the vector database, LanceDB, runs locally. So, essentially, everything stays on our machine: no cloud, no data leaving our computer. A truly private AI setup.

[4:24] Finally, skip the survey, and dismiss the desktop assistant for now. We are in. AnythingLLM is set up, and we are ready to see RAG in action.

[4:31] All right. First, let's do a quick test.
I'm going to start with something simple, a basic TXT file with a few lines of information.

[4:39] I created this text file with fake company details: Acme Corp, CEO John Smith, founded in 2019, headquartered in Austin, Texas, a product called Widget Pro, and revenue of 5 million.

[4:51] Now, here is the thing. This is completely made up. These specific details won't exist on the internet, so the AI can't know this information accurately on its own.

[5:01] So, we'll upload this file. Click on upload document, select our file, and you will see it added as context. The document is now part of this workspace.

[5:09] Now, the real test. Let's ask what the document is all about, a kind of quick summary. And look at that response. It correctly identifies Acme Corp, founded in 2019, Austin, Texas, Widget Pro, the revenue, everything. Remember, this is made-up information. The AI only knows this because it actually read it from our document.

[5:26] Let's try another one. Who is the CEO? John Smith. Exactly right.

[5:31] Now, here is the key part. Click on the citation, and look at this: a reference to rag_basic_doc.txt. It's showing exactly which document it used to answer the question.

[5:42] So, this is RAG working. The AI did not guess, it did not hallucinate. It searched our document, found the relevant information, and answered based on what it actually found. That's retrieval-augmented generation in action. Retrieval: it searches the document. Augmented: it adds that information to the prompt. Generation: it generates the response based on that context.

[6:03] Now, that's a simple TXT file. Great. Let's go with something more complex. I have a 30-page technical guide here, a PDF.
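Before moving on to the PDF, the retrieval, augmented, generation breakdown from the text-file test can be sketched in a few lines of Python. This is a toy under stated assumptions: the sentences in `DOCUMENT_LINES` are a reconstruction of the fake Acme Corp file, retrieval is plain word overlap instead of the embedding search a real RAG platform uses, and the generation step is left out (the finished prompt would be handed to the LLM).

```python
import re

# Assumption: a reconstruction of the made-up company file from the demo.
DOCUMENT_LINES = [
    "Acme Corp was founded in 2019.",
    "The CEO of Acme Corp is John Smith.",
    "Acme Corp is headquartered in Austin, Texas.",
    "The main product is Widget Pro.",
    "Annual revenue is 5 million.",
]


def tokenize(text):
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, lines, top_k=2):
    """Retrieval: rank lines by how many words they share with the question."""
    question_words = tokenize(question)
    ranked = sorted(
        lines,
        key=lambda line: len(question_words & tokenize(line)),
        reverse=True,
    )
    return ranked[:top_k]


def augment(question, context_lines):
    """Augmentation: put the retrieved lines into the prompt as context."""
    context = "\n".join(f"- {line}" for line in context_lines)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


question = "Who is the CEO?"
context = retrieve(question, DOCUMENT_LINES)
prompt = augment(question, context)  # Generation would send this to the LLM.
```

The model never needs to "know" Acme Corp. The sentence naming John Smith is already sitting in the prompt by the time generation starts.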
Let's see how RAG handles real documents.

[6:12] First, we'll create a new workspace to keep things organized. Click on the plus and call it GitHub docs.

[6:19] Now, let's upload our PDF. Click upload document, select the file. There it is. Click save and embed.

[6:27] Now it's processing. It takes a moment for larger files, and there is a reason for that. Something important is happening behind the scenes.

[6:33] So, let's see what actually happened. Go to the workspace settings, the vector database tab, and look at the numbers. Vector count is 21. What does that mean? Our 30-page PDF became 21 vectors. In other words, the document has been split into 21 smaller pieces. These are called chunks. Each chunk is roughly a paragraph or a section.

[6:57] Why do this? Because when you ask a question, you don't need the entire 30-page document. You just need the specific paragraph that answers your question.

[7:06] So, instead of searching through 30 pages every time, which would be slow, the system searches these 21 chunks and finds just the relevant ones. Much faster, much more precise.

[7:18] All right. Let's test it out. What is OpenClaw? How do I install it? I'm just using the README file from the GitHub document of our previous video. And look at this response: an exact description of what OpenClaw is, plus the installation commands. Multiple methods, curl commands, Node.js setup, all pulled directly from the PDF.

[7:38] Let's check the citation for reference. It searched through 21 chunks, found the four most relevant ones, and built the answer from those. This is the power of RAG. It is not guessing, and it is not using general knowledge.
It's finding the specific information in your document and answering based on exactly what it found.

[7:59] Now, if you look at the system resources, I see 50% utilization on both RAM and CPU. I don't have a GPU, and the network shows zeros. That means there is no cloud access, and it's completely private.

[8:14] All right. Now we understand how RAG works. But remember what I said at the beginning: RAG is a system, and learning how to calibrate it is a skill. Let me show you what I mean, and the settings that control everything. Once you understand these, you can tune RAG for your own specific use cases.

[8:30] First, the text splitter and chunking. Go to the settings, then AI providers, then text splitter and chunking. The chunk size is 1,000 characters, roughly a paragraph. This is how big each searchable piece is. The overlap is 20 characters. This means each chunk slightly overlaps with the next one, so you don't lose context at the boundaries.

[8:52] Now, why does it matter? Smaller chunks give you more precise search results. You find exactly the sentence you need, but you might lose the surrounding context. Larger chunks keep more context together, but search becomes less precise.

[9:07] So, here is the thing. Different documents need different settings. Legal documents with long structured paragraphs might need larger chunks. Customer support transcripts might work better with smaller chunks. This is one of the calibration decisions that affect how RAG works for your specific data.

[9:25] Moving on, next is the embedder. This is where the text becomes searchable. We are using AnythingLLM's built-in embedder, all-MiniLM-L6-v2.
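The chunk size and overlap settings described above can be sketched as a fixed-width splitter. This is a simplification and an assumption about the mechanics: `split_into_chunks` is a hypothetical helper, and a production splitter will usually try to break on paragraph or sentence boundaries rather than cutting at an exact character count.

```python
def split_into_chunks(text, chunk_size=1000, overlap=20):
    """Split text into chunks of chunk_size characters.

    Each chunk starts `overlap` characters before the previous one ended,
    so text near a boundary appears in both neighbouring chunks.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")

    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reached the end of the text
    return chunks
```

For a 2,500-character text with the defaults this gives three chunks, and the last 20 characters of one chunk are the first 20 of the next. Raising `chunk_size` means fewer, broader chunks; lowering it means more, narrower ones, which is the trade-off between context and precision.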
[9:35] So, what does this actually do? Computers can't search text by meaning. They search numbers. So, this model converts each chunk into a long list of numbers called an embedding. And here is the magic: similar meanings produce similar numbers.

[9:50] Let me explain with an example. Phrases like "dogs allowed" and "pets permitted" use completely different words but have the same meaning. Their embeddings would be very close to each other. But "dogs allowed" and "remote work policy" are totally unrelated. Their embeddings would be far apart.

[10:09] This is how the search finds relevant content even when you don't use the exact keywords from the document. It's matching the meaning, not just the words.

[10:18] You could use other embedding providers like OpenAI, Azure, or Ollama. But the built-in one runs completely locally and works great.

[10:26] Next is the vector database. This is where all these embeddings get stored. We are using LanceDB. It's built into AnythingLLM, runs 100% locally, and needs zero configuration. You could use Chroma or Pinecone or others, but for private use, LanceDB is perfect.

[10:44] Think of it like a pre-processed index. Instead of searching raw text every time, which is slow, we search these numbers, which is almost instant.

[10:54] Now, let's look at the workspace-level settings. Similarity threshold: this controls how closely a chunk must match your question to be included in the results.

[11:05] No restriction means everything is considered. Low is a 25% match or higher, so you'll get results, but some might be loosely related. Medium is 50%. High is 75%: very strict, only the most relevant chunks make it through.
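The idea of "close" and "far apart" embeddings, and of a threshold that filters on closeness, can be shown with a small Python sketch. Everything numeric here is an assumption for illustration: the three-dimensional vectors are hand-made rather than real output from all-MiniLM-L6-v2 (which produces much longer vectors), and cosine similarity is used as a stand-in for however the platform actually scores a match.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made toy vectors, NOT real embedding-model output.
TOY_EMBEDDINGS = {
    "dogs allowed": [0.9, 0.1, 0.0],
    "pets permitted": [0.8, 0.2, 0.1],
    "remote work policy": [0.0, 0.2, 0.9],
}

# Threshold levels as described in the walkthrough; None means no restriction.
THRESHOLDS = {"none": None, "low": 0.25, "medium": 0.50, "high": 0.75}


def filter_chunks(query_vector, chunks, level):
    """Keep only the chunks whose similarity to the query meets the threshold."""
    cutoff = THRESHOLDS[level]
    scored = [
        (text, cosine_similarity(query_vector, vector))
        for text, vector in chunks.items()
    ]
    if cutoff is None:
        return scored
    return [(text, score) for text, score in scored if score >= cutoff]
```

Using "dogs allowed" as the query, "pets permitted" scores close to 1 and survives even the high threshold, while "remote work policy" scores near 0 and only gets through when there is no restriction.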
[11:24] So, this is how it works. If you are getting irrelevant information in your answers, bump it up. If you are getting "no information found" when you know it exists in the document, lower it.

[11:35] The search preference default is fastest. Accuracy optimized takes longer, but finds better matches.

[11:44] Now, max context snippets. How many chunks get sent to the AI per question? The default is four. Remember, our PDF had 21 chunks. When you ask a question, the system finds the top four most similar chunks and passes those to the AI.

[12:01] More chunks means broader context, but slower responses and more tokens used. Fewer chunks is faster, but might miss relevant information. Four is usually a good balance.

[12:14] Temperature. This controls how creative the AI gets. Zero means deterministic: the same question gives the same answer every time. Very factual, very focused. One means creative: more varied, more random responses.

[12:29] For RAG, you generally want a lower temperature, maybe 0.3 to 0.5. You want the AI to stick to what's actually in the document, not get creative and start making things up. If you are seeing hallucination even with RAG, try lowering this.

[12:47] And finally, the system prompt. These are the instructions that the AI follows for every response. The default works fine, but you can customize it. For example, add something like "Always cite your sources," and "If you don't find the relevant information in the documents, say so instead of guessing." This makes the AI much more reliable.

[13:07] These settings together determine how well the RAG system performs.

[13:12] Great, that's it.
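As a recap of the last three settings, here is a rough Python sketch of how max context snippets, temperature, and the system prompt might come together in a single request to the model. It is an assumption about the general shape of such a pipeline, not a description of AnythingLLM's internals: `top_k_chunks` and `build_payload` are hypothetical helpers, and the payload follows the field names of Ollama's `/api/generate` endpoint (`system`, `prompt`, `options.temperature`).

```python
# Example system prompt built from the two instructions suggested above.
SYSTEM_PROMPT = (
    "Always cite your sources. If you don't find the relevant information "
    "in the documents, say so instead of guessing."
)


def top_k_chunks(scored_chunks, k=4):
    """Return the text of the k highest-scoring (text, score) pairs."""
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in ranked[:k]]


def build_payload(question, scored_chunks, k=4, temperature=0.3, model="llama3"):
    """Assemble a request body: top-k context, low temperature, system prompt."""
    context = "\n\n".join(top_k_chunks(scored_chunks, k))
    return {
        "model": model,
        "system": SYSTEM_PROMPT,
        "prompt": f"Context:\n{context}\n\nQuestion: {question}",
        "stream": False,
        "options": {"temperature": temperature},  # lower = less random output
    }
```

Raising `k` widens the context at the cost of a longer prompt, and raising `temperature` makes answers more varied, which mirrors the trade-offs described for each setting.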
Now we have a fully working local RAG system running on your machine with complete privacy. And more importantly, you understand the core pipeline and the key settings needed to tune it for your own data.

[13:24] In the next video, we'll explore Open WebUI, a beautiful ChatGPT-like interface that adds RAG, chat history, and multi-model workflows.

[13:33] All the commands and the configuration for this video are on GitHub. Link in the description.

[13:39] If this helped, hit subscribe, and drop a comment if you got RAG working. And if you're stuck somewhere, let me know.

[13:46] I'll see you in the next one. Take care.