[0:00] So, what if your AI go beyond is
[0:01] cleaning data and actually learn from
[0:03] your documents? No guessing, no
[0:05] hallucination, just real answer pulled
[0:08] directly from your own file. That's rag.
[0:11] So, regular AI doesn't know your company
[0:12] docs, your research paper, or your
[0:14] private notes. It can search the web,
[0:16] but it can't search your files. That's
[0:18] exactly what rag, retrieval augmented
[0:21] generation, solves. But here is what
[0:23] most people miss. Rag is a system, but
[0:25] learning how to calibrate is a skill.
[0:27] So, how you chunk your data, what
[0:29] similarity threshold would you set? This
[0:31] decision determine whether the rags
[0:32] works well or fails. So, today I will
[0:35] show you exactly how the rag system
[0:36] works [music] end to end. We're going to
[0:38] break it down to three simple steps.
[0:40] First, we'll build a complete local rag
[0:42] system from scratch using a tool called
[0:44] Anything LLM. So, we'll start from the
[0:46] very beginning, installing Ollama,
[0:47] uploading the documents, and getting
[0:49] everything running locally.
[0:51] Second, we'll look at the under the
[0:52] hood. As we go, I'll explain the concept
[0:54] like chunking, embedding, and vector
[0:56] database the more practical way. And
[0:59] finally, I'll walk you through the
[1:01] setting that control the rag
[1:02] performance. Things like chunk size,
[1:04] similarity threshold, and temperature.
[1:05] [music]
[1:06] So, you know how to tune it for your own
[1:08] use cases. So, by the end of this video,
[1:10] you won't just have a working rag setup,
[1:12] you'll actually understand how the
[1:13] end-to-end system works behind the
[1:15] scene. So, if that sounds interesting,
[1:17] let's jump straight in.
[1:20] All right. So, before we start
[1:21] installing anything, let me quickly
[1:22] explain what we're building here. So, we
[1:24] need two things to make the rag work
[1:26] locally. First, Ollama. This runs the AI
[1:28] model on your machine. Think of it like
[1:31] a brain that generate the response.
[1:33] Second, Anything LLM. This is a rag
[1:35] platform. It handles everything.
[1:37] Splitting your documents into chunks,
[1:39] converting them to a searchable
[1:40] embeddings, storing them into a vector
[1:42] database, and retrieving the relevant
[1:44] pieces when you ask a question.
[1:46] So, in a nutshell, Ollama provides
[1:47] intelligence and Anything provides you
[1:49] memory. Together, you get an AI that
[1:51] actually read and answer your own
[1:52] documents. So, let's start with Ollama.
[1:56] Great. Let's check our system first. I'm
[1:57] on Windows 11 with 32 GB of RAM. And you
[2:00] don't need this much, 16 GB works
[2:01] [music] too. So, if you're on Linux or
[2:03] Mac, same process, nothing changes. So,
[2:05] first thing we need is Ollama. This is
[2:07] where we run the AI models locally on
[2:09] your machines. So, let's head over to
[2:10] ollama.com, go to the download page, and
[2:13] grab the installer for Windows. So, one
[2:15] command in PowerShell, paste. So, let it
[2:17] download.
[2:22] Once Ollama is installed, let's verify
[2:24] the Ollama version, and we're good.
[2:26] So, now we need a model. So, let's pull
[2:28] Llama 3. This is a Meta's latest model,
[2:30] around 4.7 gigs for 8 billion parameter
[2:32] versions. Good balance [music] of speed
[2:34] and quality. So, let it download.
[2:39] And we can verify using the Ollama list
[2:41] command, and there it is, Llama 3 is
[2:43] ready to go.
[2:44] Quick test, Ollama run then model name,
[2:47] then put the message. And you can see a
[2:49] response coming from our local LLM.
[2:51] Perfect. Now, if you check the API part,
[2:53] it is running on port 11434. And this is
[2:56] what Anything LLM will connect to. So,
[2:58] Ollama is ready. Let's move on.
[3:00] Great. Next, we'll see how to set up
[3:02] Anything LLM. This is a rag platform
[3:03] that handles everything.
[3:05] So, let's head to anythingllm.com and
[3:07] download the desktop app. Click download
[3:09] and run the installer.
[3:12] You'll notice it is downloading the CUDA
[3:13] libraries. It is Nvidia's GPU
[3:15] acceleration. If you have Nvidia graphic
[3:17] card, it will use the faster processing.
[3:19] If you don't have, no worries, it will
[3:20] automatically fall back to CPU.
[3:23] Now, it is asking to download the
[3:24] meeting assistant model. We don't need
[3:26] this because we are using Ollama for AI
[3:27] models. Click no and proceed. Now, the
[3:30] installation part is completed now.
[3:31] Let's launch Anything LLM. It opened
[3:34] with a beautiful setup dashboard, where
[3:36] we need to configure a couple of
[3:37] settings.
[3:38] So, click on get started and follow the
[3:40] [music] wizard.
[3:42] Next, you will see it is recommended to
[3:43] download its own model, which is QM3
[3:45] currently. Since we go with a different
[3:47] LLM, so let's go with the manual setup.
[3:50] And here is interesting part. Let's
[3:52] choose Ollama, and it will auto detect
[3:54] the localhost 11434, and find our local
[3:57] model. Right? Perfect.
[3:59] Now, here is a quick summary, which is
[4:01] important one to understand. So, LLM
[4:03] provider is Ollama, which runs locally
[4:05] on our machine.
[4:07] Then the embedding, which is Anything
[4:08] LLM Embedder, which also runs locally.
[4:11] And finally, the vector database,
[4:14] LanceDB, runs locally. So, essentially,
[4:16] everything stays on our machine, no
[4:18] cloud, no data leaving our computer
[4:21] scenario here. Kind of a true private AI
[4:23] setup.
[4:24] Finally, skip the survey part, dismiss
[4:26] the desktop assistant for now. We are
[4:28] in. Anything LLM is set up. We are ready
[4:31] to go with the rag in action. All right.
[4:33] First, let's do a quick test. I'm going
[4:34] to start with something simple, a basic
[4:36] TXT file with few lines of information.
[4:39] So, I created this text file with fake
[4:41] company details. Acme Corp, CEO John
[4:43] Smith, founded in 2019, headquartered in
[4:46] Austin Texas.
[4:48] Product called Widget Pro, and revenue
[4:50] is 5 million.
[4:51] Now, here is a thing. This is a
[4:53] completely made-up one. I guess these
[4:55] specific details won't exist on the
[4:56] internet, and AI won't be knowing this
[4:58] information on its own accurately. So,
[5:01] we'll upload this file. Click on upload
[5:03] the document, select our file, and you
[5:05] will see it added a contest. Now, the
[5:07] document is now part of this workspace.
[5:09] Now, the real test. Let's ask what is
[5:11] the document all about, kind of quick
[5:13] summary. And look at that response. It
[5:16] correctly identify Acme Corp, founded in
[5:18] 2019, Austin, Texas, Widget Pro, the
[5:20] revenue, and everything. Remember, this
[5:22] is a made-up information. AI only know
[5:25] this because it's actually read from our
[5:26] document. Let's try another one. Who is
[5:28] the CEO? John Smith. Exactly right. Now,
[5:31] here is a key part. Click on the
[5:33] citation, and look at this,
[5:35] rag_basic_doc.txt
[5:37] reference. It's showing exactly which
[5:40] document it used to answer the question.
[5:42] So, this is the rag working. AI did not
[5:44] guess, it did not hallucinate, it
[5:46] searched from our document, found the
[5:48] relevant information, and answered based
[5:50] on what it actually found. That's
[5:52] retrieval augmented generation in
[5:53] action. Retrieval, it searches document.
[5:56] Augmented, it added that information to
[5:58] the prompt. Generation, it generate the
[6:00] response based on that context.
[6:03] Now, that's a simple TXT file. Great.
[6:05] Let's go with something more complex. I
[6:07] have a 30-page technical guide here, a
[6:08] PDF. Let's see how rag handles the real
[6:11] documents.
[6:12] First, we'll create a new workspace
[6:14] [music] to keep things organized. Click
[6:16] on the plus, call it GitHub docs.
[6:19] Now, let's upload our PDF. Click upload
[6:22] document, select the file. There it is.
[6:25] Click save and embed.
[6:27] Now, it's processing. It's take a moment
[6:28] for larger files.
[6:30] And there is a reason for that.
[6:32] Something important happening behind the
[6:33] scene. So, let's see what's actually
[6:35] happened. Go to the workspace setting,
[6:37] vector database tab, and look at the
[6:39] numbers. Vector count is 21. So, what
[6:42] does that mean? Our 30-page PDF becomes
[6:45] 21 vectors.
[6:47] In other words, the document has split
[6:49] into 21 smaller pieces.
[6:51] These are called chunks. Each chunks
[6:53] roughly a paragraph or a sections.
[6:57] Why do this? Because when you ask a
[6:58] question, you don't need the entire
[7:00] 30-page document. You just need a
[7:02] specific paragraph that answers your
[7:04] questions.
[7:06] So, instead of searching through 30 page
[7:08] every time, which would be slow, the
[7:10] system search these 21 chunks, and find
[7:13] just the relevant one. Much faster, much
[7:16] more precise.
[7:18] All right. Let's test it out. What is
[7:20] Open Clo? How do I install it?
[7:22] I'm just using the readme file of our
[7:24] previous video GitHub document. And look
[7:27] at this response. Exact description what
[7:30] Open Clo is, and installation commands.
[7:32] Multiple method, curl commands, Node.js
[7:34] setup, all pulled directly from the PDF.
[7:38] Let's check the citation for reference.
[7:40] So, it searched through 21 chunks,
[7:43] and found the four most relevant one,
[7:46] and built the answer from those. This is
[7:48] the power of rag. It is not guessing,
[7:50] and it is not using general knowledge.
[7:52] It's finding the specific information in
[7:55] your document, [music] and answering
[7:56] based on exactly what it found.
[7:59] Now, if you look at the system
[8:00] resources, I see 50% utilization on both
[8:04] RAM and CPU. I don't have any GPU, and
[8:07] the network shows zeros. That means
[8:09] there is no cloud access, and [music]
[8:11] completely private.
[8:14] All right. Now, we understand how rag
[8:15] works. But remember what I said at the
[8:17] beginning, rag is a system, and learning
[8:19] how to calibrate is a skill. Let me show
[8:22] you what I mean, and the settings that
[8:23] control everything.
[8:25] So, once you understand these, you can
[8:27] tune rag for your own specific use
[8:29] cases.
[8:30] So, first, the text splitter and the
[8:32] chunking. So, go to the settings, the AI
[8:34] provider, text splitter and chunking.
[8:36] The chunk size is 1,000 characters,
[8:39] roughly a paragraph.
[8:41] This is how big each searchable pieces
[8:42] is. And overlap is 20 characters. This
[8:45] means that each chunk is slightly
[8:47] overlap with the next one. So, you don't
[8:49] lose the context at the boundaries.
[8:52] Now, why does it matters? Smaller chunks
[8:55] give you more precise search result. You
[8:57] find exactly the sentence you need, but
[9:00] you might be losing the surrounding
[9:01] context. Larger chunks keep you more
[9:03] context together, but search became less
[9:06] precise.
[9:07] So, here is the thing. Different
[9:09] document need different settings. Legal
[9:11] document with long structured paragraph
[9:13] might need a larger chunks. Customer
[9:15] support transcript might work better
[9:17] with smaller chunks.
[9:19] So, this is one of the calibration
[9:21] decision that affect how the rag works
[9:23] for your specific data.
[9:25] Moving on, next is the embedder. This is
[9:27] where the text became searchable.
[9:30] We are using Anything LLM built-in
[9:32] embedder all-MiniLM-L6-v2.
[9:35] So, what does this actually do?
[9:37] Computers can't search text by meaning.
[9:40] They search numbers. So, this model
[9:42] convert each chunks into a long list of
[9:44] numbers called embedding.
[9:46] And here is a magic. Similar meaning
[9:48] produce similar numbers.
[9:50] So, let me explain with an example.
[9:52] Keyword like "dogs allowed" and "pets
[9:55] permitted", completely different words,
[9:57] but same meaning. Their embedding would
[10:00] be very close to each other.
[10:02] But dogs allowed and remote work policy,
[10:05] totally unrelated. Their embedding would
[10:07] be far apart.
[10:09] So, this is how the search find the
[10:10] relevant content even when you don't use
[10:12] exact keyword from the document.
[10:15] It's matching the meaning, not just
[10:17] words.
[10:18] You could use other embedding provider
[10:19] like OpenAI, Azure, Ollama. But the
[10:22] built-in one runs completely locally and
[10:24] works great.
[10:26] Next is vector database. This is where
[10:28] all this embedding gets stored. We are
[10:31] using LanceDB. It's a building into
[10:33] anything LM. Runs 100% locally.
[10:37] Zero configuration needed. You could use
[10:39] Chroma or Pinecone or other.
[10:41] But for the private use, LanceDB is
[10:43] perfect.
[10:44] Think of it like pre-processed index.
[10:47] Instead of searching raw text every
[10:49] time, which is slow, we search these
[10:51] numbers that is almost instant.
[10:54] Now, let's look at the workspace level
[10:56] settings. Similarity threshold. This
[10:59] controls how closely a chunk must match
[11:01] your questions to be included in your
[11:03] result.
[11:04] >> [music]
[11:05] >> No restriction means everything is
[11:07] considered.
[11:08] Low is 25% match or higher. So, you'll
[11:12] get the results, but some might be
[11:14] loosely related. Medium is 50%. High is
[11:18] 75%.
[11:19] Very strict, only most relevant chunks
[11:22] make it through.
[11:24] So, this is how it works. So, if you are
[11:26] getting irrelevant information in your
[11:28] answers, bump it up.
[11:30] If you are getting no information found
[11:32] when you know it exists in the document,
[11:34] lower it.
[11:35] And the search preference default is
[11:37] fastest.
[11:39] Accuracy optimized takes longer, but
[11:42] find better matches.
[11:44] So, now max context snippet.
[11:47] How many chunks get sent to AI per
[11:49] questions?
[11:50] Default is four. Remember, our PDF had
[11:53] 21 chunks.
[11:55] When you ask a questions, system find
[11:57] top four most similar chunks and passes
[11:59] those to AI.
[12:01] More chunks means broader context, but
[12:04] slower response, more tokens used. Fewer
[12:07] chunks is faster, but might miss the
[12:09] relevant information. So, four is
[12:11] usually a good balance.
[12:14] Temperature. This control how creative
[12:16] the AI gets.
[12:18] Zero means deterministic. Same question
[12:20] gives same answers every time. Very
[12:23] factual, very focused. One means
[12:26] creative, more varied, more random
[12:28] response.
[12:29] For rag, you generally want lower
[12:31] temperature, maybe 0.3 to 0.5.
[12:35] You want AI to stick what's actually in
[12:38] the document.
[12:39] Not get creative and start making things
[12:42] up. If you are seeing hallucination even
[12:44] with rag, try lower this.
[12:47] And finally, system prompt. These are
[12:49] the instruction that AI follow for every
[12:51] response.
[12:52] The default works fine, but you can
[12:54] customize it. For example, add something
[12:57] like always cite your sources.
[12:59] And if you don't find the relevant
[13:00] information in the documents, say so
[13:02] instead of guessing.
[13:04] This makes AI much more reliable.
[13:07] And these settings together determine
[13:08] how well the rag system performs.
[13:12] Great, that's it. Now, we have fully
[13:13] working local rag system running on your
[13:15] machine with a complete privacy.
[13:18] And more importantly, you understand the
[13:19] core pipeline and the key settings
[13:22] needed [music] to tune it for your own
[13:23] data.
[13:24] So, in the next video, we'll explore the
[13:26] Open Web UI, a beautiful ChatGPT-like
[13:28] interface that adds rag, chat history,
[13:32] multi-model workflows.
[13:33] All the commands and the configuration
[13:35] for this video are in the GitHub. Link
[13:37] in the description. [music]
[13:39] If this helped, hit the subscribe, drop
[13:41] the comments if you got the rag working.
[13:43] And if you're stuck somewhere, [music]
[13:44] let me know.
[13:46] I'll see you on the next one. Take care.