---
title: 'The Fastest Way to Local RAG (Ollama + AnythingLLM Setup)'
source: 'https://youtube.com/watch?v=1lBKmLwHtGs'
video_id: '1lBKmLwHtGs'
date: 2026-06-19
duration_sec: 0
---

# The Fastest Way to Local RAG (Ollama + AnythingLLM Setup)

> Source: [The Fastest Way to Local RAG (Ollama + AnythingLLM Setup)](https://youtube.com/watch?v=1lBKmLwHtGs)

## Summary

This video provides a step-by-step guide to building a local Retrieval-Augmented Generation (RAG) system using Ollama and AnythingLLM. It explains how RAG lets an AI answer questions from your own documents, which reduces hallucinations and keeps your data on your machine. The tutorial covers installation, configuration, and the key calibration settings that affect retrieval quality.

### Key Points

- **What RAG Solves** [0:18] — RAG (Retrieval-Augmented Generation) allows AI to answer from your own documents, not just general knowledge.
- **Two Core Components** [1:21] — Ollama runs AI models locally; AnythingLLM handles document chunking, embedding, and retrieval.
- **Installing Ollama** [2:05] — Install Ollama from ollama.com, pull Llama 3 (8B, ~4.7 GB), and verify with a test command.
- **Setting Up AnythingLLM** [3:00] — Download AnythingLLM desktop app, choose manual setup, select Ollama as LLM provider, and keep everything local.
- **Testing with a Simple File** [4:33] — Upload a TXT file with fake company details; AI correctly answers based on the document, citing the source.
- **Handling Complex PDFs** [6:03] — A 30-page PDF is split into 21 chunks (vectors) for efficient search.
- **Chunk Size and Overlap** [8:36] — Default chunk size is 1000 characters with 20-character overlap; smaller chunks give precise search, larger chunks retain context.
- **How Embeddings Work** [9:37] — Embeddings convert text to numbers; similar meanings produce similar numbers, enabling semantic search.
- **Similarity Threshold** [10:56] — Similarity threshold (default: no restriction) controls how closely a chunk must match; adjust if results are irrelevant or missing.
- **Other Key Settings** [11:47] — Default max context snippets is 4; temperature should be low (0.3-0.5) for factual answers; system prompt can enforce citation.

## Transcript

So, what if your AI could go beyond its
training data and actually learn from
your documents? No guessing, no
hallucination, just real answers pulled
directly from your own files. That's RAG.
So, regular AI doesn't know your company
docs, your research papers, or your
private notes. It can search the web,
but it can't search your files. That's
exactly what RAG, retrieval-augmented
generation, solves. But here is what
most people miss. RAG is a system, but
learning how to calibrate it is a skill.
So, how do you chunk your data? What
similarity threshold do you set? These
decisions determine whether RAG
works well or fails. So, today I will
show you exactly how a RAG system
works end to end. We're going to
break it down into three simple steps.
First, we'll build a complete local RAG
system from scratch using a tool called
AnythingLLM. So, we'll start from the
very beginning: installing Ollama,
uploading the documents, and getting
everything running locally.
Second, we'll look under the
hood. As we go, I'll explain concepts
like chunking, embeddings, and vector
databases in a practical way. And
finally, I'll walk you through the
settings that control RAG
performance: things like chunk size,
similarity threshold, and temperature,
so you know how to tune it for your own
use cases. So, by the end of this video,
you won't just have a working RAG setup,
you'll actually understand how the
end-to-end system works behind the
scenes. So, if that sounds interesting,
let's jump straight in.
All right. So, before we start
installing anything, let me quickly
explain what we're building here. So, we
need two things to make RAG work
locally. First, Ollama. This runs the AI
model on your machine. Think of it as
the brain that generates the response.
Second, AnythingLLM. This is the RAG
platform. It handles everything:
splitting your documents into chunks,
converting them into searchable
embeddings, storing them in a vector
database, and retrieving the relevant
pieces when you ask a question.
So, in a nutshell, Ollama provides the
intelligence and AnythingLLM provides the
memory. Together, you get an AI that
actually reads your own documents and
answers from them. So, let's start with Ollama.
Great. Let's check our system first. I'm
on Windows 11 with 32 GB of RAM. You
don't need this much; 16 GB works
too. If you're on Linux or
Mac, it's the same process, nothing changes. So,
the first thing we need is Ollama. This is
what runs the AI models locally on
your machine. So, let's head over to
ollama.com, go to the download page, and
grab the installer for Windows. It's one
command in PowerShell: paste it and let it
download.
Once Ollama is installed, let's verify
the Ollama version, and we're good.
So, now we need a model. Let's pull
Llama 3. This is Meta's latest model,
around 4.7 GB for the 8-billion-parameter
version. It's a good balance of speed
and quality. So, let it download.
We can verify using the ollama list
command, and there it is: Llama 3 is
ready to go.
Quick test: ollama run, then the model name,
then the message. And you can see a
response coming from our local LLM.
Perfect. Now, if you check the API,
it is running on port 11434. This is
what AnythingLLM will connect to. So,
Ollama is ready. Let's move on.
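
To make the API step concrete, here is a minimal Python sketch of how a client could talk to the local Ollama server on port 11434. It is not shown in the video. It assumes Ollama's `/api/generate` endpoint and the model tag `llama3`; the helper names `build_generate_request` and `ask_ollama` are made up for illustration.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default local port mentioned in the video


def build_generate_request(prompt, model="llama3"):
    """Build (but do not send) a POST request for Ollama's /api/generate endpoint."""
    # "stream": False asks for one JSON object instead of a stream of partial tokens.
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_ollama(prompt, model="llama3", timeout=120):
    """Send the prompt to the local server and return the generated text."""
    request = build_generate_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["response"]
```

With Ollama running and the model pulled, calling `ask_ollama("Say hello in one sentence.")` should return text from the local model. If the server is not running, the call raises a connection error.
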
Great. Next, we'll see how to set up
AnythingLLM. This is the RAG platform
that handles everything.
So, let's head to anythingllm.com and
download the desktop app. Click download
and run the installer.
You'll notice it is downloading the CUDA
libraries. That's for Nvidia GPU
acceleration. If you have an Nvidia graphics
card, it will use it for faster processing.
If you don't, no worries, it will
automatically fall back to the CPU.
Now, it is asking to download the
meeting assistant model. We don't need
this because we are using Ollama for the AI
models. Click no and proceed. Now, the
installation is complete.
Let's launch AnythingLLM. It opens
with a beautiful setup dashboard, where
we need to configure a couple of
settings.
So, click on get started and follow the
wizard.
Next, you will see it recommends
downloading its own model, which is Qwen3
currently. Since we're going with a different
LLM, let's go with the manual setup.
And here is the interesting part. Let's
choose Ollama, and it will auto-detect
localhost:11434 and find our local
model. Perfect.
Now, here is a quick summary, which is
important to understand. The LLM
provider is Ollama, which runs locally
on our machine.
Then the embedder, which is the AnythingLLM
Embedder, also runs locally.
And finally, the vector database,
LanceDB, runs locally. So, essentially,
everything stays on our machine: no
cloud, no data leaving our
computer. It's a truly private AI
setup.
Finally, skip the survey and dismiss
the desktop assistant for now. We are
in. AnythingLLM is set up. We are ready
to see RAG in action. All right.
First, let's do a quick test. I'm going
to start with something simple, a basic
TXT file with a few lines of information.
So, I created this text file with fake
company details: Acme Corp, CEO John
Smith, founded in 2019, headquartered in
Austin, Texas,
a product called Widget Pro, and revenue
of 5 million.
Now, here is the thing. This is
completely made up. I'd guess these
specific details won't exist on the
internet, and the AI won't know this
information accurately on its own. So,
we'll upload this file. Click on upload
document, select our file, and you
will see it added as context. Now, the
document is part of this workspace.
Now, the real test. Let's ask what
the document is all about, a kind of quick
summary. And look at that response. It
correctly identifies Acme Corp, founded in
2019, Austin, Texas, Widget Pro, the
revenue, and everything. Remember, this
is made-up information. The AI only knows
this because it actually read it from our
document. Let's try another one. Who is
the CEO? John Smith. Exactly right. Now,
here is the key part. Click on the
citation, and look at this: the
rag_basic_doc.txt
reference. It's showing exactly which
document it used to answer the question.
So, this is RAG working. The AI did not
guess, it did not hallucinate. It
searched our document, found the
relevant information, and answered based
on what it actually found. That's
retrieval-augmented generation in
action. Retrieval: it searches the documents.
Augmented: it adds that information to
the prompt. Generation: it generates the
response based on that context.
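
The "augmented" step can be sketched in a few lines of Python: the retrieved chunks are pasted into the prompt ahead of the question. The video does not show the actual prompt template AnythingLLM uses, so the wording and the function name `build_augmented_prompt` below are hypothetical.

```python
def build_augmented_prompt(question, retrieved_chunks):
    """Combine retrieved text and the user's question into a single prompt."""
    # Number each chunk so the model can refer back to a specific source.
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Answer using only the context below. "
        "If the answer is not in the context, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Chunks as they might come back from the retrieval step (taken from the demo file).
chunks = [
    "Acme Corp was founded in 2019 and is headquartered in Austin, Texas.",
    "The CEO of Acme Corp is John Smith.",
]
prompt = build_augmented_prompt("Who is the CEO?", chunks)
print(prompt)
```

The model never sees the whole file, only the chunks that retrieval selected, which is why retrieval quality matters so much later on.
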
Now, that's a simple TXT file. Great.
Let's go with something more complex. I
have a 30-page technical guide here, a
PDF. Let's see how RAG handles real
documents.
First, we'll create a new workspace
to keep things organized. Click
on the plus, call it GitHub docs.
Now, let's upload our PDF. Click upload
document, select the file. There it is.
Click save and embed.
Now, it's processing. It takes a moment
for larger files.
And there is a reason for that.
Something important is happening behind the
scenes. So, let's see what actually
happened. Go to the workspace settings,
vector database tab, and look at the
numbers. Vector count is 21. So, what
does that mean? Our 30-page PDF became
21 vectors.
In other words, the document was split
into 21 smaller pieces.
These are called chunks. Each chunk is
roughly a paragraph or a section.
Why do this? Because when you ask a
question, you don't need the entire
30-page document. You just need the
specific paragraph that answers your
question.
So, instead of searching through 30 pages
every time, which would be slow, the
system searches these 21 chunks and finds
just the relevant ones. Much faster, much
more precise.
All right. Let's test it out. What is
OpenClaw? How do I install it?
I'm just using the README file from our
previous video's GitHub docs. And look
at this response: an exact description of what
OpenClaw is, and the installation commands.
Multiple methods, curl commands, Node.js
setup, all pulled directly from the PDF.
Let's check the citations for reference.
So, it searched through 21 chunks,
found the four most relevant ones,
and built the answer from those. This is
the power of RAG. It is not guessing,
and it is not using general knowledge.
It's finding the specific information in
your document and answering
based on exactly what it found.
Now, if you look at the system
resources, I see 50% utilization on both
RAM and CPU. I don't have a GPU, and
the network shows zero. That means
there is no cloud access, and it's
completely private.
All right. Now, we understand how RAG
works. But remember what I said at the
beginning: RAG is a system, and learning
how to calibrate it is a skill. Let me show
you what I mean, and the settings that
control everything.
So, once you understand these, you can
tune RAG for your own specific use
cases.
So, first, the text splitter and
chunking. Go to settings, AI
providers, text splitter and chunking.
The chunk size is 1,000 characters,
roughly a paragraph.
This is how big each searchable piece
is. And the overlap is 20 characters. This
means that each chunk slightly
overlaps with the next one, so you don't
lose context at the boundaries.
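
Chunk size and overlap are easier to see in code. The sketch below is a simplified fixed-size character splitter using the 1,000 and 20 values from the video. It is an illustration of the idea only: real splitters, including whatever AnythingLLM uses internally, typically also try to break on paragraph or sentence boundaries. The function name `split_into_chunks` is hypothetical.

```python
def split_into_chunks(text, chunk_size=1000, overlap=20):
    """Split text into fixed-size character chunks that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reached the end of the text
        start += step
    return chunks


# A 2,500-character dummy document.
document = "".join(chr(97 + i % 26) for i in range(2500))
chunks = split_into_chunks(document)
print(len(chunks), [len(c) for c in chunks])
```

The last 20 characters of one chunk are the first 20 characters of the next, so a sentence that straddles a boundary is not cut off from both sides.
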
Now, why does it matter? Smaller chunks
give you more precise search results. You
find exactly the sentence you need, but
you might lose the surrounding
context. Larger chunks keep more
context together, but search becomes less
precise.
So, here is the thing. Different
documents need different settings. Legal
documents with long structured paragraphs
might need larger chunks. Customer
support transcripts might work better
with smaller chunks.
So, this is one of the calibration
decisions that affect how RAG works
for your specific data.
Moving on, next is the embedder. This is
where the text becomes searchable.
We are using AnythingLLM's built-in
embedder, all-MiniLM-L6-v2.
So, what does this actually do?
Computers can't search text by meaning.
They search numbers. So, this model
converts each chunk into a long list of
numbers called an embedding.
And here is the magic. Similar meanings
produce similar numbers.
So, let me explain with an example.
Phrases like "dogs allowed" and "pets
permitted": completely different words,
but the same meaning. Their embeddings would
be very close to each other.
But "dogs allowed" and "remote work policy"
are totally unrelated. Their embeddings would
be far apart.
So, this is how the search finds
relevant content even when you don't use the
exact keywords from the document.
It's matching the meaning, not just the
words.
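
"Close" and "far apart" are usually measured with cosine similarity. The sketch below uses tiny hand-made three-number vectors purely for illustration. They are not real output from all-MiniLM-L6-v2, whose embeddings are much longer; only the comparison logic is the point.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy embeddings (assumption: real ones come from the embedding model).
dogs_allowed = [0.90, 0.80, 0.10]
pets_permitted = [0.85, 0.75, 0.20]
remote_work_policy = [0.10, 0.20, 0.95]

same_meaning = cosine_similarity(dogs_allowed, pets_permitted)
unrelated = cosine_similarity(dogs_allowed, remote_work_policy)
print(f"dogs allowed vs pets permitted:     {same_meaning:.2f}")
print(f"dogs allowed vs remote work policy: {unrelated:.2f}")
```

A score near 1 means the two vectors point the same way (similar meaning); a score near 0 means they are unrelated.
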
You could use other embedding providers
like OpenAI, Azure, or Ollama. But the
built-in one runs completely locally and
works great.
Next is the vector database. This is where
all these embeddings get stored. We are
using LanceDB. It's built into
AnythingLLM. Runs 100% locally.
Zero configuration needed. You could use
Chroma or Pinecone or others.
But for private use, LanceDB is
perfect.
Think of it like a pre-processed index.
Instead of searching raw text every
time, which is slow, we search these
numbers, which is almost instant.
Now, let's look at the workspace-level
settings. Similarity threshold. This
controls how closely a chunk must match
your question to be included in the
results.
No restriction means everything is
considered.
Low is a 25% match or higher. So, you'll
get results, but some might be
loosely related. Medium is 50%. High is
75%:
very strict, only the most relevant chunks
make it through.
So, this is how it works. If you are
getting irrelevant information in your
answers, bump it up.
If you are getting "no information found"
when you know it exists in the document,
lower it.
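
As a rough sketch, the threshold acts as a minimum-score filter on the retrieved chunks. The level names and percentages below come from the video; the chunk names, scores, and the `filter_by_threshold` helper are made up for illustration and are not AnythingLLM's code.

```python
# Threshold levels as described in the video, expressed as minimum similarity scores.
THRESHOLDS = {"no restriction": 0.0, "low": 0.25, "medium": 0.50, "high": 0.75}


def filter_by_threshold(scored_chunks, level="no restriction"):
    """Keep only (text, score) pairs whose score meets the chosen level."""
    minimum = THRESHOLDS[level]
    return [(text, score) for text, score in scored_chunks if score >= minimum]


# Hypothetical similarity scores for three chunks.
scored = [("install steps", 0.82), ("license notes", 0.41), ("changelog", 0.18)]

for level in THRESHOLDS:
    kept = [text for text, _ in filter_by_threshold(scored, level)]
    print(f"{level:>14}: {kept}")
```

Raising the level drops loosely related chunks; set it too high and nothing survives, which is the "no information found" case.
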
And the search preference default is
fastest.
Accuracy optimized takes longer, but
finds better matches.
So, now max context snippets:
how many chunks get sent to the AI per
question.
Default is four. Remember, our PDF had
21 chunks.
When you ask a question, the system finds the
top four most similar chunks and passes
those to the AI.
More chunks means broader context, but
slower responses and more tokens used. Fewer
chunks is faster, but might miss
relevant information. So, four is
usually a good balance.
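
Picking the top four out of 21 chunks is just a sort by similarity score. This sketch uses made-up scores and a hypothetical `top_k_chunks` helper; a vector database does the same job far more efficiently, but the selection rule is the same.

```python
def top_k_chunks(scored_chunks, k=4):
    """Return the k highest-scoring (name, score) pairs, best first."""
    return sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)[:k]


# 21 chunks with made-up, distinct similarity scores between 0 and 1.
scored = [(f"chunk-{i}", round(((i * 8) % 21) / 21, 2)) for i in range(21)]

selected = top_k_chunks(scored, k=4)
for name, score in selected:
    print(name, score)
```

Raising `k` sends more text to the model per question, which is the context versus speed trade-off described above.
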
Temperature. This controls how creative
the AI gets.
Zero means deterministic. The same question
gives the same answer every time. Very
factual, very focused. One means
creative: more varied, more random
responses.
For RAG, you generally want a lower
temperature, maybe 0.3 to 0.5.
You want the AI to stick to what's actually in
the document,
not get creative and start making things
up. If you are seeing hallucinations even
with RAG, try lowering this.
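
The video does not go into the mechanics, but the usual textbook formulation is that the model's raw token scores are divided by the temperature before being turned into probabilities. The sketch below shows that effect with made-up scores for three candidate tokens; it is not a description of Ollama's internals.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        # Temperature 0 is normally handled as "always pick the top token".
        raise ValueError("temperature must be greater than 0")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
for t in (0.3, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature {t}: {[round(p, 3) for p in probs]}")
```

At the lower temperature almost all of the probability lands on the top-scoring token, so answers vary less from run to run.
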
And finally, the system prompt. These are
the instructions that the AI follows for every
response.
The default works fine, but you can
customize it. For example, add something
like: always cite your sources,
and if you don't find the relevant
information in the documents, say so
instead of guessing.
This makes the AI much more reliable.
And these settings together determine
how well the RAG system performs.
Great, that's it. Now, we have a fully
working local RAG system running on your
machine with complete privacy.
And more importantly, you understand the
core pipeline and the key settings
needed to tune it for your own
data.
So, in the next video, we'll explore
Open WebUI, a beautiful ChatGPT-like
interface that adds RAG, chat history,
and multi-model workflows.
All the commands and the configuration
for this video are on GitHub. Link
in the description.
If this helped, hit subscribe and drop
a comment if you got RAG working.
And if you're stuck somewhere,
let me know.
I'll see you in the next one. Take care.
