---
title: 'Run Your Own Agentic AI? 🦙 Full Ollama Setup + Hermes Workflow'
source: 'https://youtube.com/watch?v=4KXLW9Y1r4c'
video_id: '4KXLW9Y1r4c'
date: 2026-06-30
duration_sec: 957
---

# Run Your Own Agentic AI? 🦙 Full Ollama Setup + Hermes Workflow

> Source: [Run Your Own Agentic AI? 🦙 Full Ollama Setup + Hermes Workflow](https://youtube.com/watch?v=4KXLW9Y1r4c)

## Summary



## Transcript

That just took, from start to finish, 1
minute or so to get a local model
running on my computer where I can now
just have a chat with it like I would
with ChatGPT, and it's 100% local. Every
single message stays entirely on my
computer. It's full privacy. Every time
you type something into AI online,
you're paying for it, either through
your data, your money, or both. But what
if you could run an AI assistant that
knew everything in your notes, could
answer questions about your own work,
and that entire conversation never left
your computer? This is what a local AI
model gives you. It's your own private
AI. It runs on your machine, it costs
nothing, and no one else will ever see
what you type. Let's get that set up.
Hi, my name is Callum, also known as
Water Lootz, and welcome to today's
video on running a local model on your
computer using Ollama. I'm using Ollama
for today's video, but the principles
I'm talking about apply to all different
kinds of local model frameworks. I've
just personally found Ollama to be the
easiest to get set up and then connect
to other tools like AI agents. In
today's video, I'll go through the three
reasons why running a local model is
worth it, how to install Ollama and get
your first model running, and how to
create a customized variant of that
model, and why agentic AI tools like
Hermes need something like that. They
need that customization. Anytime I talk
about connecting to a local model, or
using a local model, or running your own
LLM, this is what I'm talking about. If
you find this video helpful, please
like, hype, and subscribe as I really
appreciate your support a lot. If you're
looking for more ways to support me,
please consider joining my YouTube or
Patreon membership where I give more
tips, insights, and kits on the world of
AI and knowledge. Now, let's get your
own private AI running.
Before we get started on actually
running anything, installing something
on your computer, I want to talk about
the three reasons on why it's worth
running your own local model so that you
understand the rest of the video better.
The first one is privacy. When you type
into a cloud model on a website or
through an API, that conversation goes
to a server, someone else's server. With
a local model, nothing leaves your
machine, ever. That means you can feed
it sensitive notes, client work,
personal writing, whatever you want, and
that conversation, all of that
information and data, will only ever
stay on your computer. It never leaves
it. So, when you're running a local
model, you can be confident that no one
else is looking at what you're putting
into that chat. The second is cost.
Cloud AI runs on a subscription model or
a per use building per API call. A local
model downloads once and runs forever.
The only cost is the electricity to run
it, but compared to cloud services, that
cost is negligible. Third is that it
works offline. Once the model is
downloaded on your machine, on your
computer, you can use it forever from
anywhere even if you don't have internet
access. And I know all of that can feel
a little abstract if you're not sure how
you can actually begin using this local
model. So, my personal favorite method,
a great practical example, is through
Obsidian or note-taking.
If you use a note-taking app like
Obsidian as your personal knowledge
management system, your second brain,
you can connect a local model directly
to your vault, ask questions across
thousands of your own notes, build a
personal wiki, summarize your research,
and that entire conversation related to
your notes always stays private and on
your computer. It's a way to use your
own knowledge with AI and maintain
privacy. A great example of extending
that beyond your own notes is through
something called the LLM wiki, where you
can use an AI agent to summarize
information and build out your own
personal Wikipedia of information that
you're interested in. So, if you want to
learn more about the LLM wiki and how
you can use it to reduce information
overload and connect models to Obsidian,
I recommend checking out my videos on
that. So, that's the context on why you
should care about running your own local
model, but now let's actually build one.
Let's get it set up.
The easiest in my experience has been
Ollama. I've been enjoying that a lot.
And there's two different ways you can
do it. You can go to your terminal, you
can copy this curl command and just put
that in, and it will download and
install it for you. Or, you can click
download, and then you can download it
for your local device. I'm using a Mac,
so I'm going to install it like that.
Once you've downloaded Ollama, you can
check to make sure that it has installed
by typing in Ollama version. And we can
see that it's currently not running, but
it has version is 30.1. And that's it.
We have the platform that's going to be
able to run our local model for us. The
next step that we need to do is download
a model.
So, if we go to Ollama again, we see
models. There are a ton of models that
you can go through. Many of them are
good for different use cases, but a lot
of them it significantly depends on what
type of hardware do you have? What is
your computer running? So, for most
people with 8 to 16 GB of RAM, you're
probably going to want to run something
like Gemma 4. This is Google's latest
open weight model. And if we scroll down
a little bit, we can see how it's able
to be run. You can use it in Cloud Code,
CodeX OpenClaw Hermes CodeX
OpenCode, whatever you want. I've
personally been using it for Hermes a
lot lately, so I'm going to talk about
that a little bit more later. But, what
I really wanted to show you is if we
scroll down to models here, we see all
of the different variations of the
models that we can run. We can also see,
importantly, how big is the model, how
much space is it going to take up on our
hard drive, and what is the context
window, how much short-term working
memory does it have? So, what's cool
about Gemma 4 is that not only do we
have the most powerful version, for
example, 31 billion parameters, which
would take up 20 gigs, and I would
struggle to run it potentially on my 32
GB RAM computer, but we have something
called the E4B and the E2B models. So, E
stands for effective parameter. So,
basically what's happening with the
effective parameters is that it uses
something called per-layer embeddings,
which lets a smaller number of
parameters, less power, less thinking,
that operates like it's a bigger model.
So, in real-world behavior, the E4B
operates similar to the 12 billion
parameter model. And how that applies to
practical terms is something like an
E2B, you probably want to run on a
low-end laptop with only maybe 4 GB of
RAM, or on a phone. This is good for
running on a mobile device. The E4B can
handle something with 8 GB of RAM, which
should be almost everyone's laptop. And
then if you're getting into the 31,
that's where you need a much bigger RAM.
So, I'm able to run the 12 billion
parameter on my 32 GB RAM, but I figured
since most people are going to be using
the E4B, why don't we install that one
today? So, how do we install it?
Well, that's where we get into the
commands in Ollama. So, what we can do
is we can copy what this one's called,
the E4B, and we can go over to our
terminal again and type in Ollama pull
and then Gemma E4B. We click that, click
enter, and it's pulling in the manifest,
writing it, and installing. Great. So,
now what we can do is we can run Ollama
list to make sure that it works
properly. And we can see here we have
the Gemma 4 E4B model downloaded. So,
this is the one that I just installed 7
seconds ago. I also have the 12 billion
parameter and I've got a couple others
that I've installed previously. And
you'll notice too that there's something
called Gemma 4 64K. And I'm going to
explain that in a moment because it's
really important for certain use cases
with the Gentic AI. So, now why don't we
try getting this going?
To get it started, all you need to do is
write Ollama run Gemma 4 E4B. And we can
see it's thinking down in the bottom
here. And there we go. It just says send
a message. That's it. We can see it's
thinking, which is pretty cool. It knows
already in its thinking process that
it's Gemma. And here's the answer. I'm
Gemma 4, a large language model
developed by Google DeepMind. I'm an
open weights model and my purpose is to
assist you. How can I help you? So,
that's pretty cool. That just took from
start to finish maybe 1 minute or so to
get a local model running on my computer
where I can now just have a chat with it
like I would with ChatGPT. And it's 100%
local. Every single message I send to
Gemma 4 stays entirely on my computer.
It's full privacy.
And just very quickly, if you're looking
for an easier way to chat with your
local models, you can also use the
Ollama desktop app, which is what I
downloaded to install the command line
interface in terminal before, what we
were using. So, if you want to have
conversations and then you go back and
continue the conversation, you can
always switch the model like we did
here. And those conversations will stay
on the side stored on your computer in
the same way that you would use
something like chat GPT or cloud code,
but all of this is happening 100%
locally if I've selected a model that
I've downloaded here.
But that's just the beginning of using
Ollama. Yeah, you can talk with it here
in chat, but that's not really what I
want to use it for. I want to be able to
connect it to another tool so that it
becomes more powerful and whenever I
want my local agent, for example,
something like my Hermes agent to be
able to run and do things for me and
connect locally to my running model in
Ollama, I need to do a couple more
things. So the first thing we can do
here is end the conversation by typing
{slash} bye. And next, what I want to
show you is how we can go back to these
models here and we can create what's
called a variant.
So a variant is basically just a
configuration, a modified version of the
one that we already have here, but it
doesn't duplicate it. It doesn't take up
another 10 gigs on your hard drive. It's
just a different way of launching the
same underlying model. So, for example,
if I type in Ollama PS, we can see that
we have Gemma 4 E4B. It's 3.3 gigs,
operating 100% in my GPU, but its
context window is 32,768.
So, that's where we start to get into
potentially a problem depending on your
use case. For a lot of things, this is
totally fine, but for example, if I go
over to Hermes and we take a look at
running uh local model inside of Hermes
agent, we can see that a lot of Ollama
models use a 2048 token context
limitation and Hermes requires 64,000 to
give your agent tools. So, it even shows
you exactly how to get this set up
inside of Hermes, but I'm going to use
something a little bit easier.
Basically, what we need to do is we need
to create a model file that extends the
context. So, we need to create a little
temporary file, pull it from the Gemma
model that we're using, establish the
context that we want, and then get it to
create the new model based on that
modified one. So, let's just start a new
terminal here so we can paste in this
string, which I will give you in the
description or a link to an article that
has it, and we are pulling in Gemma 4
E4B, which is the model that we just
downloaded, and we're specifying that we
want the context to be 64,000. So, it's
creating a little variant file. You can
say, "Okay, here we go. Gathering model
components using existing layer.
Success." And now if we go Ollama list,
we can see we have Gemma 4E 64K variant.
So, we just created this new model 8
seconds ago, and now if we run it, and
then we type in Ollama PS, we can see
that now it's pulling in this model, and
we have the context of 64,000. Instead
of 3.3 gigs, it's now 3.4, so it's using
up a little bit more space, but we've
doubled the amount of working memory
that the model is able to use running
inside of Ollama.
And you can also control the context
window at the system level in Ollama. If
we go up to settings here, and we can
see that we have this context length.
So, I can move this up to 64, and then
whenever it launches a new model, it
will always use the 64,000. So, we don't
have to then create that variant
manually like I did, but this will force
everyone of your models to run at
64,000. So, if you only want to change
the one model, then you would do it
using the method I showed you in the
terminal. But this is a pretty easy way
to get all of your Ollama models ready
to go with AI agents. So, that becomes
really important because now we can
connect it to something like Hermes, and
we're able to actually use it. Which if
you remember here, it says that it
requires 64,000. But now that we have
the model in the format that we want,
how do we actually connect it to a tool
like Hermes or Claude Code or Codex?
This is where we get into what's called
an endpoint.
So, I'm going to use Hermes just as an
example here because this is what I've
been exploring lately, but you can use
this with any type of system you want.
But basically what we need to do is,
rather than for example pointing it to
ChatGPT, we need to point Hermes or
Claude Code or Codex to the local model
that we have sitting on our computer
right here. So, this is where we get
into what's called an endpoint, and this
is what it looks like. It's just a
string of numbers that is an address
based on what's running or can run
locally on our computer. This is an
Ollama native endpoint, and then if we
add a V1, it makes it compatible with
what's called an OpenAI endpoint. So,
that's what a lot of platforms use. So,
for example, if we go into Hermes, I'm
going to go to my Gemma profile, and I
can go to models, and I can go down to
custom endpoint and click set up custom
endpoint, and this is where, like I was
talking about it, even includes an
example that's very similar. All I have
to do is type in that string, that
address that I was showing you, and then
we click connect. We now have a custom
endpoint connected here. If I go back
down to the models, we now see that we
have this model right here, Gemma 4e 64k
latest. So, that's the same model that I
just created the variant of right here.
So, now if I go and want to have a
conversation and I type in, "Hi, who are
you?"
It might take a moment to run. If you
remember earlier when I clicked run on
Ollama, it had that little spinning icon
for a moment. So, right now what Hermes
is doing is it's spawning a little agent
inside of Ollama, a locally running
model. And then once I get that started
up, it's warmed up, I'll be able to have
a conversation, and it can run for a few
minutes before it goes back to sleep,
which I'll explain in a moment. Okay,
there we go. So, the model just woke up.
It's analyzing the request, and we can
see it's thinking right here. But this
time, rather than just saying, "Hi, I'm
Gemma." Instead, we're running it
through Hermes. So, now it recognizes
that it's a Hermes agent running the
Gemma 4 model. So, that's pretty cool.
So, we have as many different options as
we want. Once we have the models
installed on our computer, we can then
have different agents harness the power
of that model. That's why this is called
an agent harness. You could use Claude
code, you could use Codex, you could use
Open Claw, you could use Hermes. This is
kind of the beginning of fully private,
locally run AI. And what's nice too is
with an agent like this, you could have
it running 24/7, but in order to do
that, we would need to keep the model
loaded, and this is where we can modify
some more settings of Ollama so that it
doesn't unload after a few minutes. So,
you remember it took a few seconds there
to get running. We could have it be
always on ready to go. And what's cool,
too, is if you have a more powerful
system, like a desktop computer with
bigger RAM, for example, I can
potentially run a bigger parameter model
on my desktop and then connect to it
from my laptop, so I can leverage the
power of a local model running on one
computer, but then access it from
another one, which I have another video
that explains how to do that with
Hermes, if you're interested.
Like I mentioned earlier, one of the
reasons I'm most excited to run a local
model is because I can manage my
information in something like Obsidian,
and I can potentially have sensitive
information sitting inside of my
Obsidian vault, and I'm only running a
local model that's accessing that
information, so it's not being sent to a
cloud provider. So, once you begin
working with local models, it really
opens the door to a lot of interesting
workflows that maintain the privacy of
your data.
And there are so many different models
that you can use here. You can also run
embedding models for setting up like a
document retrieval system. You can run
vision, thinking, tools. There's coding
models. So, instead of paying for Claude
Code or Codex all of the time, you could
run a local model for coding. Qwen 3.6
is supposed to be incredible, and it has
a few different sizes. So, this is a
bigger model, but there's a lot that we
can do here. Running local models really
opens a lot of doors. So, I highly
recommend exploring it. Experiment with
it yourself.
And there you have it. You have your own
personal private AI running locally on
your computer with a context-tuned
variant that can be plugged into
different Aagentic AI tools. To help you
understand more of the practicality of
this, how you can use this in real-world
situations, I'm putting together an
Aagentic AI playlist using Hermes, but
other tools as well, that connect to
local models, so you can autonomously
run tasks that hopefully make your life
a little bit easier. So, I recommend
checking out that playlist if you want
to go deeper into anything that I've
talked about today or see how you can
expand this system. Also, let me know if
you have any questions about what I
talked about, as I know this can be
confusing for new users to AI, and
especially running local models on your
computer. So, if you have questions or
there's anything you'd like to see me
work on in future videos, please let me
know in the comments and I'm happy to
help you out. A reminder to please like
and subscribe if you found this video
helpful and consider sharing with a
friend who's perhaps AI curious but has
been wary of the privacy and the data
and the cost because a local model is a
great way to get people into using AI to
make their lives easier without dealing
with a lot of the potential issues that
people face with these types of tools.
Thanks for watching and I will see you
in the next video.
