Run Your Own Agentic AI? 🦙 Full Ollama Setup + Hermes Workflow

0h 15m video Transcribed Jun 30, 2026 W Wanderloots

24.0K

Views

1.1K

Likes

55

Comments

11

Dislikes

4.6%

🔥 High Engagement

✂️ Creator Tools: Viral Hooks

AI-generated clip ideas for Shorts based on the transcript

Run AI 100% Locally in 1 Minute

33s

The quick demonstration of setting up a private, local AI in under a minute is both surprising and empowering, appealing to privacy-conscious users.

▶ Play Clip

3 Reasons to Switch to Local AI

35s

The clear, relatable breakdown of privacy, cost, and offline benefits taps into widespread concerns about data security and subscription fatigue.

▶ Play Clip

Install Ollama in Seconds

37s

Showing a simple terminal command to install a powerful AI framework makes the process feel accessible and demystifies local AI for beginners.

▶ Play Clip

Chat with Your Own Private AI

41s

The live demo of chatting with a locally running model that works just like ChatGPT creates immediate 'I want this' engagement.

▶ Play Clip

Boost AI Context to 64K for Agents

50s

Solving a common frustration (limited context) with a simple terminal trick appeals to advanced users and highlights a key advantage for agentic AI workflows.

▶ Play Clip

Full Transcript

Download .txt Download .md

[00:00] That just took, from start to finish, 1

[00:01] minute or so to get a local model

[00:03] running on my computer where I can now

[00:05] just have a chat with it like I would

[00:06] with ChatGPT, and it's 100% local. Every

[00:09] single message stays entirely on my

[00:11] computer. It's full privacy. Every time

[00:13] you type something into AI online,

[00:16] you're paying for it, either through

[00:17] your data, your money, or both. But what

[00:19] if you could run an AI assistant that

[00:20] knew everything in your notes, could

[00:22] answer questions about your own work,

[00:23] and that entire conversation never left

[00:25] your computer? This is what a local AI

[00:27] model gives you. It's your own private

[00:29] AI. It runs on your machine, it costs

[00:31] nothing, and no one else will ever see

[00:33] what you type. Let's get that set up.

[00:35] Hi, my name is Callum, also known as

[00:37] Water Lootz, and welcome to today's

[00:38] video on running a local model on your

[00:40] computer using Ollama. I'm using Ollama

[00:42] for today's video, but the principles

[00:44] I'm talking about apply to all different

[00:46] kinds of local model frameworks. I've

[00:48] just personally found Ollama to be the

[00:49] easiest to get set up and then connect

[00:51] to other tools like AI agents. In

[00:53] today's video, I'll go through the three

[00:54] reasons why running a local model is

[00:56] worth it, how to install Ollama and get

[00:58] your first model running, and how to

[01:00] create a customized variant of that

[01:01] model, and why agentic AI tools like

[01:03] Hermes need something like that. They

[01:05] need that customization. Anytime I talk

[01:07] about connecting to a local model, or

[01:09] using a local model, or running your own

[01:11] LLM, this is what I'm talking about. If

[01:13] you find this video helpful, please

[01:14] like, hype, and subscribe as I really

[01:16] appreciate your support a lot. If you're

[01:18] looking for more ways to support me,

[01:19] please consider joining my YouTube or

[01:20] Patreon membership where I give more

[01:22] tips, insights, and kits on the world of

[01:24] AI and knowledge. Now, let's get your

[01:26] own private AI running.

[01:29] Before we get started on actually

[01:30] running anything, installing something

[01:32] on your computer, I want to talk about

[01:33] the three reasons on why it's worth

[01:35] running your own local model so that you

[01:37] understand the rest of the video better.

[01:39] The first one is privacy. When you type

[01:41] into a cloud model on a website or

[01:42] through an API, that conversation goes

[01:44] to a server, someone else's server. With

[01:47] a local model, nothing leaves your

[01:48] machine, ever. That means you can feed

[01:50] it sensitive notes, client work,

[01:52] personal writing, whatever you want, and

[01:54] that conversation, all of that

[01:56] information and data, will only ever

[01:58] stay on your computer. It never leaves

[01:59] it. So, when you're running a local

[02:01] model, you can be confident that no one

[02:02] else is looking at what you're putting

[02:04] into that chat. The second is cost.

[02:06] Cloud AI runs on a subscription model or

[02:09] a per use building per API call. A local

[02:11] model downloads once and runs forever.

[02:13] The only cost is the electricity to run

[02:15] it, but compared to cloud services, that

[02:17] cost is negligible. Third is that it

[02:19] works offline. Once the model is

[02:21] downloaded on your machine, on your

[02:22] computer, you can use it forever from

[02:24] anywhere even if you don't have internet

[02:26] access. And I know all of that can feel

[02:27] a little abstract if you're not sure how

[02:29] you can actually begin using this local

[02:31] model. So, my personal favorite method,

[02:33] a great practical example, is through

[02:35] Obsidian or note-taking.

[02:38] If you use a note-taking app like

[02:39] Obsidian as your personal knowledge

[02:41] management system, your second brain,

[02:43] you can connect a local model directly

[02:44] to your vault, ask questions across

[02:46] thousands of your own notes, build a

[02:48] personal wiki, summarize your research,

[02:50] and that entire conversation related to

[02:52] your notes always stays private and on

[02:54] your computer. It's a way to use your

[02:56] own knowledge with AI and maintain

[02:58] privacy. A great example of extending

[03:00] that beyond your own notes is through

[03:01] something called the LLM wiki, where you

[03:03] can use an AI agent to summarize

[03:05] information and build out your own

[03:07] personal Wikipedia of information that

[03:09] you're interested in. So, if you want to

[03:10] learn more about the LLM wiki and how

[03:12] you can use it to reduce information

[03:13] overload and connect models to Obsidian,

[03:15] I recommend checking out my videos on

[03:17] that. So, that's the context on why you

[03:18] should care about running your own local

[03:20] model, but now let's actually build one.

[03:21] Let's get it set up.

[03:24] The easiest in my experience has been

[03:25] Ollama. I've been enjoying that a lot.

[03:27] And there's two different ways you can

[03:28] do it. You can go to your terminal, you

[03:29] can copy this curl command and just put

[03:31] that in, and it will download and

[03:33] install it for you. Or, you can click

[03:35] download, and then you can download it

[03:37] for your local device. I'm using a Mac,

[03:39] so I'm going to install it like that.

[03:40] Once you've downloaded Ollama, you can

[03:41] check to make sure that it has installed

[03:43] by typing in Ollama version. And we can

[03:45] see that it's currently not running, but

[03:47] it has version is 30.1. And that's it.

[03:49] We have the platform that's going to be

[03:51] able to run our local model for us. The

[03:54] next step that we need to do is download

[03:55] a model.

[03:57] So, if we go to Ollama again, we see

[03:59] models. There are a ton of models that

[04:02] you can go through. Many of them are

[04:03] good for different use cases, but a lot

[04:05] of them it significantly depends on what

[04:08] type of hardware do you have? What is

[04:10] your computer running? So, for most

[04:12] people with 8 to 16 GB of RAM, you're

[04:14] probably going to want to run something

[04:16] like Gemma 4. This is Google's latest

[04:17] open weight model. And if we scroll down

[04:19] a little bit, we can see how it's able

[04:21] to be run. You can use it in Cloud Code,

[04:23] CodeX OpenClaw Hermes CodeX

[04:26] OpenCode, whatever you want. I've

[04:27] personally been using it for Hermes a

[04:29] lot lately, so I'm going to talk about

[04:30] that a little bit more later. But, what

[04:32] I really wanted to show you is if we

[04:33] scroll down to models here, we see all

[04:35] of the different variations of the

[04:37] models that we can run. We can also see,

[04:39] importantly, how big is the model, how

[04:41] much space is it going to take up on our

[04:43] hard drive, and what is the context

[04:44] window, how much short-term working

[04:47] memory does it have? So, what's cool

[04:49] about Gemma 4 is that not only do we

[04:51] have the most powerful version, for

[04:53] example, 31 billion parameters, which

[04:55] would take up 20 gigs, and I would

[04:57] struggle to run it potentially on my 32

[05:00] GB RAM computer, but we have something

[05:02] called the E4B and the E2B models. So, E

[05:05] stands for effective parameter. So,

[05:08] basically what's happening with the

[05:09] effective parameters is that it uses

[05:11] something called per-layer embeddings,

[05:13] which lets a smaller number of

[05:15] parameters, less power, less thinking,

[05:17] that operates like it's a bigger model.

[05:19] So, in real-world behavior, the E4B

[05:22] operates similar to the 12 billion

[05:25] parameter model. And how that applies to

[05:27] practical terms is something like an

[05:28] E2B, you probably want to run on a

[05:30] low-end laptop with only maybe 4 GB of

[05:32] RAM, or on a phone. This is good for

[05:35] running on a mobile device. The E4B can

[05:37] handle something with 8 GB of RAM, which

[05:39] should be almost everyone's laptop. And

[05:41] then if you're getting into the 31,

[05:43] that's where you need a much bigger RAM.

[05:45] So, I'm able to run the 12 billion

[05:46] parameter on my 32 GB RAM, but I figured

[05:50] since most people are going to be using

[05:51] the E4B, why don't we install that one

[05:53] today? So, how do we install it?

[05:56] Well, that's where we get into the

[05:57] commands in Ollama. So, what we can do

[06:00] is we can copy what this one's called,

[06:02] the E4B, and we can go over to our

[06:04] terminal again and type in Ollama pull

[06:06] and then Gemma E4B. We click that, click

[06:09] enter, and it's pulling in the manifest,

[06:11] writing it, and installing. Great. So,

[06:13] now what we can do is we can run Ollama

[06:15] list to make sure that it works

[06:16] properly. And we can see here we have

[06:18] the Gemma 4 E4B model downloaded. So,

[06:21] this is the one that I just installed 7

[06:23] seconds ago. I also have the 12 billion

[06:25] parameter and I've got a couple others

[06:26] that I've installed previously. And

[06:28] you'll notice too that there's something

[06:29] called Gemma 4 64K. And I'm going to

[06:31] explain that in a moment because it's

[06:33] really important for certain use cases

[06:35] with the Gentic AI. So, now why don't we

[06:37] try getting this going?

[06:39] To get it started, all you need to do is

[06:41] write Ollama run Gemma 4 E4B. And we can

[06:45] see it's thinking down in the bottom

[06:47] here. And there we go. It just says send

[06:49] a message. That's it. We can see it's

[06:50] thinking, which is pretty cool. It knows

[06:52] already in its thinking process that

[06:54] it's Gemma. And here's the answer. I'm

[06:56] Gemma 4, a large language model

[06:57] developed by Google DeepMind. I'm an

[06:59] open weights model and my purpose is to

[07:00] assist you. How can I help you? So,

[07:02] that's pretty cool. That just took from

[07:04] start to finish maybe 1 minute or so to

[07:07] get a local model running on my computer

[07:09] where I can now just have a chat with it

[07:10] like I would with ChatGPT. And it's 100%

[07:13] local. Every single message I send to

[07:15] Gemma 4 stays entirely on my computer.

[07:18] It's full privacy.

[07:21] And just very quickly, if you're looking

[07:22] for an easier way to chat with your

[07:23] local models, you can also use the

[07:25] Ollama desktop app, which is what I

[07:27] downloaded to install the command line

[07:30] interface in terminal before, what we

[07:31] were using. So, if you want to have

[07:33] conversations and then you go back and

[07:34] continue the conversation, you can

[07:36] always switch the model like we did

[07:37] here. And those conversations will stay

[07:39] on the side stored on your computer in

[07:40] the same way that you would use

[07:42] something like chat GPT or cloud code,

[07:44] but all of this is happening 100%

[07:45] locally if I've selected a model that

[07:47] I've downloaded here.

[07:50] But that's just the beginning of using

[07:51] Ollama. Yeah, you can talk with it here

[07:53] in chat, but that's not really what I

[07:55] want to use it for. I want to be able to

[07:56] connect it to another tool so that it

[07:58] becomes more powerful and whenever I

[08:00] want my local agent, for example,

[08:02] something like my Hermes agent to be

[08:04] able to run and do things for me and

[08:05] connect locally to my running model in

[08:08] Ollama, I need to do a couple more

[08:10] things. So the first thing we can do

[08:11] here is end the conversation by typing

[08:13] {slash} bye. And next, what I want to

[08:15] show you is how we can go back to these

[08:17] models here and we can create what's

[08:18] called a variant.

[08:21] So a variant is basically just a

[08:23] configuration, a modified version of the

[08:25] one that we already have here, but it

[08:27] doesn't duplicate it. It doesn't take up

[08:28] another 10 gigs on your hard drive. It's

[08:30] just a different way of launching the

[08:32] same underlying model. So, for example,

[08:34] if I type in Ollama PS, we can see that

[08:37] we have Gemma 4 E4B. It's 3.3 gigs,

[08:40] operating 100% in my GPU, but its

[08:42] context window is 32,768.

[08:45] So, that's where we start to get into

[08:47] potentially a problem depending on your

[08:49] use case. For a lot of things, this is

[08:50] totally fine, but for example, if I go

[08:53] over to Hermes and we take a look at

[08:54] running uh local model inside of Hermes

[08:57] agent, we can see that a lot of Ollama

[09:00] models use a 2048 token context

[09:02] limitation and Hermes requires 64,000 to

[09:06] give your agent tools. So, it even shows

[09:09] you exactly how to get this set up

[09:10] inside of Hermes, but I'm going to use

[09:12] something a little bit easier.

[09:13] Basically, what we need to do is we need

[09:15] to create a model file that extends the

[09:16] context. So, we need to create a little

[09:18] temporary file, pull it from the Gemma

[09:21] model that we're using, establish the

[09:23] context that we want, and then get it to

[09:24] create the new model based on that

[09:27] modified one. So, let's just start a new

[09:29] terminal here so we can paste in this

[09:30] string, which I will give you in the

[09:32] description or a link to an article that

[09:34] has it, and we are pulling in Gemma 4

[09:36] E4B, which is the model that we just

[09:38] downloaded, and we're specifying that we

[09:40] want the context to be 64,000. So, it's

[09:43] creating a little variant file. You can

[09:45] say, "Okay, here we go. Gathering model

[09:46] components using existing layer.

[09:48] Success." And now if we go Ollama list,

[09:51] we can see we have Gemma 4E 64K variant.

[09:54] So, we just created this new model 8

[09:56] seconds ago, and now if we run it, and

[09:58] then we type in Ollama PS, we can see

[10:00] that now it's pulling in this model, and

[10:02] we have the context of 64,000. Instead

[10:05] of 3.3 gigs, it's now 3.4, so it's using

[10:07] up a little bit more space, but we've

[10:09] doubled the amount of working memory

[10:11] that the model is able to use running

[10:13] inside of Ollama.

[10:15] And you can also control the context

[10:17] window at the system level in Ollama. If

[10:19] we go up to settings here, and we can

[10:20] see that we have this context length.

[10:23] So, I can move this up to 64, and then

[10:25] whenever it launches a new model, it

[10:27] will always use the 64,000. So, we don't

[10:29] have to then create that variant

[10:31] manually like I did, but this will force

[10:34] everyone of your models to run at

[10:36] 64,000. So, if you only want to change

[10:38] the one model, then you would do it

[10:40] using the method I showed you in the

[10:41] terminal. But this is a pretty easy way

[10:43] to get all of your Ollama models ready

[10:45] to go with AI agents. So, that becomes

[10:47] really important because now we can

[10:49] connect it to something like Hermes, and

[10:51] we're able to actually use it. Which if

[10:52] you remember here, it says that it

[10:53] requires 64,000. But now that we have

[10:56] the model in the format that we want,

[10:58] how do we actually connect it to a tool

[11:00] like Hermes or Claude Code or Codex?

[11:02] This is where we get into what's called

[11:04] an endpoint.

[11:06] So, I'm going to use Hermes just as an

[11:08] example here because this is what I've

[11:09] been exploring lately, but you can use

[11:10] this with any type of system you want.

[11:12] But basically what we need to do is,

[11:14] rather than for example pointing it to

[11:16] ChatGPT, we need to point Hermes or

[11:18] Claude Code or Codex to the local model

[11:21] that we have sitting on our computer

[11:23] right here. So, this is where we get

[11:24] into what's called an endpoint, and this

[11:26] is what it looks like. It's just a

[11:27] string of numbers that is an address

[11:29] based on what's running or can run

[11:31] locally on our computer. This is an

[11:33] Ollama native endpoint, and then if we

[11:36] add a V1, it makes it compatible with

[11:38] what's called an OpenAI endpoint. So,

[11:40] that's what a lot of platforms use. So,

[11:42] for example, if we go into Hermes, I'm

[11:44] going to go to my Gemma profile, and I

[11:46] can go to models, and I can go down to

[11:48] custom endpoint and click set up custom

[11:50] endpoint, and this is where, like I was

[11:52] talking about it, even includes an

[11:53] example that's very similar. All I have

[11:55] to do is type in that string, that

[11:57] address that I was showing you, and then

[11:58] we click connect. We now have a custom

[12:00] endpoint connected here. If I go back

[12:02] down to the models, we now see that we

[12:05] have this model right here, Gemma 4e 64k

[12:08] latest. So, that's the same model that I

[12:10] just created the variant of right here.

[12:12] So, now if I go and want to have a

[12:13] conversation and I type in, "Hi, who are

[12:15] you?"

[12:17] It might take a moment to run. If you

[12:18] remember earlier when I clicked run on

[12:21] Ollama, it had that little spinning icon

[12:23] for a moment. So, right now what Hermes

[12:24] is doing is it's spawning a little agent

[12:27] inside of Ollama, a locally running

[12:29] model. And then once I get that started

[12:32] up, it's warmed up, I'll be able to have

[12:33] a conversation, and it can run for a few

[12:36] minutes before it goes back to sleep,

[12:38] which I'll explain in a moment. Okay,

[12:39] there we go. So, the model just woke up.

[12:40] It's analyzing the request, and we can

[12:42] see it's thinking right here. But this

[12:44] time, rather than just saying, "Hi, I'm

[12:46] Gemma." Instead, we're running it

[12:48] through Hermes. So, now it recognizes

[12:51] that it's a Hermes agent running the

[12:53] Gemma 4 model. So, that's pretty cool.

[12:54] So, we have as many different options as

[12:56] we want. Once we have the models

[12:57] installed on our computer, we can then

[12:59] have different agents harness the power

[13:02] of that model. That's why this is called

[13:03] an agent harness. You could use Claude

[13:05] code, you could use Codex, you could use

[13:07] Open Claw, you could use Hermes. This is

[13:09] kind of the beginning of fully private,

[13:12] locally run AI. And what's nice too is

[13:14] with an agent like this, you could have

[13:15] it running 24/7, but in order to do

[13:17] that, we would need to keep the model

[13:18] loaded, and this is where we can modify

[13:20] some more settings of Ollama so that it

[13:22] doesn't unload after a few minutes. So,

[13:24] you remember it took a few seconds there

[13:26] to get running. We could have it be

[13:28] always on ready to go. And what's cool,

[13:30] too, is if you have a more powerful

[13:31] system, like a desktop computer with

[13:33] bigger RAM, for example, I can

[13:35] potentially run a bigger parameter model

[13:37] on my desktop and then connect to it

[13:39] from my laptop, so I can leverage the

[13:41] power of a local model running on one

[13:43] computer, but then access it from

[13:45] another one, which I have another video

[13:46] that explains how to do that with

[13:47] Hermes, if you're interested.

[13:50] Like I mentioned earlier, one of the

[13:52] reasons I'm most excited to run a local

[13:53] model is because I can manage my

[13:55] information in something like Obsidian,

[13:57] and I can potentially have sensitive

[13:58] information sitting inside of my

[14:00] Obsidian vault, and I'm only running a

[14:02] local model that's accessing that

[14:03] information, so it's not being sent to a

[14:05] cloud provider. So, once you begin

[14:07] working with local models, it really

[14:08] opens the door to a lot of interesting

[14:11] workflows that maintain the privacy of

[14:13] your data.

[14:15] And there are so many different models

[14:17] that you can use here. You can also run

[14:18] embedding models for setting up like a

[14:21] document retrieval system. You can run

[14:23] vision, thinking, tools. There's coding

[14:26] models. So, instead of paying for Claude

[14:28] Code or Codex all of the time, you could

[14:31] run a local model for coding. Qwen 3.6

[14:33] is supposed to be incredible, and it has

[14:35] a few different sizes. So, this is a

[14:37] bigger model, but there's a lot that we

[14:39] can do here. Running local models really

[14:41] opens a lot of doors. So, I highly

[14:43] recommend exploring it. Experiment with

[14:45] it yourself.

[14:47] And there you have it. You have your own

[14:49] personal private AI running locally on

[14:51] your computer with a context-tuned

[14:54] variant that can be plugged into

[14:55] different Aagentic AI tools. To help you

[14:57] understand more of the practicality of

[14:58] this, how you can use this in real-world

[15:00] situations, I'm putting together an

[15:01] Aagentic AI playlist using Hermes, but

[15:04] other tools as well, that connect to

[15:05] local models, so you can autonomously

[15:07] run tasks that hopefully make your life

[15:09] a little bit easier. So, I recommend

[15:11] checking out that playlist if you want

[15:12] to go deeper into anything that I've

[15:13] talked about today or see how you can

[15:15] expand this system. Also, let me know if

[15:17] you have any questions about what I

[15:18] talked about, as I know this can be

[15:19] confusing for new users to AI, and

[15:21] especially running local models on your

[15:23] computer. So, if you have questions or

[15:24] there's anything you'd like to see me

[15:25] work on in future videos, please let me

[15:27] know in the comments and I'm happy to

[15:28] help you out. A reminder to please like

[15:29] and subscribe if you found this video

[15:31] helpful and consider sharing with a

[15:32] friend who's perhaps AI curious but has

[15:34] been wary of the privacy and the data

[15:37] and the cost because a local model is a

[15:39] great way to get people into using AI to

[15:42] make their lives easier without dealing

[15:43] with a lot of the potential issues that

[15:45] people face with these types of tools.

[15:47] Thanks for watching and I will see you

[15:49] in the next video.

W

Wanderloots

View channel analytics →