TubeSum ← Transcribe a video

Run Your Own Agentic AI? πŸ¦™ Full Ollama Setup + Hermes Workflow

0h 15m video Transcribed Jun 30, 2026 W Wanderloots
24.0K
Views
1.1K
Likes
55
Comments
11
Dislikes
4.6%
πŸ”₯ High Engagement

βœ‚οΈ Creator Tools: Viral Hooks

AI-generated clip ideas for Shorts based on the transcript

Run AI 100% Locally in 1 Minute

33s

The quick demonstration of setting up a private, local AI in under a minute is both surprising and empowering, appealing to privacy-conscious users.

β–Ά Play Clip

3 Reasons to Switch to Local AI

35s

The clear, relatable breakdown of privacy, cost, and offline benefits taps into widespread concerns about data security and subscription fatigue.

β–Ά Play Clip

Install Ollama in Seconds

37s

Showing a simple terminal command to install a powerful AI framework makes the process feel accessible and demystifies local AI for beginners.

β–Ά Play Clip

Chat with Your Own Private AI

41s

The live demo of chatting with a locally running model that works just like ChatGPT creates immediate 'I want this' engagement.

β–Ά Play Clip

Boost AI Context to 64K for Agents

50s

Solving a common frustration (limited context) with a simple terminal trick appeals to advanced users and highlights a key advantage for agentic AI workflows.

β–Ά Play Clip

[00:00] That just took, from start to finish, 1

[00:01] minute or so to get a local model

[00:03] running on my computer where I can now

[00:05] just have a chat with it like I would

[00:06] with ChatGPT, and it's 100% local. Every

[00:09] single message stays entirely on my

[00:11] computer. It's full privacy. Every time

[00:13] you type something into AI online,

[00:16] you're paying for it, either through

[00:17] your data, your money, or both. But what

[00:19] if you could run an AI assistant that

[00:20] knew everything in your notes, could

[00:22] answer questions about your own work,

[00:23] and that entire conversation never left

[00:25] your computer? This is what a local AI

[00:27] model gives you. It's your own private

[00:29] AI. It runs on your machine, it costs

[00:31] nothing, and no one else will ever see

[00:33] what you type. Let's get that set up.

[00:35] Hi, my name is Callum, also known as

[00:37] Water Lootz, and welcome to today's

[00:38] video on running a local model on your

[00:40] computer using Ollama. I'm using Ollama

[00:42] for today's video, but the principles

[00:44] I'm talking about apply to all different

[00:46] kinds of local model frameworks. I've

[00:48] just personally found Ollama to be the

[00:49] easiest to get set up and then connect

[00:51] to other tools like AI agents. In

[00:53] today's video, I'll go through the three

[00:54] reasons why running a local model is

[00:56] worth it, how to install Ollama and get

[00:58] your first model running, and how to

[01:00] create a customized variant of that

[01:01] model, and why agentic AI tools like

[01:03] Hermes need something like that. They

[01:05] need that customization. Anytime I talk

[01:07] about connecting to a local model, or

[01:09] using a local model, or running your own

[01:11] LLM, this is what I'm talking about. If

[01:13] you find this video helpful, please

[01:14] like, hype, and subscribe as I really

[01:16] appreciate your support a lot. If you're

[01:18] looking for more ways to support me,

[01:19] please consider joining my YouTube or

[01:20] Patreon membership where I give more

[01:22] tips, insights, and kits on the world of

[01:24] AI and knowledge. Now, let's get your

[01:26] own private AI running.

[01:29] Before we get started on actually

[01:30] running anything, installing something

[01:32] on your computer, I want to talk about

[01:33] the three reasons on why it's worth

[01:35] running your own local model so that you

[01:37] understand the rest of the video better.

[01:39] The first one is privacy. When you type

[01:41] into a cloud model on a website or

[01:42] through an API, that conversation goes

[01:44] to a server, someone else's server. With

[01:47] a local model, nothing leaves your

[01:48] machine, ever. That means you can feed

[01:50] it sensitive notes, client work,

[01:52] personal writing, whatever you want, and

[01:54] that conversation, all of that

[01:56] information and data, will only ever

[01:58] stay on your computer. It never leaves

[01:59] it. So, when you're running a local

[02:01] model, you can be confident that no one

[02:02] else is looking at what you're putting

[02:04] into that chat. The second is cost.

[02:06] Cloud AI runs on a subscription model or

[02:09] a per use building per API call. A local

[02:11] model downloads once and runs forever.

[02:13] The only cost is the electricity to run

[02:15] it, but compared to cloud services, that

[02:17] cost is negligible. Third is that it

[02:19] works offline. Once the model is

[02:21] downloaded on your machine, on your

[02:22] computer, you can use it forever from

[02:24] anywhere even if you don't have internet

[02:26] access. And I know all of that can feel

[02:27] a little abstract if you're not sure how

[02:29] you can actually begin using this local

[02:31] model. So, my personal favorite method,

[02:33] a great practical example, is through

[02:35] Obsidian or note-taking.

[02:38] If you use a note-taking app like

[02:39] Obsidian as your personal knowledge

[02:41] management system, your second brain,

[02:43] you can connect a local model directly

[02:44] to your vault, ask questions across

[02:46] thousands of your own notes, build a

[02:48] personal wiki, summarize your research,

[02:50] and that entire conversation related to

[02:52] your notes always stays private and on

[02:54] your computer. It's a way to use your

[02:56] own knowledge with AI and maintain

[02:58] privacy. A great example of extending

[03:00] that beyond your own notes is through

[03:01] something called the LLM wiki, where you

[03:03] can use an AI agent to summarize

[03:05] information and build out your own

[03:07] personal Wikipedia of information that

[03:09] you're interested in. So, if you want to

[03:10] learn more about the LLM wiki and how

[03:12] you can use it to reduce information

[03:13] overload and connect models to Obsidian,

[03:15] I recommend checking out my videos on

[03:17] that. So, that's the context on why you

[03:18] should care about running your own local

[03:20] model, but now let's actually build one.

[03:21] Let's get it set up.

[03:24] The easiest in my experience has been

[03:25] Ollama. I've been enjoying that a lot.

[03:27] And there's two different ways you can

[03:28] do it. You can go to your terminal, you

[03:29] can copy this curl command and just put

[03:31] that in, and it will download and

[03:33] install it for you. Or, you can click

[03:35] download, and then you can download it

[03:37] for your local device. I'm using a Mac,

[03:39] so I'm going to install it like that.

[03:40] Once you've downloaded Ollama, you can

[03:41] check to make sure that it has installed

[03:43] by typing in Ollama version. And we can

[03:45] see that it's currently not running, but

[03:47] it has version is 30.1. And that's it.

[03:49] We have the platform that's going to be

[03:51] able to run our local model for us. The

[03:54] next step that we need to do is download

[03:55] a model.

[03:57] So, if we go to Ollama again, we see

[03:59] models. There are a ton of models that

[04:02] you can go through. Many of them are

[04:03] good for different use cases, but a lot

[04:05] of them it significantly depends on what

[04:08] type of hardware do you have? What is

[04:10] your computer running? So, for most

[04:12] people with 8 to 16 GB of RAM, you're

[04:14] probably going to want to run something

[04:16] like Gemma 4. This is Google's latest

[04:17] open weight model. And if we scroll down

[04:19] a little bit, we can see how it's able

[04:21] to be run. You can use it in Cloud Code,

[04:23] CodeX OpenClaw Hermes CodeX

[04:26] OpenCode, whatever you want. I've

[04:27] personally been using it for Hermes a

[04:29] lot lately, so I'm going to talk about

[04:30] that a little bit more later. But, what

[04:32] I really wanted to show you is if we

[04:33] scroll down to models here, we see all

[04:35] of the different variations of the

[04:37] models that we can run. We can also see,

[04:39] importantly, how big is the model, how

[04:41] much space is it going to take up on our

[04:43] hard drive, and what is the context

[04:44] window, how much short-term working

[04:47] memory does it have? So, what's cool

[04:49] about Gemma 4 is that not only do we

[04:51] have the most powerful version, for

[04:53] example, 31 billion parameters, which

[04:55] would take up 20 gigs, and I would

[04:57] struggle to run it potentially on my 32

[05:00] GB RAM computer, but we have something

[05:02] called the E4B and the E2B models. So, E

[05:05] stands for effective parameter. So,

[05:08] basically what's happening with the

[05:09] effective parameters is that it uses

[05:11] something called per-layer embeddings,

[05:13] which lets a smaller number of

[05:15] parameters, less power, less thinking,

[05:17] that operates like it's a bigger model.

[05:19] So, in real-world behavior, the E4B

[05:22] operates similar to the 12 billion

[05:25] parameter model. And how that applies to

[05:27] practical terms is something like an

[05:28] E2B, you probably want to run on a

[05:30] low-end laptop with only maybe 4 GB of

[05:32] RAM, or on a phone. This is good for

[05:35] running on a mobile device. The E4B can

[05:37] handle something with 8 GB of RAM, which

[05:39] should be almost everyone's laptop. And

[05:41] then if you're getting into the 31,

[05:43] that's where you need a much bigger RAM.

[05:45] So, I'm able to run the 12 billion

[05:46] parameter on my 32 GB RAM, but I figured

[05:50] since most people are going to be using

[05:51] the E4B, why don't we install that one

[05:53] today? So, how do we install it?

[05:56] Well, that's where we get into the

[05:57] commands in Ollama. So, what we can do

[06:00] is we can copy what this one's called,

[06:02] the E4B, and we can go over to our

[06:04] terminal again and type in Ollama pull

[06:06] and then Gemma E4B. We click that, click

[06:09] enter, and it's pulling in the manifest,

[06:11] writing it, and installing. Great. So,

[06:13] now what we can do is we can run Ollama

[06:15] list to make sure that it works

[06:16] properly. And we can see here we have

[06:18] the Gemma 4 E4B model downloaded. So,

[06:21] this is the one that I just installed 7

[06:23] seconds ago. I also have the 12 billion

[06:25] parameter and I've got a couple others

[06:26] that I've installed previously. And

[06:28] you'll notice too that there's something

[06:29] called Gemma 4 64K. And I'm going to

[06:31] explain that in a moment because it's

[06:33] really important for certain use cases

[06:35] with the Gentic AI. So, now why don't we

[06:37] try getting this going?

[06:39] To get it started, all you need to do is

[06:41] write Ollama run Gemma 4 E4B. And we can

[06:45] see it's thinking down in the bottom

[06:47] here. And there we go. It just says send

[06:49] a message. That's it. We can see it's

[06:50] thinking, which is pretty cool. It knows

[06:52] already in its thinking process that

[06:54] it's Gemma. And here's the answer. I'm

[06:56] Gemma 4, a large language model

[06:57] developed by Google DeepMind. I'm an

[06:59] open weights model and my purpose is to

[07:00] assist you. How can I help you? So,

[07:02] that's pretty cool. That just took from

[07:04] start to finish maybe 1 minute or so to

[07:07] get a local model running on my computer

[07:09] where I can now just have a chat with it

[07:10] like I would with ChatGPT. And it's 100%

[07:13] local. Every single message I send to

[07:15] Gemma 4 stays entirely on my computer.

[07:18] It's full privacy.

[07:21] And just very quickly, if you're looking

[07:22] for an easier way to chat with your

[07:23] local models, you can also use the

[07:25] Ollama desktop app, which is what I

[07:27] downloaded to install the command line

[07:30] interface in terminal before, what we

[07:31] were using. So, if you want to have

[07:33] conversations and then you go back and

[07:34] continue the conversation, you can

[07:36] always switch the model like we did

[07:37] here. And those conversations will stay

[07:39] on the side stored on your computer in

[07:40] the same way that you would use

[07:42] something like chat GPT or cloud code,

[07:44] but all of this is happening 100%

[07:45] locally if I've selected a model that

[07:47] I've downloaded here.

[07:50] But that's just the beginning of using

[07:51] Ollama. Yeah, you can talk with it here

[07:53] in chat, but that's not really what I

[07:55] want to use it for. I want to be able to

[07:56] connect it to another tool so that it

[07:58] becomes more powerful and whenever I

[08:00] want my local agent, for example,

[08:02] something like my Hermes agent to be

[08:04] able to run and do things for me and

[08:05] connect locally to my running model in

[08:08] Ollama, I need to do a couple more

[08:10] things. So the first thing we can do

[08:11] here is end the conversation by typing

[08:13] {slash} bye. And next, what I want to

[08:15] show you is how we can go back to these

[08:17] models here and we can create what's

[08:18] called a variant.

[08:21] So a variant is basically just a

[08:23] configuration, a modified version of the

[08:25] one that we already have here, but it

[08:27] doesn't duplicate it. It doesn't take up

[08:28] another 10 gigs on your hard drive. It's

[08:30] just a different way of launching the

[08:32] same underlying model. So, for example,

[08:34] if I type in Ollama PS, we can see that

[08:37] we have Gemma 4 E4B. It's 3.3 gigs,

[08:40] operating 100% in my GPU, but its

[08:42] context window is 32,768.

[08:45] So, that's where we start to get into

[08:47] potentially a problem depending on your

[08:49] use case. For a lot of things, this is

[08:50] totally fine, but for example, if I go

[08:53] over to Hermes and we take a look at

[08:54] running uh local model inside of Hermes

[08:57] agent, we can see that a lot of Ollama

[09:00] models use a 2048 token context

[09:02] limitation and Hermes requires 64,000 to

[09:06] give your agent tools. So, it even shows

[09:09] you exactly how to get this set up

[09:10] inside of Hermes, but I'm going to use

[09:12] something a little bit easier.

[09:13] Basically, what we need to do is we need

[09:15] to create a model file that extends the

[09:16] context. So, we need to create a little

[09:18] temporary file, pull it from the Gemma

[09:21] model that we're using, establish the

[09:23] context that we want, and then get it to

[09:24] create the new model based on that

[09:27] modified one. So, let's just start a new

[09:29] terminal here so we can paste in this

[09:30] string, which I will give you in the

[09:32] description or a link to an article that

[09:34] has it, and we are pulling in Gemma 4

[09:36] E4B, which is the model that we just

[09:38] downloaded, and we're specifying that we

[09:40] want the context to be 64,000. So, it's

[09:43] creating a little variant file. You can

[09:45] say, "Okay, here we go. Gathering model

[09:46] components using existing layer.

[09:48] Success." And now if we go Ollama list,

[09:51] we can see we have Gemma 4E 64K variant.

[09:54] So, we just created this new model 8

[09:56] seconds ago, and now if we run it, and

[09:58] then we type in Ollama PS, we can see

[10:00] that now it's pulling in this model, and

[10:02] we have the context of 64,000. Instead

[10:05] of 3.3 gigs, it's now 3.4, so it's using

[10:07] up a little bit more space, but we've

[10:09] doubled the amount of working memory

[10:11] that the model is able to use running

[10:13] inside of Ollama.

[10:15] And you can also control the context

[10:17] window at the system level in Ollama. If

[10:19] we go up to settings here, and we can

[10:20] see that we have this context length.

[10:23] So, I can move this up to 64, and then

[10:25] whenever it launches a new model, it

[10:27] will always use the 64,000. So, we don't

[10:29] have to then create that variant

[10:31] manually like I did, but this will force

[10:34] everyone of your models to run at

[10:36] 64,000. So, if you only want to change

[10:38] the one model, then you would do it

[10:40] using the method I showed you in the

[10:41] terminal. But this is a pretty easy way

[10:43] to get all of your Ollama models ready

[10:45] to go with AI agents. So, that becomes

[10:47] really important because now we can

[10:49] connect it to something like Hermes, and

[10:51] we're able to actually use it. Which if

[10:52] you remember here, it says that it

[10:53] requires 64,000. But now that we have

[10:56] the model in the format that we want,

[10:58] how do we actually connect it to a tool

[11:00] like Hermes or Claude Code or Codex?

[11:02] This is where we get into what's called

[11:04] an endpoint.

[11:06] So, I'm going to use Hermes just as an

[11:08] example here because this is what I've

[11:09] been exploring lately, but you can use

[11:10] this with any type of system you want.

[11:12] But basically what we need to do is,

[11:14] rather than for example pointing it to

[11:16] ChatGPT, we need to point Hermes or

[11:18] Claude Code or Codex to the local model

[11:21] that we have sitting on our computer

[11:23] right here. So, this is where we get

[11:24] into what's called an endpoint, and this

[11:26] is what it looks like. It's just a

[11:27] string of numbers that is an address

[11:29] based on what's running or can run

[11:31] locally on our computer. This is an

[11:33] Ollama native endpoint, and then if we

[11:36] add a V1, it makes it compatible with

[11:38] what's called an OpenAI endpoint. So,

[11:40] that's what a lot of platforms use. So,

[11:42] for example, if we go into Hermes, I'm

[11:44] going to go to my Gemma profile, and I

[11:46] can go to models, and I can go down to

[11:48] custom endpoint and click set up custom

[11:50] endpoint, and this is where, like I was

[11:52] talking about it, even includes an

[11:53] example that's very similar. All I have

[11:55] to do is type in that string, that

[11:57] address that I was showing you, and then

[11:58] we click connect. We now have a custom

[12:00] endpoint connected here. If I go back

[12:02] down to the models, we now see that we

[12:05] have this model right here, Gemma 4e 64k

[12:08] latest. So, that's the same model that I

[12:10] just created the variant of right here.

[12:12] So, now if I go and want to have a

[12:13] conversation and I type in, "Hi, who are

[12:15] you?"

[12:17] It might take a moment to run. If you

[12:18] remember earlier when I clicked run on

[12:21] Ollama, it had that little spinning icon

[12:23] for a moment. So, right now what Hermes

[12:24] is doing is it's spawning a little agent

[12:27] inside of Ollama, a locally running

[12:29] model. And then once I get that started

[12:32] up, it's warmed up, I'll be able to have

[12:33] a conversation, and it can run for a few

[12:36] minutes before it goes back to sleep,

[12:38] which I'll explain in a moment. Okay,

[12:39] there we go. So, the model just woke up.

[12:40] It's analyzing the request, and we can

[12:42] see it's thinking right here. But this

[12:44] time, rather than just saying, "Hi, I'm

[12:46] Gemma." Instead, we're running it

[12:48] through Hermes. So, now it recognizes

[12:51] that it's a Hermes agent running the

[12:53] Gemma 4 model. So, that's pretty cool.

[12:54] So, we have as many different options as

[12:56] we want. Once we have the models

[12:57] installed on our computer, we can then

[12:59] have different agents harness the power

[13:02] of that model. That's why this is called

[13:03] an agent harness. You could use Claude

[13:05] code, you could use Codex, you could use

[13:07] Open Claw, you could use Hermes. This is

[13:09] kind of the beginning of fully private,

[13:12] locally run AI. And what's nice too is

[13:14] with an agent like this, you could have

[13:15] it running 24/7, but in order to do

[13:17] that, we would need to keep the model

[13:18] loaded, and this is where we can modify

[13:20] some more settings of Ollama so that it

[13:22] doesn't unload after a few minutes. So,

[13:24] you remember it took a few seconds there

[13:26] to get running. We could have it be

[13:28] always on ready to go. And what's cool,

[13:30] too, is if you have a more powerful

[13:31] system, like a desktop computer with

[13:33] bigger RAM, for example, I can

[13:35] potentially run a bigger parameter model

[13:37] on my desktop and then connect to it

[13:39] from my laptop, so I can leverage the

[13:41] power of a local model running on one

[13:43] computer, but then access it from

[13:45] another one, which I have another video

[13:46] that explains how to do that with

[13:47] Hermes, if you're interested.

[13:50] Like I mentioned earlier, one of the

[13:52] reasons I'm most excited to run a local

[13:53] model is because I can manage my

[13:55] information in something like Obsidian,

[13:57] and I can potentially have sensitive

[13:58] information sitting inside of my

[14:00] Obsidian vault, and I'm only running a

[14:02] local model that's accessing that

[14:03] information, so it's not being sent to a

[14:05] cloud provider. So, once you begin

[14:07] working with local models, it really

[14:08] opens the door to a lot of interesting

[14:11] workflows that maintain the privacy of

[14:13] your data.

[14:15] And there are so many different models

[14:17] that you can use here. You can also run

[14:18] embedding models for setting up like a

[14:21] document retrieval system. You can run

[14:23] vision, thinking, tools. There's coding

[14:26] models. So, instead of paying for Claude

[14:28] Code or Codex all of the time, you could

[14:31] run a local model for coding. Qwen 3.6

[14:33] is supposed to be incredible, and it has

[14:35] a few different sizes. So, this is a

[14:37] bigger model, but there's a lot that we

[14:39] can do here. Running local models really

[14:41] opens a lot of doors. So, I highly

[14:43] recommend exploring it. Experiment with

[14:45] it yourself.

[14:47] And there you have it. You have your own

[14:49] personal private AI running locally on

[14:51] your computer with a context-tuned

[14:54] variant that can be plugged into

[14:55] different Aagentic AI tools. To help you

[14:57] understand more of the practicality of

[14:58] this, how you can use this in real-world

[15:00] situations, I'm putting together an

[15:01] Aagentic AI playlist using Hermes, but

[15:04] other tools as well, that connect to

[15:05] local models, so you can autonomously

[15:07] run tasks that hopefully make your life

[15:09] a little bit easier. So, I recommend

[15:11] checking out that playlist if you want

[15:12] to go deeper into anything that I've

[15:13] talked about today or see how you can

[15:15] expand this system. Also, let me know if

[15:17] you have any questions about what I

[15:18] talked about, as I know this can be

[15:19] confusing for new users to AI, and

[15:21] especially running local models on your

[15:23] computer. So, if you have questions or

[15:24] there's anything you'd like to see me

[15:25] work on in future videos, please let me

[15:27] know in the comments and I'm happy to

[15:28] help you out. A reminder to please like

[15:29] and subscribe if you found this video

[15:31] helpful and consider sharing with a

[15:32] friend who's perhaps AI curious but has

[15:34] been wary of the privacy and the data

[15:37] and the cost because a local model is a

[15:39] great way to get people into using AI to

[15:42] make their lives easier without dealing

[15:43] with a lot of the potential issues that

[15:45] people face with these types of tools.

[15:47] Thanks for watching and I will see you

[15:49] in the next video.

⚑ Saved you 0h 15m reading this? Transcribe any YouTube video for free β€” no signup needed.