[0:00] That just took, from start to finish, 1
[0:01] minute or so to get a local model
[0:03] running on my computer where I can now
[0:05] just have a chat with it like I would
[0:06] with ChatGPT, and it's 100% local. Every
[0:09] single message stays entirely on my
[0:11] computer. It's full privacy. Every time
[0:13] you type something into AI online,
[0:16] you're paying for it, either through
[0:17] your data, your money, or both. But what
[0:19] if you could run an AI assistant that
[0:20] knew everything in your notes, could
[0:22] answer questions about your own work,
[0:23] and that entire conversation never left
[0:25] your computer? This is what a local AI
[0:27] model gives you. It's your own private
[0:29] AI. It runs on your machine, it costs
[0:31] nothing, and no one else will ever see
[0:33] what you type. Let's get that set up.
[0:35] Hi, my name is Callum, also known as
[0:37] Water Lootz, and welcome to today's
[0:38] video on running a local model on your
[0:40] computer using Ollama. I'm using Ollama
[0:42] for today's video, but the principles
[0:44] I'm talking about apply to all different
[0:46] kinds of local model frameworks. I've
[0:48] just personally found Ollama to be the
[0:49] easiest to get set up and then connect
[0:51] to other tools like AI agents. In
[0:53] today's video, I'll go through the three
[0:54] reasons why running a local model is
[0:56] worth it, how to install Ollama and get
[0:58] your first model running, and how to
[1:00] create a customized variant of that
[1:01] model, and why agentic AI tools like
[1:03] Hermes need something like that. They
[1:05] need that customization. Anytime I talk
[1:07] about connecting to a local model, or
[1:09] using a local model, or running your own
[1:11] LLM, this is what I'm talking about. If
[1:13] you find this video helpful, please
[1:14] like, hype, and subscribe as I really
[1:16] appreciate your support a lot. If you're
[1:18] looking for more ways to support me,
[1:19] please consider joining my YouTube or
[1:20] Patreon membership where I give more
[1:22] tips, insights, and kits on the world of
[1:24] AI and knowledge. Now, let's get your
[1:26] own private AI running.
[1:29] Before we get started on actually
[1:30] running anything, installing something
[1:32] on your computer, I want to talk about
[1:33] the three reasons on why it's worth
[1:35] running your own local model so that you
[1:37] understand the rest of the video better.
[1:39] The first one is privacy. When you type
[1:41] into a cloud model on a website or
[1:42] through an API, that conversation goes
[1:44] to a server, someone else's server. With
[1:47] a local model, nothing leaves your
[1:48] machine, ever. That means you can feed
[1:50] it sensitive notes, client work,
[1:52] personal writing, whatever you want, and
[1:54] that conversation, all of that
[1:56] information and data, will only ever
[1:58] stay on your computer. It never leaves
[1:59] it. So, when you're running a local
[2:01] model, you can be confident that no one
[2:02] else is looking at what you're putting
[2:04] into that chat. The second is cost.
[2:06] Cloud AI runs on a subscription model or
[2:09] a per use building per API call. A local
[2:11] model downloads once and runs forever.
[2:13] The only cost is the electricity to run
[2:15] it, but compared to cloud services, that
[2:17] cost is negligible. Third is that it
[2:19] works offline. Once the model is
[2:21] downloaded on your machine, on your
[2:22] computer, you can use it forever from
[2:24] anywhere even if you don't have internet
[2:26] access. And I know all of that can feel
[2:27] a little abstract if you're not sure how
[2:29] you can actually begin using this local
[2:31] model. So, my personal favorite method,
[2:33] a great practical example, is through
[2:35] Obsidian or note-taking.
[2:38] If you use a note-taking app like
[2:39] Obsidian as your personal knowledge
[2:41] management system, your second brain,
[2:43] you can connect a local model directly
[2:44] to your vault, ask questions across
[2:46] thousands of your own notes, build a
[2:48] personal wiki, summarize your research,
[2:50] and that entire conversation related to
[2:52] your notes always stays private and on
[2:54] your computer. It's a way to use your
[2:56] own knowledge with AI and maintain
[2:58] privacy. A great example of extending
[3:00] that beyond your own notes is through
[3:01] something called the LLM wiki, where you
[3:03] can use an AI agent to summarize
[3:05] information and build out your own
[3:07] personal Wikipedia of information that
[3:09] you're interested in. So, if you want to
[3:10] learn more about the LLM wiki and how
[3:12] you can use it to reduce information
[3:13] overload and connect models to Obsidian,
[3:15] I recommend checking out my videos on
[3:17] that. So, that's the context on why you
[3:18] should care about running your own local
[3:20] model, but now let's actually build one.
[3:21] Let's get it set up.
[3:24] The easiest in my experience has been
[3:25] Ollama. I've been enjoying that a lot.
[3:27] And there's two different ways you can
[3:28] do it. You can go to your terminal, you
[3:29] can copy this curl command and just put
[3:31] that in, and it will download and
[3:33] install it for you. Or, you can click
[3:35] download, and then you can download it
[3:37] for your local device. I'm using a Mac,
[3:39] so I'm going to install it like that.
[3:40] Once you've downloaded Ollama, you can
[3:41] check to make sure that it has installed
[3:43] by typing in Ollama version. And we can
[3:45] see that it's currently not running, but
[3:47] it has version is 30.1. And that's it.
[3:49] We have the platform that's going to be
[3:51] able to run our local model for us. The
[3:54] next step that we need to do is download
[3:55] a model.
[3:57] So, if we go to Ollama again, we see
[3:59] models. There are a ton of models that
[4:02] you can go through. Many of them are
[4:03] good for different use cases, but a lot
[4:05] of them it significantly depends on what
[4:08] type of hardware do you have? What is
[4:10] your computer running? So, for most
[4:12] people with 8 to 16 GB of RAM, you're
[4:14] probably going to want to run something
[4:16] like Gemma 4. This is Google's latest
[4:17] open weight model. And if we scroll down
[4:19] a little bit, we can see how it's able
[4:21] to be run. You can use it in Cloud Code,
[4:23] CodeX OpenClaw Hermes CodeX
[4:26] OpenCode, whatever you want. I've
[4:27] personally been using it for Hermes a
[4:29] lot lately, so I'm going to talk about
[4:30] that a little bit more later. But, what
[4:32] I really wanted to show you is if we
[4:33] scroll down to models here, we see all
[4:35] of the different variations of the
[4:37] models that we can run. We can also see,
[4:39] importantly, how big is the model, how
[4:41] much space is it going to take up on our
[4:43] hard drive, and what is the context
[4:44] window, how much short-term working
[4:47] memory does it have? So, what's cool
[4:49] about Gemma 4 is that not only do we
[4:51] have the most powerful version, for
[4:53] example, 31 billion parameters, which
[4:55] would take up 20 gigs, and I would
[4:57] struggle to run it potentially on my 32
[5:00] GB RAM computer, but we have something
[5:02] called the E4B and the E2B models. So, E
[5:05] stands for effective parameter. So,
[5:08] basically what's happening with the
[5:09] effective parameters is that it uses
[5:11] something called per-layer embeddings,
[5:13] which lets a smaller number of
[5:15] parameters, less power, less thinking,
[5:17] that operates like it's a bigger model.
[5:19] So, in real-world behavior, the E4B
[5:22] operates similar to the 12 billion
[5:25] parameter model. And how that applies to
[5:27] practical terms is something like an
[5:28] E2B, you probably want to run on a
[5:30] low-end laptop with only maybe 4 GB of
[5:32] RAM, or on a phone. This is good for
[5:35] running on a mobile device. The E4B can
[5:37] handle something with 8 GB of RAM, which
[5:39] should be almost everyone's laptop. And
[5:41] then if you're getting into the 31,
[5:43] that's where you need a much bigger RAM.
[5:45] So, I'm able to run the 12 billion
[5:46] parameter on my 32 GB RAM, but I figured
[5:50] since most people are going to be using
[5:51] the E4B, why don't we install that one
[5:53] today? So, how do we install it?
[5:56] Well, that's where we get into the
[5:57] commands in Ollama. So, what we can do
[6:00] is we can copy what this one's called,
[6:02] the E4B, and we can go over to our
[6:04] terminal again and type in Ollama pull
[6:06] and then Gemma E4B. We click that, click
[6:09] enter, and it's pulling in the manifest,
[6:11] writing it, and installing. Great. So,
[6:13] now what we can do is we can run Ollama
[6:15] list to make sure that it works
[6:16] properly. And we can see here we have
[6:18] the Gemma 4 E4B model downloaded. So,
[6:21] this is the one that I just installed 7
[6:23] seconds ago. I also have the 12 billion
[6:25] parameter and I've got a couple others
[6:26] that I've installed previously. And
[6:28] you'll notice too that there's something
[6:29] called Gemma 4 64K. And I'm going to
[6:31] explain that in a moment because it's
[6:33] really important for certain use cases
[6:35] with the Gentic AI. So, now why don't we
[6:37] try getting this going?
[6:39] To get it started, all you need to do is
[6:41] write Ollama run Gemma 4 E4B. And we can
[6:45] see it's thinking down in the bottom
[6:47] here. And there we go. It just says send
[6:49] a message. That's it. We can see it's
[6:50] thinking, which is pretty cool. It knows
[6:52] already in its thinking process that
[6:54] it's Gemma. And here's the answer. I'm
[6:56] Gemma 4, a large language model
[6:57] developed by Google DeepMind. I'm an
[6:59] open weights model and my purpose is to
[7:00] assist you. How can I help you? So,
[7:02] that's pretty cool. That just took from
[7:04] start to finish maybe 1 minute or so to
[7:07] get a local model running on my computer
[7:09] where I can now just have a chat with it
[7:10] like I would with ChatGPT. And it's 100%
[7:13] local. Every single message I send to
[7:15] Gemma 4 stays entirely on my computer.
[7:18] It's full privacy.
[7:21] And just very quickly, if you're looking
[7:22] for an easier way to chat with your
[7:23] local models, you can also use the
[7:25] Ollama desktop app, which is what I
[7:27] downloaded to install the command line
[7:30] interface in terminal before, what we
[7:31] were using. So, if you want to have
[7:33] conversations and then you go back and
[7:34] continue the conversation, you can
[7:36] always switch the model like we did
[7:37] here. And those conversations will stay
[7:39] on the side stored on your computer in
[7:40] the same way that you would use
[7:42] something like chat GPT or cloud code,
[7:44] but all of this is happening 100%
[7:45] locally if I've selected a model that
[7:47] I've downloaded here.
[7:50] But that's just the beginning of using
[7:51] Ollama. Yeah, you can talk with it here
[7:53] in chat, but that's not really what I
[7:55] want to use it for. I want to be able to
[7:56] connect it to another tool so that it
[7:58] becomes more powerful and whenever I
[8:00] want my local agent, for example,
[8:02] something like my Hermes agent to be
[8:04] able to run and do things for me and
[8:05] connect locally to my running model in
[8:08] Ollama, I need to do a couple more
[8:10] things. So the first thing we can do
[8:11] here is end the conversation by typing
[8:13] {slash} bye. And next, what I want to
[8:15] show you is how we can go back to these
[8:17] models here and we can create what's
[8:18] called a variant.
[8:21] So a variant is basically just a
[8:23] configuration, a modified version of the
[8:25] one that we already have here, but it
[8:27] doesn't duplicate it. It doesn't take up
[8:28] another 10 gigs on your hard drive. It's
[8:30] just a different way of launching the
[8:32] same underlying model. So, for example,
[8:34] if I type in Ollama PS, we can see that
[8:37] we have Gemma 4 E4B. It's 3.3 gigs,
[8:40] operating 100% in my GPU, but its
[8:42] context window is 32,768.
[8:45] So, that's where we start to get into
[8:47] potentially a problem depending on your
[8:49] use case. For a lot of things, this is
[8:50] totally fine, but for example, if I go
[8:53] over to Hermes and we take a look at
[8:54] running uh local model inside of Hermes
[8:57] agent, we can see that a lot of Ollama
[9:00] models use a 2048 token context
[9:02] limitation and Hermes requires 64,000 to
[9:06] give your agent tools. So, it even shows
[9:09] you exactly how to get this set up
[9:10] inside of Hermes, but I'm going to use
[9:12] something a little bit easier.
[9:13] Basically, what we need to do is we need
[9:15] to create a model file that extends the
[9:16] context. So, we need to create a little
[9:18] temporary file, pull it from the Gemma
[9:21] model that we're using, establish the
[9:23] context that we want, and then get it to
[9:24] create the new model based on that
[9:27] modified one. So, let's just start a new
[9:29] terminal here so we can paste in this
[9:30] string, which I will give you in the
[9:32] description or a link to an article that
[9:34] has it, and we are pulling in Gemma 4
[9:36] E4B, which is the model that we just
[9:38] downloaded, and we're specifying that we
[9:40] want the context to be 64,000. So, it's
[9:43] creating a little variant file. You can
[9:45] say, "Okay, here we go. Gathering model
[9:46] components using existing layer.
[9:48] Success." And now if we go Ollama list,
[9:51] we can see we have Gemma 4E 64K variant.
[9:54] So, we just created this new model 8
[9:56] seconds ago, and now if we run it, and
[9:58] then we type in Ollama PS, we can see
[10:00] that now it's pulling in this model, and
[10:02] we have the context of 64,000. Instead
[10:05] of 3.3 gigs, it's now 3.4, so it's using
[10:07] up a little bit more space, but we've
[10:09] doubled the amount of working memory
[10:11] that the model is able to use running
[10:13] inside of Ollama.
[10:15] And you can also control the context
[10:17] window at the system level in Ollama. If
[10:19] we go up to settings here, and we can
[10:20] see that we have this context length.
[10:23] So, I can move this up to 64, and then
[10:25] whenever it launches a new model, it
[10:27] will always use the 64,000. So, we don't
[10:29] have to then create that variant
[10:31] manually like I did, but this will force
[10:34] everyone of your models to run at
[10:36] 64,000. So, if you only want to change
[10:38] the one model, then you would do it
[10:40] using the method I showed you in the
[10:41] terminal. But this is a pretty easy way
[10:43] to get all of your Ollama models ready
[10:45] to go with AI agents. So, that becomes
[10:47] really important because now we can
[10:49] connect it to something like Hermes, and
[10:51] we're able to actually use it. Which if
[10:52] you remember here, it says that it
[10:53] requires 64,000. But now that we have
[10:56] the model in the format that we want,
[10:58] how do we actually connect it to a tool
[11:00] like Hermes or Claude Code or Codex?
[11:02] This is where we get into what's called
[11:04] an endpoint.
[11:06] So, I'm going to use Hermes just as an
[11:08] example here because this is what I've
[11:09] been exploring lately, but you can use
[11:10] this with any type of system you want.
[11:12] But basically what we need to do is,
[11:14] rather than for example pointing it to
[11:16] ChatGPT, we need to point Hermes or
[11:18] Claude Code or Codex to the local model
[11:21] that we have sitting on our computer
[11:23] right here. So, this is where we get
[11:24] into what's called an endpoint, and this
[11:26] is what it looks like. It's just a
[11:27] string of numbers that is an address
[11:29] based on what's running or can run
[11:31] locally on our computer. This is an
[11:33] Ollama native endpoint, and then if we
[11:36] add a V1, it makes it compatible with
[11:38] what's called an OpenAI endpoint. So,
[11:40] that's what a lot of platforms use. So,
[11:42] for example, if we go into Hermes, I'm
[11:44] going to go to my Gemma profile, and I
[11:46] can go to models, and I can go down to
[11:48] custom endpoint and click set up custom
[11:50] endpoint, and this is where, like I was
[11:52] talking about it, even includes an
[11:53] example that's very similar. All I have
[11:55] to do is type in that string, that
[11:57] address that I was showing you, and then
[11:58] we click connect. We now have a custom
[12:00] endpoint connected here. If I go back
[12:02] down to the models, we now see that we
[12:05] have this model right here, Gemma 4e 64k
[12:08] latest. So, that's the same model that I
[12:10] just created the variant of right here.
[12:12] So, now if I go and want to have a
[12:13] conversation and I type in, "Hi, who are
[12:15] you?"
[12:17] It might take a moment to run. If you
[12:18] remember earlier when I clicked run on
[12:21] Ollama, it had that little spinning icon
[12:23] for a moment. So, right now what Hermes
[12:24] is doing is it's spawning a little agent
[12:27] inside of Ollama, a locally running
[12:29] model. And then once I get that started
[12:32] up, it's warmed up, I'll be able to have
[12:33] a conversation, and it can run for a few
[12:36] minutes before it goes back to sleep,
[12:38] which I'll explain in a moment. Okay,
[12:39] there we go. So, the model just woke up.
[12:40] It's analyzing the request, and we can
[12:42] see it's thinking right here. But this
[12:44] time, rather than just saying, "Hi, I'm
[12:46] Gemma." Instead, we're running it
[12:48] through Hermes. So, now it recognizes
[12:51] that it's a Hermes agent running the
[12:53] Gemma 4 model. So, that's pretty cool.
[12:54] So, we have as many different options as
[12:56] we want. Once we have the models
[12:57] installed on our computer, we can then
[12:59] have different agents harness the power
[13:02] of that model. That's why this is called
[13:03] an agent harness. You could use Claude
[13:05] code, you could use Codex, you could use
[13:07] Open Claw, you could use Hermes. This is
[13:09] kind of the beginning of fully private,
[13:12] locally run AI. And what's nice too is
[13:14] with an agent like this, you could have
[13:15] it running 24/7, but in order to do
[13:17] that, we would need to keep the model
[13:18] loaded, and this is where we can modify
[13:20] some more settings of Ollama so that it
[13:22] doesn't unload after a few minutes. So,
[13:24] you remember it took a few seconds there
[13:26] to get running. We could have it be
[13:28] always on ready to go. And what's cool,
[13:30] too, is if you have a more powerful
[13:31] system, like a desktop computer with
[13:33] bigger RAM, for example, I can
[13:35] potentially run a bigger parameter model
[13:37] on my desktop and then connect to it
[13:39] from my laptop, so I can leverage the
[13:41] power of a local model running on one
[13:43] computer, but then access it from
[13:45] another one, which I have another video
[13:46] that explains how to do that with
[13:47] Hermes, if you're interested.
[13:50] Like I mentioned earlier, one of the
[13:52] reasons I'm most excited to run a local
[13:53] model is because I can manage my
[13:55] information in something like Obsidian,
[13:57] and I can potentially have sensitive
[13:58] information sitting inside of my
[14:00] Obsidian vault, and I'm only running a
[14:02] local model that's accessing that
[14:03] information, so it's not being sent to a
[14:05] cloud provider. So, once you begin
[14:07] working with local models, it really
[14:08] opens the door to a lot of interesting
[14:11] workflows that maintain the privacy of
[14:13] your data.
[14:15] And there are so many different models
[14:17] that you can use here. You can also run
[14:18] embedding models for setting up like a
[14:21] document retrieval system. You can run
[14:23] vision, thinking, tools. There's coding
[14:26] models. So, instead of paying for Claude
[14:28] Code or Codex all of the time, you could
[14:31] run a local model for coding. Qwen 3.6
[14:33] is supposed to be incredible, and it has
[14:35] a few different sizes. So, this is a
[14:37] bigger model, but there's a lot that we
[14:39] can do here. Running local models really
[14:41] opens a lot of doors. So, I highly
[14:43] recommend exploring it. Experiment with
[14:45] it yourself.
[14:47] And there you have it. You have your own
[14:49] personal private AI running locally on
[14:51] your computer with a context-tuned
[14:54] variant that can be plugged into
[14:55] different Aagentic AI tools. To help you
[14:57] understand more of the practicality of
[14:58] this, how you can use this in real-world
[15:00] situations, I'm putting together an
[15:01] Aagentic AI playlist using Hermes, but
[15:04] other tools as well, that connect to
[15:05] local models, so you can autonomously
[15:07] run tasks that hopefully make your life
[15:09] a little bit easier. So, I recommend
[15:11] checking out that playlist if you want
[15:12] to go deeper into anything that I've
[15:13] talked about today or see how you can
[15:15] expand this system. Also, let me know if
[15:17] you have any questions about what I
[15:18] talked about, as I know this can be
[15:19] confusing for new users to AI, and
[15:21] especially running local models on your
[15:23] computer. So, if you have questions or
[15:24] there's anything you'd like to see me
[15:25] work on in future videos, please let me
[15:27] know in the comments and I'm happy to
[15:28] help you out. A reminder to please like
[15:29] and subscribe if you found this video
[15:31] helpful and consider sharing with a
[15:32] friend who's perhaps AI curious but has
[15:34] been wary of the privacy and the data
[15:37] and the cost because a local model is a
[15:39] great way to get people into using AI to
[15:42] make their lives easier without dealing
[15:43] with a lot of the potential issues that
[15:45] people face with these types of tools.
[15:47] Thanks for watching and I will see you
[15:49] in the next video.