[0:00] If you're not running LLMs locally, then [0:02] you're missing out. Sure, Chat GBT and [0:05] other hosted solutions are great, but if [0:07] you care about speed, privacy, and cost, [0:10] then you'll want to learn how to run [0:11] them on your own machine. That's why in [0:14] this video, I'll show you two methods of [0:16] running LLMs locally from a developer [0:18] perspective. I'll show you how to [0:20] download various models, run them [0:22] interactively, and integrate them into [0:24] your code to replace something like the [0:26] OpenAI API. Now, the two methods that [0:29] I'm going to show you are O Lama and [0:30] Docker Model Runner. Both are free, [0:33] highly performant, and you'll learn the [0:35] differences between them as I go through [0:36] this video. Let's get started. So, the [0:39] first method to show you is Olama. Now, [0:41] Olama is a popular open- source software [0:44] that allows you to download and run [0:46] models locally on your own computer. [0:48] Now, the models that you can run do [0:49] depend on the hardware and the specs of [0:51] that computer, but generally speaking, [0:54] you can run models, you can manage them, [0:56] delete them, etc., and then access those [0:58] same models from code. So, let me give [1:00] you a quick overview into how Olama [1:02] works. Let's get started. So, first [1:04] things first, go to ola.com and simply [1:07] download O Lama for Mac, Windows, or [1:10] Linux. Once it's downloaded, what you'll [1:12] want to do is make sure it's running. [1:13] So, you can go to something like your [1:15] Spotlight search, search for O Lama, and [1:17] simply run it and just make sure it's [1:19] loaded on your computer. If you're on [1:21] Windows, usually you can see if it's [1:22] running by looking in the bottom right [1:24] hand corner in kind of the activity bar [1:26] and seeing if that Lama icon is [1:28] appearing. Same thing, you can also kind [1:30] of go to the explorer search, search for [1:31] Olama, just double press it. Make sure [1:33] it's running on your computer. Now, once [1:35] Olama is running, what you're going to [1:36] want to do is head over to your [1:38] terminal. So, open up a terminal or a [1:40] command prompt and simply type in the [1:42] command Olama. If you do that and the [1:44] command works, it means running and [1:46] you're good to go. Now from here it's [1:48] very straightforward on how to use this [1:49] tool but what you're able to do is [1:51] download and manage various different [1:53] models. So before you can start using [1:55] this from code you need to actually pull [1:57] a model. So the command to pull a model [2:00] is lama pull and then the name of the [2:02] model. Now if you're looking for the [2:04] names of the models you can find them [2:06] from this Olama kind of search page [2:08] which I'll link in the description. [2:09] You'll see there's a ton of different [2:11] options that you can download, right? [2:12] You can go through all of them and if [2:14] you want to see which ones you're [2:15] actually able to run, you can click into [2:17] one of them, for example, and you'll see [2:19] the size of these various different [2:21] models. Now, for extremely large models, [2:24] you're going to want to have a GPU and a [2:25] lot of RAM on your computer. And if [2:27] you're not sure, you can go to something [2:28] like Chat GBT and just ask it based on [2:31] the specs of your machine if you would [2:32] be able to run this model. So, I suggest [2:34] when you're testing out just to go with [2:36] a small one. So, what I'm going to do is [2:37] just search for one called Small M2. [2:40] This is a pretty small model that should [2:42] run locally. Whether you have a GPU, a [2:44] lot of RAM, CPU, doesn't matter. And you [2:46] can see that in order to actually run it [2:48] or to pull it, use a llama run and the [2:50] name of the model, which is this right [2:51] here. So, what I'll do is I'll just copy [2:53] this name here of the one that's only [2:54] 271 megabytes. I'm going to go to my [2:57] terminal. I'm going to paste it. So, a [2:59] llama pull small m2 col 135m. When I do [3:04] that, it's going to start downloading [3:05] the model to my machine. Once that's [3:07] finished, I'll be right back and we can [3:09] continue. So, the model has just [3:10] finished downloading here. And now, if I [3:12] want to start using it, what I can do is [3:14] I can start by typing LS. That's going [3:17] to show me a list of all of the [3:19] different models that I have on my [3:20] computer. You can see I have a bunch of [3:21] them here, a lot of them that I was [3:22] testing with many months ago. And if I [3:24] want to run a particular model, then I [3:27] can type run and then simply specify the [3:30] name of the model that you see here. So, [3:32] I'm going to go small M2 col135M [3:36] and then run that. And you'll see that [3:37] now it brings me into an interactive [3:39] chat where I can start typing directly [3:41] with the model. And you'll notice that [3:43] it's quite fast because it's running on [3:45] my own computer and we have no network [3:46] latency. So I'll say what is let's do [3:49] this. The capital of Canada for example [3:53] and you can see that it says the capital [3:55] of Canada is Paris. Okay. So that [3:57] obviously got that wrong. Now that's the [3:59] thing with these super small models. [4:00] They don't have access to the internet. [4:02] They're running locally on your own [4:03] computer. They can make mistakes. And [4:05] this particular one is a super small one [4:07] that's just meant to do small predictive [4:08] text. So don't expect it to be perfect. [4:10] Now if you want to exit this particular [4:13] window, you can type /by. If you do [4:15] that, it's going to bring you out of [4:16] this and then you can continue using [4:18] lama. That's pretty much all you need to [4:20] do from the command line. And now what [4:22] I'll do is show you how you can run [4:23] Olama models from code because that's [4:26] where they actually become useful after [4:28] a quick word from our sponsor boot.dev. [4:31] It's an online learning platform [4:32] designed specifically for back-end [4:34] development, and it approaches learning [4:36] in a way that's far more interactive [4:38] than the usual video-based courses. [4:41] Rather than having you sit through hours [4:42] of lectures, boot.dev puts you straight [4:45] into hands-on coding. You'll work [4:46] directly in your browser building real [4:49] projects while learning back-end [4:50] fundamentals like APIs, databases, and [4:53] serverside logic using Python and Go. [4:56] Now, what makes it stand out is the way [4:58] that it borrows from game design. You'll [5:00] progress through levels, unlock new [5:02] content, and keep your momentum up as [5:04] you go. The platform is filled with [5:06] exercises and practical challenges, so [5:08] you'll end up writing a lot of code, [5:10] which is exactly what helps you improve. [5:12] Now, all of the core content is free to [5:14] access, and if you decide to commit to [5:16] the annual plan, you can use the code [5:18] tech with Tim to get 25% off your first [5:21] year. I've been going through it myself [5:22] lately, and honestly, it's surprisingly [5:24] addictive. Thanks to boot.dev. Now, [5:26] let's get back into it. So at this [5:28] point, I'm assuming that you've [5:29] downloaded a llama, you've pulled at [5:31] least one model, you're kind of familiar [5:32] with running it, and now you probably [5:34] want to actually use this from some type [5:36] of application, right? So from code. So [5:39] in order to do that, there's a few [5:40] different ways. I'm going to show them [5:41] to you directly here inside of Python. [5:43] However, this works in pretty much every [5:45] programming language. Just make a few [5:47] small adjustments. So, whenever you have [5:49] a llama running on your computer, it's [5:51] going to expose an HTTP server or a REST [5:54] API that allows you to just directly [5:57] call the REST API and get a response. [6:00] So, you don't need to use any fancy [6:01] modules. You can just send a normal HTTP [6:03] request directly to that server and it [6:06] will give you a response from the AI [6:08] model. So, for example, this is the URL. [6:10] So, HTTP localhost port and then by [6:13] default, O Lama will run on port 1434. [6:16] You can go to the slash API/ chat. You [6:19] can pass some data. So in this case, I'm [6:21] passing the model I want to use is small [6:23] M2. I want to stream this. No, I don't. [6:26] And then I'm passing a few messages that [6:27] I want this to respond to. So the first [6:29] message is a system message saying [6:31] you're a helpful assistant. And the next [6:32] one, please write me 500 words about the [6:34] fall of Rome. Now system messages are [6:36] ones that tell the model what it should [6:38] be doing and give it additional context. [6:40] Whereas user messages are ones that will [6:42] actually be responded to directly. I [6:44] then send a post request to this [6:45] endpoint and then I'm able to get the [6:47] response and grab the message and the [6:48] content. So if I come here and I type UV [6:52] run and now this is in lama local and [6:54] then main.py it will take a second here [6:56] because it is sending the network [6:57] request and then we should get the [6:59] response. Okay. And you can see that we [7:01] get the response here 500 words about [7:03] the fall of Rome directly inside my [7:05] terminal. Awesome. So that is the first [7:07] way. Now the second way to use Olama [7:09] from code is to simply import the Olama [7:12] module. Now, in order to use this [7:13] module, you need to install it. So, you [7:15] would simply type pip install lama if [7:18] you're using something like Python. And [7:20] then you would be able to actually [7:21] import this module and use it from code. [7:23] So, assuming you've installed that [7:24] module, you'll be able to import it. [7:26] Same thing, you can reference the model [7:27] name, pass any messages you want, and [7:29] this time, rather than manually calling [7:31] the HTTP server, you can simply use the [7:33] lama. [7:35] Function or method, you can pass the [7:37] model and the messages, and then you can [7:39] get the response. This works effectively [7:41] the exact same way as the code you saw [7:42] before just wrapped in this nice Olama [7:44] module. So if I run the command here you [7:46] can see same thing it will take a second [7:48] and then it will give me the response [7:49] and there you go you can see we get our [7:51] 500 words same as before showing up [7:53] inside of the terminal. So to recap, if [7:56] you want to use a llama, you simply [7:58] install it, pull a particular model, and [8:00] then you can call the HTTP REST API if [8:03] you want to get a response or you can [8:04] use something like the Lama module if [8:06] you're working in a language like Python [8:08] to directly get the reply from code. [8:10] This is the first way to run LLMs [8:12] locally. It's very good. It's very [8:14] popular and you can use this inside of [8:16] frameworks like Langchain, for example, [8:18] or really any other AI framework that [8:20] you want. But now I want to move on to [8:21] the next method which is a newer one [8:23] that many people don't know about which [8:25] is the docker model runner. Now the next [8:27] method of running LLMs locally is [8:29] actually my preferred choice and that is [8:31] to use the docker model runner. Strictly [8:34] speaking this is a better more efficient [8:36] way to run models than using a llama. It [8:39] works with more systems. It has better [8:41] GPU acceleration and support and it [8:43] works inside of containerized [8:45] applications especially when you're [8:46] moving to deployment. I'm not going to [8:48] cover the entire setup and all of the [8:50] configuration in this video. I'll give [8:52] you a quick intro so you can see how it [8:53] works, but I do have a longer video [8:55] which I'll link in the description which [8:56] explains how to use this inside of [8:58] things like containerized applications [9:00] that really gets the benefit of using [9:03] Docker Model Runner. So anyways, just [9:05] understand this is a better solution in [9:07] almost every single way. You don't need [9:09] to know why or all of the specific [9:10] details. So I would suggest using this [9:12] one and I'll show you how to set it up. [9:15] Okay, so Docker Model Runner allows you [9:17] to simply run models using Docker. So in [9:20] order to use this, you do need to have [9:22] Docker Desktop installed. If you don't [9:24] already have it, it is completely free [9:25] to download. And you can simply just [9:27] download it by going to Docker Desktop, [9:29] right? Searching for it, and then [9:30] finding the download link like this. So [9:32] you can download it for your operating [9:34] system. Now, once you have Docker [9:35] Desktop downloaded, make sure that you [9:37] have the newest version on your computer [9:39] and simply open it up. Similarly to [9:41] before, you can go to your spotlight [9:42] search and just search for Docker [9:44] Desktop, right? And then open the [9:46] application. Same thing if you're on [9:48] Windows. Now, once you open the [9:49] application, what you're going to want [9:50] to navigate to here is this settings [9:52] window. You're going to want to go to AI [9:55] and make sure that you enable the Docker [9:57] model runner. Okay? So, settings, AI, [9:59] enable docker model runner, and then [10:02] enable the host side TCP support and [10:05] change cores to say all. Now, if for [10:08] some reason you're not seeing this, you [10:09] can go into beta features and make sure [10:11] that you enable the Docker MCP toolkit. [10:14] Once you do that, all of this should be [10:16] good. You can enable this setting and [10:18] you're good to go to run the model [10:20] runner. Now, once you've enabled that, [10:21] you can actually go directly inside of [10:23] Docker Desktop, go to models, and from [10:26] here, you can actually pull models and [10:28] start using them directly inside of [10:29] Docker. So, right now, we're looking at [10:31] my local models. You can see that I have [10:32] one here called small 2. But if I change [10:35] over to Docker Hub, then there's a list [10:37] of models that I can download simply [10:39] from this UI. So, same thing as before, [10:41] I'll just use that small two model. So, [10:42] I can just type small. And you can see [10:44] there's a bunch of options here. Maybe [10:45] we want to go small three now because [10:47] this is the updated version. I'll just [10:49] find the smallest one, which is latest. [10:51] And then I can just press download here [10:53] to pull. So, I can do this directly from [10:55] the user interface. And then I'll be [10:56] able to run the model right inside of [10:58] Docker Desktop. For example, this model [11:00] that's already downloaded. I can just go [11:02] here to run and I can start chatting [11:04] with it directly inside of this UI. You [11:06] can see if I say hello, it just gives me [11:07] the response. Now I can also go into [11:09] inspect. I can see all the information [11:11] and I can view all of the requests that [11:13] have been sent, the context, usage, [11:14] duration, etc., etc. So that's great. We [11:17] can do that directly from the UI, which [11:19] in my opinion is a little bit easier [11:20] than a llama or we can do it directly [11:22] from the command line. So same thing as [11:24] before, if you have docker model runner [11:26] enabled, you can simply type the command [11:28] docker model. When you do this, it will [11:30] give you a very similar UI to O Lama. [11:33] From here, you could type docker model [11:35] pull and then you can pull a particular [11:37] model. So again, same thing, small m2 if [11:39] we wanted to pull that and then it would [11:41] start downloading it. You can see in my [11:42] case, it's cached. I already have it. We [11:44] can type docker model list. If we do [11:48] that, it gives us a list of the local [11:49] models that we have available. And then [11:51] there's a few other ones that we can [11:52] look at here, right? Like packaging [11:53] models, listing them, inspecting, [11:55] running, deleting, etc., etc. So also [11:58] from here we can go docker model run [12:01] small m2. When we do that brings us to [12:03] the interactive terminal. Same thing we [12:05] can type hello. If we want to escape we [12:07] can type /by and it will remove us from [12:09] there. Okay. So effectively same thing [12:12] as a llama just a slightly better user [12:13] interface in my opinion. If you're [12:15] looking for all of the different models [12:16] you can find them directly from the UI [12:18] here or you can go to the hub.docker.com [12:22] which I will link in the description and [12:24] search for all of the available models [12:25] on this page. Cool. So with that said, [12:27] similarly to before, let me show you how [12:29] you can use the Docker model runner [12:31] directly from code. So what you're [12:33] seeing on my screen is an example of [12:34] Python code that sends a request to [12:37] docker model runner to use the same [12:39] model as before, the small M2 model, [12:41] this time from docker model runner as [12:43] opposed to now. You'll notice that the [12:46] only thing that's changed about this [12:48] code is simply the port that I am [12:50] calling. Now, Olama, if you remember [12:52] before, runs an HTTP REST API on port [12:56] 11434. [12:57] Docker Model Runner runs it on port [13:00] 12434. So, the only change is that it's [13:03] 12 versus 11. So, all I have to do is [13:06] simply change my URL here to be 12434 [13:10] and slightly change the endpoint or path [13:13] that you can see. And similarly to [13:15] before, I can send a request directly to [13:17] Docker Model Runner and get a response. [13:20] That is because just like Olama, Docker [13:21] model runner runs its own REST API that [13:24] you can call. So what I can do here is [13:26] type uvron docker local/main.py [13:31] and you'll see that it will just take a [13:32] second here and it will give me the [13:33] response. Okay. And there we go. You can [13:36] see I get the 500word response popping [13:38] up directly here. Now similarly to that [13:41] we also can directly use modules like [13:43] the OpenAI module. So previously I used [13:46] the Olama module. Obviously, we're not [13:47] going to use the Olava module when we're [13:49] working with Docker Model Runner. [13:51] However, there are modules like OpenAI [13:53] or Langchain. Now, those by default are [13:56] going to use something like OpenAI's [13:58] public API. However, they have the [14:01] ability to actually override the base [14:03] URL. So, we can actually change the base [14:06] URL to be the Docker model runner, which [14:08] is running locally on our own computer. [14:10] And now we can use this module just like [14:13] we would before with all of the nice [14:15] completions and methods except instead [14:17] of it sending a request to OpenAI which [14:19] we would need to pay for, we can send it [14:21] locally to our own machine. So you can [14:24] see that if I want to specify the model [14:25] name, I do AI slash and then the name of [14:28] the model. So in this case, small M2. I [14:30] can specify a prompt and then I can use [14:32] the OpenAI module just like I would for [14:34] any OpenAI request. So for example, UV [14:37] run docker local slash and then this [14:40] open AI module. Same thing just wait a [14:43] second and you can see it explains how [14:45] transformers work. Boom. So there you [14:47] go. That is how you use it directly from [14:49] code. Now a lot of you at this point [14:51] might be asking okay well what's the [14:52] difference between the docker model [14:54] runner and strictly speaking the docker [14:56] model runner is more optimized works [14:58] better with GPUs has more support and [15:01] also works for containerized [15:03] applications. So right now I'm just [15:05] showing you a very basic app. However, [15:07] if I had this wrapped inside of a Docker [15:09] container, I can actually expose the [15:11] Docker model runner directly inside of [15:13] that container. I can have the model [15:16] actually wrapped in the container so [15:17] that it doesn't take up a massive amount [15:19] of space and then I can run the model [15:21] directly inside of the containerized [15:23] app. Now, I'm not going to explain that [15:25] in this video because I know a lot of [15:26] you won't find that that useful. [15:28] However, I do have a video on YouTube, [15:30] I'll link it in the description, that [15:31] explains how to do this, where [15:32] essentially you can expose an LLM [15:34] service to your Docker container in [15:36] something like a Docker Compose file [15:38] that allows you to specify what model [15:40] you want to run on what port, etc., [15:43] etc., and that will allow it to ingest [15:45] the AI model in just a much better way, [15:47] especially when it comes to deploying [15:48] this application out. So, with that [15:50] said, that's going to wrap up this [15:52] video. I hope that you found this [15:53] useful. If you did, make sure to leave a [15:55] like, subscribe, and I will see you in [15:57] the next one. [16:00] [music]