[0:00] If you're not running LLMs locally, then
[0:02] you're missing out. Sure, Chat GBT and
[0:05] other hosted solutions are great, but if
[0:07] you care about speed, privacy, and cost,
[0:10] then you'll want to learn how to run
[0:11] them on your own machine. That's why in
[0:14] this video, I'll show you two methods of
[0:16] running LLMs locally from a developer
[0:18] perspective. I'll show you how to
[0:20] download various models, run them
[0:22] interactively, and integrate them into
[0:24] your code to replace something like the
[0:26] OpenAI API. Now, the two methods that
[0:29] I'm going to show you are O Lama and
[0:30] Docker Model Runner. Both are free,
[0:33] highly performant, and you'll learn the
[0:35] differences between them as I go through
[0:36] this video. Let's get started. So, the
[0:39] first method to show you is Olama. Now,
[0:41] Olama is a popular open- source software
[0:44] that allows you to download and run
[0:46] models locally on your own computer.
[0:48] Now, the models that you can run do
[0:49] depend on the hardware and the specs of
[0:51] that computer, but generally speaking,
[0:54] you can run models, you can manage them,
[0:56] delete them, etc., and then access those
[0:58] same models from code. So, let me give
[1:00] you a quick overview into how Olama
[1:02] works. Let's get started. So, first
[1:04] things first, go to ola.com and simply
[1:07] download O Lama for Mac, Windows, or
[1:10] Linux. Once it's downloaded, what you'll
[1:12] want to do is make sure it's running.
[1:13] So, you can go to something like your
[1:15] Spotlight search, search for O Lama, and
[1:17] simply run it and just make sure it's
[1:19] loaded on your computer. If you're on
[1:21] Windows, usually you can see if it's
[1:22] running by looking in the bottom right
[1:24] hand corner in kind of the activity bar
[1:26] and seeing if that Lama icon is
[1:28] appearing. Same thing, you can also kind
[1:30] of go to the explorer search, search for
[1:31] Olama, just double press it. Make sure
[1:33] it's running on your computer. Now, once
[1:35] Olama is running, what you're going to
[1:36] want to do is head over to your
[1:38] terminal. So, open up a terminal or a
[1:40] command prompt and simply type in the
[1:42] command Olama. If you do that and the
[1:44] command works, it means running and
[1:46] you're good to go. Now from here it's
[1:48] very straightforward on how to use this
[1:49] tool but what you're able to do is
[1:51] download and manage various different
[1:53] models. So before you can start using
[1:55] this from code you need to actually pull
[1:57] a model. So the command to pull a model
[2:00] is lama pull and then the name of the
[2:02] model. Now if you're looking for the
[2:04] names of the models you can find them
[2:06] from this Olama kind of search page
[2:08] which I'll link in the description.
[2:09] You'll see there's a ton of different
[2:11] options that you can download, right?
[2:12] You can go through all of them and if
[2:14] you want to see which ones you're
[2:15] actually able to run, you can click into
[2:17] one of them, for example, and you'll see
[2:19] the size of these various different
[2:21] models. Now, for extremely large models,
[2:24] you're going to want to have a GPU and a
[2:25] lot of RAM on your computer. And if
[2:27] you're not sure, you can go to something
[2:28] like Chat GBT and just ask it based on
[2:31] the specs of your machine if you would
[2:32] be able to run this model. So, I suggest
[2:34] when you're testing out just to go with
[2:36] a small one. So, what I'm going to do is
[2:37] just search for one called Small M2.
[2:40] This is a pretty small model that should
[2:42] run locally. Whether you have a GPU, a
[2:44] lot of RAM, CPU, doesn't matter. And you
[2:46] can see that in order to actually run it
[2:48] or to pull it, use a llama run and the
[2:50] name of the model, which is this right
[2:51] here. So, what I'll do is I'll just copy
[2:53] this name here of the one that's only
[2:54] 271 megabytes. I'm going to go to my
[2:57] terminal. I'm going to paste it. So, a
[2:59] llama pull small m2 col 135m. When I do
[3:04] that, it's going to start downloading
[3:05] the model to my machine. Once that's
[3:07] finished, I'll be right back and we can
[3:09] continue. So, the model has just
[3:10] finished downloading here. And now, if I
[3:12] want to start using it, what I can do is
[3:14] I can start by typing LS. That's going
[3:17] to show me a list of all of the
[3:19] different models that I have on my
[3:20] computer. You can see I have a bunch of
[3:21] them here, a lot of them that I was
[3:22] testing with many months ago. And if I
[3:24] want to run a particular model, then I
[3:27] can type run and then simply specify the
[3:30] name of the model that you see here. So,
[3:32] I'm going to go small M2 col135M
[3:36] and then run that. And you'll see that
[3:37] now it brings me into an interactive
[3:39] chat where I can start typing directly
[3:41] with the model. And you'll notice that
[3:43] it's quite fast because it's running on
[3:45] my own computer and we have no network
[3:46] latency. So I'll say what is let's do
[3:49] this. The capital of Canada for example
[3:53] and you can see that it says the capital
[3:55] of Canada is Paris. Okay. So that
[3:57] obviously got that wrong. Now that's the
[3:59] thing with these super small models.
[4:00] They don't have access to the internet.
[4:02] They're running locally on your own
[4:03] computer. They can make mistakes. And
[4:05] this particular one is a super small one
[4:07] that's just meant to do small predictive
[4:08] text. So don't expect it to be perfect.
[4:10] Now if you want to exit this particular
[4:13] window, you can type /by. If you do
[4:15] that, it's going to bring you out of
[4:16] this and then you can continue using
[4:18] lama. That's pretty much all you need to
[4:20] do from the command line. And now what
[4:22] I'll do is show you how you can run
[4:23] Olama models from code because that's
[4:26] where they actually become useful after
[4:28] a quick word from our sponsor boot.dev.
[4:31] It's an online learning platform
[4:32] designed specifically for back-end
[4:34] development, and it approaches learning
[4:36] in a way that's far more interactive
[4:38] than the usual video-based courses.
[4:41] Rather than having you sit through hours
[4:42] of lectures, boot.dev puts you straight
[4:45] into hands-on coding. You'll work
[4:46] directly in your browser building real
[4:49] projects while learning back-end
[4:50] fundamentals like APIs, databases, and
[4:53] serverside logic using Python and Go.
[4:56] Now, what makes it stand out is the way
[4:58] that it borrows from game design. You'll
[5:00] progress through levels, unlock new
[5:02] content, and keep your momentum up as
[5:04] you go. The platform is filled with
[5:06] exercises and practical challenges, so
[5:08] you'll end up writing a lot of code,
[5:10] which is exactly what helps you improve.
[5:12] Now, all of the core content is free to
[5:14] access, and if you decide to commit to
[5:16] the annual plan, you can use the code
[5:18] tech with Tim to get 25% off your first
[5:21] year. I've been going through it myself
[5:22] lately, and honestly, it's surprisingly
[5:24] addictive. Thanks to boot.dev. Now,
[5:26] let's get back into it. So at this
[5:28] point, I'm assuming that you've
[5:29] downloaded a llama, you've pulled at
[5:31] least one model, you're kind of familiar
[5:32] with running it, and now you probably
[5:34] want to actually use this from some type
[5:36] of application, right? So from code. So
[5:39] in order to do that, there's a few
[5:40] different ways. I'm going to show them
[5:41] to you directly here inside of Python.
[5:43] However, this works in pretty much every
[5:45] programming language. Just make a few
[5:47] small adjustments. So, whenever you have
[5:49] a llama running on your computer, it's
[5:51] going to expose an HTTP server or a REST
[5:54] API that allows you to just directly
[5:57] call the REST API and get a response.
[6:00] So, you don't need to use any fancy
[6:01] modules. You can just send a normal HTTP
[6:03] request directly to that server and it
[6:06] will give you a response from the AI
[6:08] model. So, for example, this is the URL.
[6:10] So, HTTP localhost port and then by
[6:13] default, O Lama will run on port 1434.
[6:16] You can go to the slash API/ chat. You
[6:19] can pass some data. So in this case, I'm
[6:21] passing the model I want to use is small
[6:23] M2. I want to stream this. No, I don't.
[6:26] And then I'm passing a few messages that
[6:27] I want this to respond to. So the first
[6:29] message is a system message saying
[6:31] you're a helpful assistant. And the next
[6:32] one, please write me 500 words about the
[6:34] fall of Rome. Now system messages are
[6:36] ones that tell the model what it should
[6:38] be doing and give it additional context.
[6:40] Whereas user messages are ones that will
[6:42] actually be responded to directly. I
[6:44] then send a post request to this
[6:45] endpoint and then I'm able to get the
[6:47] response and grab the message and the
[6:48] content. So if I come here and I type UV
[6:52] run and now this is in lama local and
[6:54] then main.py it will take a second here
[6:56] because it is sending the network
[6:57] request and then we should get the
[6:59] response. Okay. And you can see that we
[7:01] get the response here 500 words about
[7:03] the fall of Rome directly inside my
[7:05] terminal. Awesome. So that is the first
[7:07] way. Now the second way to use Olama
[7:09] from code is to simply import the Olama
[7:12] module. Now, in order to use this
[7:13] module, you need to install it. So, you
[7:15] would simply type pip install lama if
[7:18] you're using something like Python. And
[7:20] then you would be able to actually
[7:21] import this module and use it from code.
[7:23] So, assuming you've installed that
[7:24] module, you'll be able to import it.
[7:26] Same thing, you can reference the model
[7:27] name, pass any messages you want, and
[7:29] this time, rather than manually calling
[7:31] the HTTP server, you can simply use the
[7:33] lama.
[7:35] Function or method, you can pass the
[7:37] model and the messages, and then you can
[7:39] get the response. This works effectively
[7:41] the exact same way as the code you saw
[7:42] before just wrapped in this nice Olama
[7:44] module. So if I run the command here you
[7:46] can see same thing it will take a second
[7:48] and then it will give me the response
[7:49] and there you go you can see we get our
[7:51] 500 words same as before showing up
[7:53] inside of the terminal. So to recap, if
[7:56] you want to use a llama, you simply
[7:58] install it, pull a particular model, and
[8:00] then you can call the HTTP REST API if
[8:03] you want to get a response or you can
[8:04] use something like the Lama module if
[8:06] you're working in a language like Python
[8:08] to directly get the reply from code.
[8:10] This is the first way to run LLMs
[8:12] locally. It's very good. It's very
[8:14] popular and you can use this inside of
[8:16] frameworks like Langchain, for example,
[8:18] or really any other AI framework that
[8:20] you want. But now I want to move on to
[8:21] the next method which is a newer one
[8:23] that many people don't know about which
[8:25] is the docker model runner. Now the next
[8:27] method of running LLMs locally is
[8:29] actually my preferred choice and that is
[8:31] to use the docker model runner. Strictly
[8:34] speaking this is a better more efficient
[8:36] way to run models than using a llama. It
[8:39] works with more systems. It has better
[8:41] GPU acceleration and support and it
[8:43] works inside of containerized
[8:45] applications especially when you're
[8:46] moving to deployment. I'm not going to
[8:48] cover the entire setup and all of the
[8:50] configuration in this video. I'll give
[8:52] you a quick intro so you can see how it
[8:53] works, but I do have a longer video
[8:55] which I'll link in the description which
[8:56] explains how to use this inside of
[8:58] things like containerized applications
[9:00] that really gets the benefit of using
[9:03] Docker Model Runner. So anyways, just
[9:05] understand this is a better solution in
[9:07] almost every single way. You don't need
[9:09] to know why or all of the specific
[9:10] details. So I would suggest using this
[9:12] one and I'll show you how to set it up.
[9:15] Okay, so Docker Model Runner allows you
[9:17] to simply run models using Docker. So in
[9:20] order to use this, you do need to have
[9:22] Docker Desktop installed. If you don't
[9:24] already have it, it is completely free
[9:25] to download. And you can simply just
[9:27] download it by going to Docker Desktop,
[9:29] right? Searching for it, and then
[9:30] finding the download link like this. So
[9:32] you can download it for your operating
[9:34] system. Now, once you have Docker
[9:35] Desktop downloaded, make sure that you
[9:37] have the newest version on your computer
[9:39] and simply open it up. Similarly to
[9:41] before, you can go to your spotlight
[9:42] search and just search for Docker
[9:44] Desktop, right? And then open the
[9:46] application. Same thing if you're on
[9:48] Windows. Now, once you open the
[9:49] application, what you're going to want
[9:50] to navigate to here is this settings
[9:52] window. You're going to want to go to AI
[9:55] and make sure that you enable the Docker
[9:57] model runner. Okay? So, settings, AI,
[9:59] enable docker model runner, and then
[10:02] enable the host side TCP support and
[10:05] change cores to say all. Now, if for
[10:08] some reason you're not seeing this, you
[10:09] can go into beta features and make sure
[10:11] that you enable the Docker MCP toolkit.
[10:14] Once you do that, all of this should be
[10:16] good. You can enable this setting and
[10:18] you're good to go to run the model
[10:20] runner. Now, once you've enabled that,
[10:21] you can actually go directly inside of
[10:23] Docker Desktop, go to models, and from
[10:26] here, you can actually pull models and
[10:28] start using them directly inside of
[10:29] Docker. So, right now, we're looking at
[10:31] my local models. You can see that I have
[10:32] one here called small 2. But if I change
[10:35] over to Docker Hub, then there's a list
[10:37] of models that I can download simply
[10:39] from this UI. So, same thing as before,
[10:41] I'll just use that small two model. So,
[10:42] I can just type small. And you can see
[10:44] there's a bunch of options here. Maybe
[10:45] we want to go small three now because
[10:47] this is the updated version. I'll just
[10:49] find the smallest one, which is latest.
[10:51] And then I can just press download here
[10:53] to pull. So, I can do this directly from
[10:55] the user interface. And then I'll be
[10:56] able to run the model right inside of
[10:58] Docker Desktop. For example, this model
[11:00] that's already downloaded. I can just go
[11:02] here to run and I can start chatting
[11:04] with it directly inside of this UI. You
[11:06] can see if I say hello, it just gives me
[11:07] the response. Now I can also go into
[11:09] inspect. I can see all the information
[11:11] and I can view all of the requests that
[11:13] have been sent, the context, usage,
[11:14] duration, etc., etc. So that's great. We
[11:17] can do that directly from the UI, which
[11:19] in my opinion is a little bit easier
[11:20] than a llama or we can do it directly
[11:22] from the command line. So same thing as
[11:24] before, if you have docker model runner
[11:26] enabled, you can simply type the command
[11:28] docker model. When you do this, it will
[11:30] give you a very similar UI to O Lama.
[11:33] From here, you could type docker model
[11:35] pull and then you can pull a particular
[11:37] model. So again, same thing, small m2 if
[11:39] we wanted to pull that and then it would
[11:41] start downloading it. You can see in my
[11:42] case, it's cached. I already have it. We
[11:44] can type docker model list. If we do
[11:48] that, it gives us a list of the local
[11:49] models that we have available. And then
[11:51] there's a few other ones that we can
[11:52] look at here, right? Like packaging
[11:53] models, listing them, inspecting,
[11:55] running, deleting, etc., etc. So also
[11:58] from here we can go docker model run
[12:01] small m2. When we do that brings us to
[12:03] the interactive terminal. Same thing we
[12:05] can type hello. If we want to escape we
[12:07] can type /by and it will remove us from
[12:09] there. Okay. So effectively same thing
[12:12] as a llama just a slightly better user
[12:13] interface in my opinion. If you're
[12:15] looking for all of the different models
[12:16] you can find them directly from the UI
[12:18] here or you can go to the hub.docker.com
[12:22] which I will link in the description and
[12:24] search for all of the available models
[12:25] on this page. Cool. So with that said,
[12:27] similarly to before, let me show you how
[12:29] you can use the Docker model runner
[12:31] directly from code. So what you're
[12:33] seeing on my screen is an example of
[12:34] Python code that sends a request to
[12:37] docker model runner to use the same
[12:39] model as before, the small M2 model,
[12:41] this time from docker model runner as
[12:43] opposed to now. You'll notice that the
[12:46] only thing that's changed about this
[12:48] code is simply the port that I am
[12:50] calling. Now, Olama, if you remember
[12:52] before, runs an HTTP REST API on port
[12:56] 11434.
[12:57] Docker Model Runner runs it on port
[13:00] 12434. So, the only change is that it's
[13:03] 12 versus 11. So, all I have to do is
[13:06] simply change my URL here to be 12434
[13:10] and slightly change the endpoint or path
[13:13] that you can see. And similarly to
[13:15] before, I can send a request directly to
[13:17] Docker Model Runner and get a response.
[13:20] That is because just like Olama, Docker
[13:21] model runner runs its own REST API that
[13:24] you can call. So what I can do here is
[13:26] type uvron docker local/main.py
[13:31] and you'll see that it will just take a
[13:32] second here and it will give me the
[13:33] response. Okay. And there we go. You can
[13:36] see I get the 500word response popping
[13:38] up directly here. Now similarly to that
[13:41] we also can directly use modules like
[13:43] the OpenAI module. So previously I used
[13:46] the Olama module. Obviously, we're not
[13:47] going to use the Olava module when we're
[13:49] working with Docker Model Runner.
[13:51] However, there are modules like OpenAI
[13:53] or Langchain. Now, those by default are
[13:56] going to use something like OpenAI's
[13:58] public API. However, they have the
[14:01] ability to actually override the base
[14:03] URL. So, we can actually change the base
[14:06] URL to be the Docker model runner, which
[14:08] is running locally on our own computer.
[14:10] And now we can use this module just like
[14:13] we would before with all of the nice
[14:15] completions and methods except instead
[14:17] of it sending a request to OpenAI which
[14:19] we would need to pay for, we can send it
[14:21] locally to our own machine. So you can
[14:24] see that if I want to specify the model
[14:25] name, I do AI slash and then the name of
[14:28] the model. So in this case, small M2. I
[14:30] can specify a prompt and then I can use
[14:32] the OpenAI module just like I would for
[14:34] any OpenAI request. So for example, UV
[14:37] run docker local slash and then this
[14:40] open AI module. Same thing just wait a
[14:43] second and you can see it explains how
[14:45] transformers work. Boom. So there you
[14:47] go. That is how you use it directly from
[14:49] code. Now a lot of you at this point
[14:51] might be asking okay well what's the
[14:52] difference between the docker model
[14:54] runner and strictly speaking the docker
[14:56] model runner is more optimized works
[14:58] better with GPUs has more support and
[15:01] also works for containerized
[15:03] applications. So right now I'm just
[15:05] showing you a very basic app. However,
[15:07] if I had this wrapped inside of a Docker
[15:09] container, I can actually expose the
[15:11] Docker model runner directly inside of
[15:13] that container. I can have the model
[15:16] actually wrapped in the container so
[15:17] that it doesn't take up a massive amount
[15:19] of space and then I can run the model
[15:21] directly inside of the containerized
[15:23] app. Now, I'm not going to explain that
[15:25] in this video because I know a lot of
[15:26] you won't find that that useful.
[15:28] However, I do have a video on YouTube,
[15:30] I'll link it in the description, that
[15:31] explains how to do this, where
[15:32] essentially you can expose an LLM
[15:34] service to your Docker container in
[15:36] something like a Docker Compose file
[15:38] that allows you to specify what model
[15:40] you want to run on what port, etc.,
[15:43] etc., and that will allow it to ingest
[15:45] the AI model in just a much better way,
[15:47] especially when it comes to deploying
[15:48] this application out. So, with that
[15:50] said, that's going to wrap up this
[15:52] video. I hope that you found this
[15:53] useful. If you did, make sure to leave a
[15:55] like, subscribe, and I will see you in
[15:57] the next one.
[16:00] [music]