AI Summary
This video demonstrates two methods for running large language models (LLMs) locally: Ollama and Docker Model Runner. Both are free, high-performance alternatives to hosted solutions, offering benefits in speed, privacy, and cost. The tutorial covers downloading models, running them interactively, and integrating them into code via REST APIs or Python modules.
Chapters
Running LLMs locally provides speed, privacy, and cost benefits over hosted solutions like ChatGPT.
Ollama is a popular open-source tool for downloading and managing local LLMs, accessible via command line and code.
Download Ollama from ollama.com for Mac, Windows, or Linux, then ensure it's running (check system tray or taskbar).
Use `ollama pull <model-name>` to download a model. For testing, start with a small model like 'small-m2' (271 MB).
Use `ollama run <model-name>` to enter an interactive chat. Type `/bye` to exit.
Ollama exposes an HTTP server on port 11434. Send POST requests to `/api/chat` with model and messages to get responses.
Install the `ollama` Python module (`pip install ollama`) and use `ollama.chat()` for a simpler interface.
Docker Model Runner is a newer, more efficient method with better GPU acceleration and container support.
Install Docker Desktop (latest version), enable the AI model runner in settings, and enable host-side TCP support.
In Docker Desktop, go to Models, pull models from Docker Hub, and run them interactively from the UI.
Use `docker model pull`, `docker model list`, and `docker model run` commands similar to Ollama.
Docker Model Runner runs on port 12434. Change the URL from 11434 to 12434 and adjust the endpoint path.
Override the base URL of the OpenAI module to point to Docker Model Runner (localhost:12434) to use local models with familiar API.
Both Ollama and Docker Model Runner are effective for running LLMs locally, with Docker Model Runner being more optimized for containerized deployments. The video provides practical steps to get started with either method, from installation to code integration.
Mentioned in this Video
Tutorial Checklist
Study Flashcards (10)
What are the two methods shown for running LLMs locally?
easy
Click to reveal answer
What are the two methods shown for running LLMs locally?
Ollama and Docker Model Runner.
0:29
What is the default port for Ollama's REST API?
easy
Click to reveal answer
What is the default port for Ollama's REST API?
11434.
6:13
What is the default port for Docker Model Runner's REST API?
easy
Click to reveal answer
What is the default port for Docker Model Runner's REST API?
12434.
12:57
What command is used to pull a model in Ollama?
easy
Click to reveal answer
What command is used to pull a model in Ollama?
`ollama pull <model-name>`.
1:57
What command is used to list downloaded models in Ollama?
easy
Click to reveal answer
What command is used to list downloaded models in Ollama?
`ollama list`.
3:14
How do you exit an interactive chat session in Ollama?
easy
Click to reveal answer
How do you exit an interactive chat session in Ollama?
Type `/bye`.
4:13
What is the Python module name for Ollama?
medium
Click to reveal answer
What is the Python module name for Ollama?
`ollama` (install via `pip install ollama`).
7:15
What setting must be enabled in Docker Desktop to use the model runner?
medium
Click to reveal answer
What setting must be enabled in Docker Desktop to use the model runner?
Enable Docker Model Runner in Settings > AI, and enable host-side TCP support.
9:55
How can you use the OpenAI Python module with a local Docker Model Runner?
hard
Click to reveal answer
How can you use the OpenAI Python module with a local Docker Model Runner?
Override the base URL to `http://localhost:12434/v1` and use the model name with prefix `ai/`.
13:55
What is the main advantage of Docker Model Runner over Ollama according to the video?
medium
Click to reveal answer
What is the main advantage of Docker Model Runner over Ollama according to the video?
It is more optimized, works better with GPUs, and supports containerized applications.
14:54
💡 Key Takeaways
Benefits of Local LLMs
Clearly states the core value proposition: speed, privacy, and cost savings.
Ollama REST API Integration
Demonstrates a universal method to call local LLMs from any programming language via HTTP.
5:49Docker Model Runner as Preferred Method
Highlights a newer, more efficient alternative with better GPU and container support.
8:27Using OpenAI Module with Local Models
Shows how to reuse existing OpenAI code with a local model by changing the base URL.
13:41Full Transcript
[00:00] If you're not running LLMs locally, then
[00:02] you're missing out. Sure, Chat GBT and
[00:05] other hosted solutions are great, but if
[00:07] you care about speed, privacy, and cost,
[00:10] then you'll want to learn how to run
[00:11] them on your own machine. That's why in
[00:14] this video, I'll show you two methods of
[00:16] running LLMs locally from a developer
[00:18] perspective. I'll show you how to
[00:20] download various models, run them
[00:22] interactively, and integrate them into
[00:24] your code to replace something like the
[00:26] OpenAI API. Now, the two methods that
[00:29] I'm going to show you are O Lama and
[00:30] Docker Model Runner. Both are free,
[00:33] highly performant, and you'll learn the
[00:35] differences between them as I go through
[00:36] this video. Let's get started. So, the
[00:39] first method to show you is Olama. Now,
[00:41] Olama is a popular open- source software
[00:44] that allows you to download and run
[00:46] models locally on your own computer.
[00:48] Now, the models that you can run do
[00:49] depend on the hardware and the specs of
[00:51] that computer, but generally speaking,
[00:54] you can run models, you can manage them,
[00:56] delete them, etc., and then access those
[00:58] same models from code. So, let me give
[01:00] you a quick overview into how Olama
[01:02] works. Let's get started. So, first
[01:04] things first, go to ola.com and simply
[01:07] download O Lama for Mac, Windows, or
[01:10] Linux. Once it's downloaded, what you'll
[01:12] want to do is make sure it's running.
[01:13] So, you can go to something like your
[01:15] Spotlight search, search for O Lama, and
[01:17] simply run it and just make sure it's
[01:19] loaded on your computer. If you're on
[01:21] Windows, usually you can see if it's
[01:22] running by looking in the bottom right
[01:24] hand corner in kind of the activity bar
[01:26] and seeing if that Lama icon is
[01:28] appearing. Same thing, you can also kind
[01:30] of go to the explorer search, search for
[01:31] Olama, just double press it. Make sure
[01:33] it's running on your computer. Now, once
[01:35] Olama is running, what you're going to
[01:36] want to do is head over to your
[01:38] terminal. So, open up a terminal or a
[01:40] command prompt and simply type in the
[01:42] command Olama. If you do that and the
[01:44] command works, it means running and
[01:46] you're good to go. Now from here it's
[01:48] very straightforward on how to use this
[01:49] tool but what you're able to do is
[01:51] download and manage various different
[01:53] models. So before you can start using
[01:55] this from code you need to actually pull
[01:57] a model. So the command to pull a model
[02:00] is lama pull and then the name of the
[02:02] model. Now if you're looking for the
[02:04] names of the models you can find them
[02:06] from this Olama kind of search page
[02:08] which I'll link in the description.
[02:09] You'll see there's a ton of different
[02:11] options that you can download, right?
[02:12] You can go through all of them and if
[02:14] you want to see which ones you're
[02:15] actually able to run, you can click into
[02:17] one of them, for example, and you'll see
[02:19] the size of these various different
[02:21] models. Now, for extremely large models,
[02:24] you're going to want to have a GPU and a
[02:25] lot of RAM on your computer. And if
[02:27] you're not sure, you can go to something
[02:28] like Chat GBT and just ask it based on
[02:31] the specs of your machine if you would
[02:32] be able to run this model. So, I suggest
[02:34] when you're testing out just to go with
[02:36] a small one. So, what I'm going to do is
[02:37] just search for one called Small M2.
[02:40] This is a pretty small model that should
[02:42] run locally. Whether you have a GPU, a
[02:44] lot of RAM, CPU, doesn't matter. And you
[02:46] can see that in order to actually run it
[02:48] or to pull it, use a llama run and the
[02:50] name of the model, which is this right
[02:51] here. So, what I'll do is I'll just copy
[02:53] this name here of the one that's only
[02:54] 271 megabytes. I'm going to go to my
[02:57] terminal. I'm going to paste it. So, a
[02:59] llama pull small m2 col 135m. When I do
[03:04] that, it's going to start downloading
[03:05] the model to my machine. Once that's
[03:07] finished, I'll be right back and we can
[03:09] continue. So, the model has just
[03:10] finished downloading here. And now, if I
[03:12] want to start using it, what I can do is
[03:14] I can start by typing LS. That's going
[03:17] to show me a list of all of the
[03:19] different models that I have on my
[03:20] computer. You can see I have a bunch of
[03:21] them here, a lot of them that I was
[03:22] testing with many months ago. And if I
[03:24] want to run a particular model, then I
[03:27] can type run and then simply specify the
[03:30] name of the model that you see here. So,
[03:32] I'm going to go small M2 col135M
[03:36] and then run that. And you'll see that
[03:37] now it brings me into an interactive
[03:39] chat where I can start typing directly
[03:41] with the model. And you'll notice that
[03:43] it's quite fast because it's running on
[03:45] my own computer and we have no network
[03:46] latency. So I'll say what is let's do
[03:49] this. The capital of Canada for example
[03:53] and you can see that it says the capital
[03:55] of Canada is Paris. Okay. So that
[03:57] obviously got that wrong. Now that's the
[03:59] thing with these super small models.
[04:00] They don't have access to the internet.
[04:02] They're running locally on your own
[04:03] computer. They can make mistakes. And
[04:05] this particular one is a super small one
[04:07] that's just meant to do small predictive
[04:08] text. So don't expect it to be perfect.
[04:10] Now if you want to exit this particular
[04:13] window, you can type /by. If you do
[04:15] that, it's going to bring you out of
[04:16] this and then you can continue using
[04:18] lama. That's pretty much all you need to
[04:20] do from the command line. And now what
[04:22] I'll do is show you how you can run
[04:23] Olama models from code because that's
[04:26] where they actually become useful after
[04:28] a quick word from our sponsor boot.dev.
[04:31] It's an online learning platform
[04:32] designed specifically for back-end
[04:34] development, and it approaches learning
[04:36] in a way that's far more interactive
[04:38] than the usual video-based courses.
[04:41] Rather than having you sit through hours
[04:42] of lectures, boot.dev puts you straight
[04:45] into hands-on coding. You'll work
[04:46] directly in your browser building real
[04:49] projects while learning back-end
[04:50] fundamentals like APIs, databases, and
[04:53] serverside logic using Python and Go.
[04:56] Now, what makes it stand out is the way
[04:58] that it borrows from game design. You'll
[05:00] progress through levels, unlock new
[05:02] content, and keep your momentum up as
[05:04] you go. The platform is filled with
[05:06] exercises and practical challenges, so
[05:08] you'll end up writing a lot of code,
[05:10] which is exactly what helps you improve.
[05:12] Now, all of the core content is free to
[05:14] access, and if you decide to commit to
[05:16] the annual plan, you can use the code
[05:18] tech with Tim to get 25% off your first
[05:21] year. I've been going through it myself
[05:22] lately, and honestly, it's surprisingly
[05:24] addictive. Thanks to boot.dev. Now,
[05:26] let's get back into it. So at this
[05:28] point, I'm assuming that you've
[05:29] downloaded a llama, you've pulled at
[05:31] least one model, you're kind of familiar
[05:32] with running it, and now you probably
[05:34] want to actually use this from some type
[05:36] of application, right? So from code. So
[05:39] in order to do that, there's a few
[05:40] different ways. I'm going to show them
[05:41] to you directly here inside of Python.
[05:43] However, this works in pretty much every
[05:45] programming language. Just make a few
[05:47] small adjustments. So, whenever you have
[05:49] a llama running on your computer, it's
[05:51] going to expose an HTTP server or a REST
[05:54] API that allows you to just directly
[05:57] call the REST API and get a response.
[06:00] So, you don't need to use any fancy
[06:01] modules. You can just send a normal HTTP
[06:03] request directly to that server and it
[06:06] will give you a response from the AI
[06:08] model. So, for example, this is the URL.
[06:10] So, HTTP localhost port and then by
[06:13] default, O Lama will run on port 1434.
[06:16] You can go to the slash API/ chat. You
[06:19] can pass some data. So in this case, I'm
[06:21] passing the model I want to use is small
[06:23] M2. I want to stream this. No, I don't.
[06:26] And then I'm passing a few messages that
[06:27] I want this to respond to. So the first
[06:29] message is a system message saying
[06:31] you're a helpful assistant. And the next
[06:32] one, please write me 500 words about the
[06:34] fall of Rome. Now system messages are
[06:36] ones that tell the model what it should
[06:38] be doing and give it additional context.
[06:40] Whereas user messages are ones that will
[06:42] actually be responded to directly. I
[06:44] then send a post request to this
[06:45] endpoint and then I'm able to get the
[06:47] response and grab the message and the
[06:48] content. So if I come here and I type UV
[06:52] run and now this is in lama local and
[06:54] then main.py it will take a second here
[06:56] because it is sending the network
[06:57] request and then we should get the
[06:59] response. Okay. And you can see that we
[07:01] get the response here 500 words about
[07:03] the fall of Rome directly inside my
[07:05] terminal. Awesome. So that is the first
[07:07] way. Now the second way to use Olama
[07:09] from code is to simply import the Olama
[07:12] module. Now, in order to use this
[07:13] module, you need to install it. So, you
[07:15] would simply type pip install lama if
[07:18] you're using something like Python. And
[07:20] then you would be able to actually
[07:21] import this module and use it from code.
[07:23] So, assuming you've installed that
[07:24] module, you'll be able to import it.
[07:26] Same thing, you can reference the model
[07:27] name, pass any messages you want, and
[07:29] this time, rather than manually calling
[07:31] the HTTP server, you can simply use the
[07:33] lama.
[07:35] Function or method, you can pass the
[07:37] model and the messages, and then you can
[07:39] get the response. This works effectively
[07:41] the exact same way as the code you saw
[07:42] before just wrapped in this nice Olama
[07:44] module. So if I run the command here you
[07:46] can see same thing it will take a second
[07:48] and then it will give me the response
[07:49] and there you go you can see we get our
[07:51] 500 words same as before showing up
[07:53] inside of the terminal. So to recap, if
[07:56] you want to use a llama, you simply
[07:58] install it, pull a particular model, and
[08:00] then you can call the HTTP REST API if
[08:03] you want to get a response or you can
[08:04] use something like the Lama module if
[08:06] you're working in a language like Python
[08:08] to directly get the reply from code.
[08:10] This is the first way to run LLMs
[08:12] locally. It's very good. It's very
[08:14] popular and you can use this inside of
[08:16] frameworks like Langchain, for example,
[08:18] or really any other AI framework that
[08:20] you want. But now I want to move on to
[08:21] the next method which is a newer one
[08:23] that many people don't know about which
[08:25] is the docker model runner. Now the next
[08:27] method of running LLMs locally is
[08:29] actually my preferred choice and that is
[08:31] to use the docker model runner. Strictly
[08:34] speaking this is a better more efficient
[08:36] way to run models than using a llama. It
[08:39] works with more systems. It has better
[08:41] GPU acceleration and support and it
[08:43] works inside of containerized
[08:45] applications especially when you're
[08:46] moving to deployment. I'm not going to
[08:48] cover the entire setup and all of the
[08:50] configuration in this video. I'll give
[08:52] you a quick intro so you can see how it
[08:53] works, but I do have a longer video
[08:55] which I'll link in the description which
[08:56] explains how to use this inside of
[08:58] things like containerized applications
[09:00] that really gets the benefit of using
[09:03] Docker Model Runner. So anyways, just
[09:05] understand this is a better solution in
[09:07] almost every single way. You don't need
[09:09] to know why or all of the specific
[09:10] details. So I would suggest using this
[09:12] one and I'll show you how to set it up.
[09:15] Okay, so Docker Model Runner allows you
[09:17] to simply run models using Docker. So in
[09:20] order to use this, you do need to have
[09:22] Docker Desktop installed. If you don't
[09:24] already have it, it is completely free
[09:25] to download. And you can simply just
[09:27] download it by going to Docker Desktop,
[09:29] right? Searching for it, and then
[09:30] finding the download link like this. So
[09:32] you can download it for your operating
[09:34] system. Now, once you have Docker
[09:35] Desktop downloaded, make sure that you
[09:37] have the newest version on your computer
[09:39] and simply open it up. Similarly to
[09:41] before, you can go to your spotlight
[09:42] search and just search for Docker
[09:44] Desktop, right? And then open the
[09:46] application. Same thing if you're on
[09:48] Windows. Now, once you open the
[09:49] application, what you're going to want
[09:50] to navigate to here is this settings
[09:52] window. You're going to want to go to AI
[09:55] and make sure that you enable the Docker
[09:57] model runner. Okay? So, settings, AI,
[09:59] enable docker model runner, and then
[10:02] enable the host side TCP support and
[10:05] change cores to say all. Now, if for
[10:08] some reason you're not seeing this, you
[10:09] can go into beta features and make sure
[10:11] that you enable the Docker MCP toolkit.
[10:14] Once you do that, all of this should be
[10:16] good. You can enable this setting and
[10:18] you're good to go to run the model
[10:20] runner. Now, once you've enabled that,
[10:21] you can actually go directly inside of
[10:23] Docker Desktop, go to models, and from
[10:26] here, you can actually pull models and
[10:28] start using them directly inside of
[10:29] Docker. So, right now, we're looking at
[10:31] my local models. You can see that I have
[10:32] one here called small 2. But if I change
[10:35] over to Docker Hub, then there's a list
[10:37] of models that I can download simply
[10:39] from this UI. So, same thing as before,
[10:41] I'll just use that small two model. So,
[10:42] I can just type small. And you can see
[10:44] there's a bunch of options here. Maybe
[10:45] we want to go small three now because
[10:47] this is the updated version. I'll just
[10:49] find the smallest one, which is latest.
[10:51] And then I can just press download here
[10:53] to pull. So, I can do this directly from
[10:55] the user interface. And then I'll be
[10:56] able to run the model right inside of
[10:58] Docker Desktop. For example, this model
[11:00] that's already downloaded. I can just go
[11:02] here to run and I can start chatting
[11:04] with it directly inside of this UI. You
[11:06] can see if I say hello, it just gives me
[11:07] the response. Now I can also go into
[11:09] inspect. I can see all the information
[11:11] and I can view all of the requests that
[11:13] have been sent, the context, usage,
[11:14] duration, etc., etc. So that's great. We
[11:17] can do that directly from the UI, which
[11:19] in my opinion is a little bit easier
[11:20] than a llama or we can do it directly
[11:22] from the command line. So same thing as
[11:24] before, if you have docker model runner
[11:26] enabled, you can simply type the command
[11:28] docker model. When you do this, it will
[11:30] give you a very similar UI to O Lama.
[11:33] From here, you could type docker model
[11:35] pull and then you can pull a particular
[11:37] model. So again, same thing, small m2 if
[11:39] we wanted to pull that and then it would
[11:41] start downloading it. You can see in my
[11:42] case, it's cached. I already have it. We
[11:44] can type docker model list. If we do
[11:48] that, it gives us a list of the local
[11:49] models that we have available. And then
[11:51] there's a few other ones that we can
[11:52] look at here, right? Like packaging
[11:53] models, listing them, inspecting,
[11:55] running, deleting, etc., etc. So also
[11:58] from here we can go docker model run
[12:01] small m2. When we do that brings us to
[12:03] the interactive terminal. Same thing we
[12:05] can type hello. If we want to escape we
[12:07] can type /by and it will remove us from
[12:09] there. Okay. So effectively same thing
[12:12] as a llama just a slightly better user
[12:13] interface in my opinion. If you're
[12:15] looking for all of the different models
[12:16] you can find them directly from the UI
[12:18] here or you can go to the hub.docker.com
[12:22] which I will link in the description and
[12:24] search for all of the available models
[12:25] on this page. Cool. So with that said,
[12:27] similarly to before, let me show you how
[12:29] you can use the Docker model runner
[12:31] directly from code. So what you're
[12:33] seeing on my screen is an example of
[12:34] Python code that sends a request to
[12:37] docker model runner to use the same
[12:39] model as before, the small M2 model,
[12:41] this time from docker model runner as
[12:43] opposed to now. You'll notice that the
[12:46] only thing that's changed about this
[12:48] code is simply the port that I am
[12:50] calling. Now, Olama, if you remember
[12:52] before, runs an HTTP REST API on port
[12:56] 11434.
[12:57] Docker Model Runner runs it on port
[13:00] 12434. So, the only change is that it's
[13:03] 12 versus 11. So, all I have to do is
[13:06] simply change my URL here to be 12434
[13:10] and slightly change the endpoint or path
[13:13] that you can see. And similarly to
[13:15] before, I can send a request directly to
[13:17] Docker Model Runner and get a response.
[13:20] That is because just like Olama, Docker
[13:21] model runner runs its own REST API that
[13:24] you can call. So what I can do here is
[13:26] type uvron docker local/main.py
[13:31] and you'll see that it will just take a
[13:32] second here and it will give me the
[13:33] response. Okay. And there we go. You can
[13:36] see I get the 500word response popping
[13:38] up directly here. Now similarly to that
[13:41] we also can directly use modules like
[13:43] the OpenAI module. So previously I used
[13:46] the Olama module. Obviously, we're not
[13:47] going to use the Olava module when we're
[13:49] working with Docker Model Runner.
[13:51] However, there are modules like OpenAI
[13:53] or Langchain. Now, those by default are
[13:56] going to use something like OpenAI's
[13:58] public API. However, they have the
[14:01] ability to actually override the base
[14:03] URL. So, we can actually change the base
[14:06] URL to be the Docker model runner, which
[14:08] is running locally on our own computer.
[14:10] And now we can use this module just like
[14:13] we would before with all of the nice
[14:15] completions and methods except instead
[14:17] of it sending a request to OpenAI which
[14:19] we would need to pay for, we can send it
[14:21] locally to our own machine. So you can
[14:24] see that if I want to specify the model
[14:25] name, I do AI slash and then the name of
[14:28] the model. So in this case, small M2. I
[14:30] can specify a prompt and then I can use
[14:32] the OpenAI module just like I would for
[14:34] any OpenAI request. So for example, UV
[14:37] run docker local slash and then this
[14:40] open AI module. Same thing just wait a
[14:43] second and you can see it explains how
[14:45] transformers work. Boom. So there you
[14:47] go. That is how you use it directly from
[14:49] code. Now a lot of you at this point
[14:51] might be asking okay well what's the
[14:52] difference between the docker model
[14:54] runner and strictly speaking the docker
[14:56] model runner is more optimized works
[14:58] better with GPUs has more support and
[15:01] also works for containerized
[15:03] applications. So right now I'm just
[15:05] showing you a very basic app. However,
[15:07] if I had this wrapped inside of a Docker
[15:09] container, I can actually expose the
[15:11] Docker model runner directly inside of
[15:13] that container. I can have the model
[15:16] actually wrapped in the container so
[15:17] that it doesn't take up a massive amount
[15:19] of space and then I can run the model
[15:21] directly inside of the containerized
[15:23] app. Now, I'm not going to explain that
[15:25] in this video because I know a lot of
[15:26] you won't find that that useful.
[15:28] However, I do have a video on YouTube,
[15:30] I'll link it in the description, that
[15:31] explains how to do this, where
[15:32] essentially you can expose an LLM
[15:34] service to your Docker container in
[15:36] something like a Docker Compose file
[15:38] that allows you to specify what model
[15:40] you want to run on what port, etc.,
[15:43] etc., and that will allow it to ingest
[15:45] the AI model in just a much better way,
[15:47] especially when it comes to deploying
[15:48] this application out. So, with that
[15:50] said, that's going to wrap up this
[15:52] video. I hope that you found this
[15:53] useful. If you did, make sure to leave a
[15:55] like, subscribe, and I will see you in
[15:57] the next one.
[16:00] [music]