---
title: 'How to Run LLMs Locally - Full Guide'
source: 'https://youtube.com/watch?v=km5-0jhv0JI'
video_id: 'km5-0jhv0JI'
date: 2026-06-16
duration_sec: 966
---

# How to Run LLMs Locally - Full Guide

> Source: [How to Run LLMs Locally - Full Guide](https://youtube.com/watch?v=km5-0jhv0JI)

## Summary

This video demonstrates two methods for running large language models (LLMs) locally: Ollama and Docker Model Runner. Both are free, high-performance alternatives to hosted solutions, offering benefits in speed, privacy, and cost. The tutorial covers downloading models, running them interactively, and integrating them into code via REST APIs or Python modules.

### Key Points

- **Why Run LLMs Locally** [0:00] — Running LLMs locally provides speed, privacy, and cost benefits over hosted solutions like ChatGPT.
- **Ollama Introduction** [0:39] — Ollama is a popular open-source tool for downloading and managing local LLMs, accessible via command line and code.
- **Installing Ollama** [1:04] — Download Ollama from ollama.com for Mac, Windows, or Linux, then ensure it's running (check system tray or taskbar).
- **Pulling a Model with Ollama** [1:42] — Use `ollama pull <model-name>` to download a model. For testing, start with a small model like 'small-m2' (271 MB).
- **Running a Model Interactively** [3:10] — Use `ollama run <model-name>` to enter an interactive chat. Type `/bye` to exit.
- **Using Ollama from Code via REST API** [5:49] — Ollama exposes an HTTP server on port 11434. Send POST requests to `/api/chat` with model and messages to get responses.
- **Using Ollama from Code via Python Module** [7:09] — Install the `ollama` Python module (`pip install ollama`) and use `ollama.chat()` for a simpler interface.
- **Docker Model Runner Introduction** [8:27] — Docker Model Runner is a newer, more efficient method with better GPU acceleration and container support.
- **Setting Up Docker Model Runner** [9:15] — Install Docker Desktop (latest version), enable the AI model runner in settings, and enable host-side TCP support.
- **Pulling and Running Models in Docker UI** [10:21] — In Docker Desktop, go to Models, pull models from Docker Hub, and run them interactively from the UI.
- **Using Docker Model Runner from CLI** [11:22] — Use `docker model pull`, `docker model list`, and `docker model run` commands similar to Ollama.
- **Using Docker Model Runner from Code** [12:46] — Docker Model Runner runs on port 12434. Change the URL from 11434 to 12434 and adjust the endpoint path.
- **Using OpenAI Module with Docker Model Runner** [13:41] — Override the base URL of the OpenAI module to point to Docker Model Runner (localhost:12434) to use local models with familiar API.

### Conclusion

Both Ollama and Docker Model Runner are effective for running LLMs locally, with Docker Model Runner being more optimized for containerized deployments. The video provides practical steps to get started with either method, from installation to code integration.

## Transcript

If you're not running LLMs locally, then
you're missing out. Sure, Chat GBT and
other hosted solutions are great, but if
you care about speed, privacy, and cost,
then you'll want to learn how to run
them on your own machine. That's why in
this video, I'll show you two methods of
running LLMs locally from a developer
perspective. I'll show you how to
download various models, run them
interactively, and integrate them into
your code to replace something like the
OpenAI API. Now, the two methods that
I'm going to show you are O Lama and
Docker Model Runner. Both are free,
highly performant, and you'll learn the
differences between them as I go through
this video. Let's get started. So, the
first method to show you is Olama. Now,
Olama is a popular open- source software
that allows you to download and run
models locally on your own computer.
Now, the models that you can run do
depend on the hardware and the specs of
that computer, but generally speaking,
you can run models, you can manage them,
delete them, etc., and then access those
same models from code. So, let me give
you a quick overview into how Olama
works. Let's get started. So, first
things first, go to ola.com and simply
download O Lama for Mac, Windows, or
Linux. Once it's downloaded, what you'll
want to do is make sure it's running.
So, you can go to something like your
Spotlight search, search for O Lama, and
simply run it and just make sure it's
loaded on your computer. If you're on
Windows, usually you can see if it's
running by looking in the bottom right
hand corner in kind of the activity bar
and seeing if that Lama icon is
appearing. Same thing, you can also kind
of go to the explorer search, search for
Olama, just double press it. Make sure
it's running on your computer. Now, once
Olama is running, what you're going to
want to do is head over to your
terminal. So, open up a terminal or a
command prompt and simply type in the
command Olama. If you do that and the
command works, it means running and
you're good to go. Now from here it's
very straightforward on how to use this
tool but what you're able to do is
download and manage various different
models. So before you can start using
this from code you need to actually pull
a model. So the command to pull a model
is lama pull and then the name of the
model. Now if you're looking for the
names of the models you can find them
from this Olama kind of search page
which I'll link in the description.
You'll see there's a ton of different
options that you can download, right?
You can go through all of them and if
you want to see which ones you're
actually able to run, you can click into
one of them, for example, and you'll see
the size of these various different
models. Now, for extremely large models,
you're going to want to have a GPU and a
lot of RAM on your computer. And if
you're not sure, you can go to something
like Chat GBT and just ask it based on
the specs of your machine if you would
be able to run this model. So, I suggest
when you're testing out just to go with
a small one. So, what I'm going to do is
just search for one called Small M2.
This is a pretty small model that should
run locally. Whether you have a GPU, a
lot of RAM, CPU, doesn't matter. And you
can see that in order to actually run it
or to pull it, use a llama run and the
name of the model, which is this right
here. So, what I'll do is I'll just copy
this name here of the one that's only
271 megabytes. I'm going to go to my
terminal. I'm going to paste it. So, a
llama pull small m2 col 135m. When I do
that, it's going to start downloading
the model to my machine. Once that's
finished, I'll be right back and we can
continue. So, the model has just
finished downloading here. And now, if I
want to start using it, what I can do is
I can start by typing LS. That's going
to show me a list of all of the
different models that I have on my
computer. You can see I have a bunch of
them here, a lot of them that I was
testing with many months ago. And if I
want to run a particular model, then I
can type run and then simply specify the
name of the model that you see here. So,
I'm going to go small M2 col135M
and then run that. And you'll see that
now it brings me into an interactive
chat where I can start typing directly
with the model. And you'll notice that
it's quite fast because it's running on
my own computer and we have no network
latency. So I'll say what is let's do
this. The capital of Canada for example
and you can see that it says the capital
of Canada is Paris. Okay. So that
obviously got that wrong. Now that's the
thing with these super small models.
They don't have access to the internet.
They're running locally on your own
computer. They can make mistakes. And
this particular one is a super small one
that's just meant to do small predictive
text. So don't expect it to be perfect.
Now if you want to exit this particular
window, you can type /by. If you do
that, it's going to bring you out of
this and then you can continue using
lama. That's pretty much all you need to
do from the command line. And now what
I'll do is show you how you can run
Olama models from code because that's
where they actually become useful after
a quick word from our sponsor boot.dev.
It's an online learning platform
designed specifically for back-end
development, and it approaches learning
in a way that's far more interactive
than the usual video-based courses.
Rather than having you sit through hours
of lectures, boot.dev puts you straight
into hands-on coding. You'll work
directly in your browser building real
projects while learning back-end
fundamentals like APIs, databases, and
serverside logic using Python and Go.
Now, what makes it stand out is the way
that it borrows from game design. You'll
progress through levels, unlock new
content, and keep your momentum up as
you go. The platform is filled with
exercises and practical challenges, so
you'll end up writing a lot of code,
which is exactly what helps you improve.
Now, all of the core content is free to
access, and if you decide to commit to
the annual plan, you can use the code
tech with Tim to get 25% off your first
year. I've been going through it myself
lately, and honestly, it's surprisingly
addictive. Thanks to boot.dev. Now,
let's get back into it. So at this
point, I'm assuming that you've
downloaded a llama, you've pulled at
least one model, you're kind of familiar
with running it, and now you probably
want to actually use this from some type
of application, right? So from code. So
in order to do that, there's a few
different ways. I'm going to show them
to you directly here inside of Python.
However, this works in pretty much every
programming language. Just make a few
small adjustments. So, whenever you have
a llama running on your computer, it's
going to expose an HTTP server or a REST
API that allows you to just directly
call the REST API and get a response.
So, you don't need to use any fancy
modules. You can just send a normal HTTP
request directly to that server and it
will give you a response from the AI
model. So, for example, this is the URL.
So, HTTP localhost port and then by
default, O Lama will run on port 1434.
You can go to the slash API/ chat. You
can pass some data. So in this case, I'm
passing the model I want to use is small
M2. I want to stream this. No, I don't.
And then I'm passing a few messages that
I want this to respond to. So the first
message is a system message saying
you're a helpful assistant. And the next
one, please write me 500 words about the
fall of Rome. Now system messages are
ones that tell the model what it should
be doing and give it additional context.
Whereas user messages are ones that will
actually be responded to directly. I
then send a post request to this
endpoint and then I'm able to get the
response and grab the message and the
content. So if I come here and I type UV
run and now this is in lama local and
then main.py it will take a second here
because it is sending the network
request and then we should get the
response. Okay. And you can see that we
get the response here 500 words about
the fall of Rome directly inside my
terminal. Awesome. So that is the first
way. Now the second way to use Olama
from code is to simply import the Olama
module. Now, in order to use this
module, you need to install it. So, you
would simply type pip install lama if
you're using something like Python. And
then you would be able to actually
import this module and use it from code.
So, assuming you've installed that
module, you'll be able to import it.
Same thing, you can reference the model
name, pass any messages you want, and
this time, rather than manually calling
the HTTP server, you can simply use the
lama.
Function or method, you can pass the
model and the messages, and then you can
get the response. This works effectively
the exact same way as the code you saw
before just wrapped in this nice Olama
module. So if I run the command here you
can see same thing it will take a second
and then it will give me the response
and there you go you can see we get our
500 words same as before showing up
inside of the terminal. So to recap, if
you want to use a llama, you simply
install it, pull a particular model, and
then you can call the HTTP REST API if
you want to get a response or you can
use something like the Lama module if
you're working in a language like Python
to directly get the reply from code.
This is the first way to run LLMs
locally. It's very good. It's very
popular and you can use this inside of
frameworks like Langchain, for example,
or really any other AI framework that
you want. But now I want to move on to
the next method which is a newer one
that many people don't know about which
is the docker model runner. Now the next
method of running LLMs locally is
actually my preferred choice and that is
to use the docker model runner. Strictly
speaking this is a better more efficient
way to run models than using a llama. It
works with more systems. It has better
GPU acceleration and support and it
works inside of containerized
applications especially when you're
moving to deployment. I'm not going to
cover the entire setup and all of the
configuration in this video. I'll give
you a quick intro so you can see how it
works, but I do have a longer video
which I'll link in the description which
explains how to use this inside of
things like containerized applications
that really gets the benefit of using
Docker Model Runner. So anyways, just
understand this is a better solution in
almost every single way. You don't need
to know why or all of the specific
details. So I would suggest using this
one and I'll show you how to set it up.
Okay, so Docker Model Runner allows you
to simply run models using Docker. So in
order to use this, you do need to have
Docker Desktop installed. If you don't
already have it, it is completely free
to download. And you can simply just
download it by going to Docker Desktop,
right? Searching for it, and then
finding the download link like this. So
you can download it for your operating
system. Now, once you have Docker
Desktop downloaded, make sure that you
have the newest version on your computer
and simply open it up. Similarly to
before, you can go to your spotlight
search and just search for Docker
Desktop, right? And then open the
application. Same thing if you're on
Windows. Now, once you open the
application, what you're going to want
to navigate to here is this settings
window. You're going to want to go to AI
and make sure that you enable the Docker
model runner. Okay? So, settings, AI,
enable docker model runner, and then
enable the host side TCP support and
change cores to say all. Now, if for
some reason you're not seeing this, you
can go into beta features and make sure
that you enable the Docker MCP toolkit.
Once you do that, all of this should be
good. You can enable this setting and
you're good to go to run the model
runner. Now, once you've enabled that,
you can actually go directly inside of
Docker Desktop, go to models, and from
here, you can actually pull models and
start using them directly inside of
Docker. So, right now, we're looking at
my local models. You can see that I have
one here called small 2. But if I change
over to Docker Hub, then there's a list
of models that I can download simply
from this UI. So, same thing as before,
I'll just use that small two model. So,
I can just type small. And you can see
there's a bunch of options here. Maybe
we want to go small three now because
this is the updated version. I'll just
find the smallest one, which is latest.
And then I can just press download here
to pull. So, I can do this directly from
the user interface. And then I'll be
able to run the model right inside of
Docker Desktop. For example, this model
that's already downloaded. I can just go
here to run and I can start chatting
with it directly inside of this UI. You
can see if I say hello, it just gives me
the response. Now I can also go into
inspect. I can see all the information
and I can view all of the requests that
have been sent, the context, usage,
duration, etc., etc. So that's great. We
can do that directly from the UI, which
in my opinion is a little bit easier
than a llama or we can do it directly
from the command line. So same thing as
before, if you have docker model runner
enabled, you can simply type the command
docker model. When you do this, it will
give you a very similar UI to O Lama.
From here, you could type docker model
pull and then you can pull a particular
model. So again, same thing, small m2 if
we wanted to pull that and then it would
start downloading it. You can see in my
case, it's cached. I already have it. We
can type docker model list. If we do
that, it gives us a list of the local
models that we have available. And then
there's a few other ones that we can
look at here, right? Like packaging
models, listing them, inspecting,
running, deleting, etc., etc. So also
from here we can go docker model run
small m2. When we do that brings us to
the interactive terminal. Same thing we
can type hello. If we want to escape we
can type /by and it will remove us from
there. Okay. So effectively same thing
as a llama just a slightly better user
interface in my opinion. If you're
looking for all of the different models
you can find them directly from the UI
here or you can go to the hub.docker.com
which I will link in the description and
search for all of the available models
on this page. Cool. So with that said,
similarly to before, let me show you how
you can use the Docker model runner
directly from code. So what you're
seeing on my screen is an example of
Python code that sends a request to
docker model runner to use the same
model as before, the small M2 model,
this time from docker model runner as
opposed to now. You'll notice that the
only thing that's changed about this
code is simply the port that I am
calling. Now, Olama, if you remember
before, runs an HTTP REST API on port
11434.
Docker Model Runner runs it on port
12434. So, the only change is that it's
12 versus 11. So, all I have to do is
simply change my URL here to be 12434
and slightly change the endpoint or path
that you can see. And similarly to
before, I can send a request directly to
Docker Model Runner and get a response.
That is because just like Olama, Docker
model runner runs its own REST API that
you can call. So what I can do here is
type uvron docker local/main.py
and you'll see that it will just take a
second here and it will give me the
response. Okay. And there we go. You can
see I get the 500word response popping
up directly here. Now similarly to that
we also can directly use modules like
the OpenAI module. So previously I used
the Olama module. Obviously, we're not
going to use the Olava module when we're
working with Docker Model Runner.
However, there are modules like OpenAI
or Langchain. Now, those by default are
going to use something like OpenAI's
public API. However, they have the
ability to actually override the base
URL. So, we can actually change the base
URL to be the Docker model runner, which
is running locally on our own computer.
And now we can use this module just like
we would before with all of the nice
completions and methods except instead
of it sending a request to OpenAI which
we would need to pay for, we can send it
locally to our own machine. So you can
see that if I want to specify the model
name, I do AI slash and then the name of
the model. So in this case, small M2. I
can specify a prompt and then I can use
the OpenAI module just like I would for
any OpenAI request. So for example, UV
run docker local slash and then this
open AI module. Same thing just wait a
second and you can see it explains how
transformers work. Boom. So there you
go. That is how you use it directly from
code. Now a lot of you at this point
might be asking okay well what's the
difference between the docker model
runner and strictly speaking the docker
model runner is more optimized works
better with GPUs has more support and
also works for containerized
applications. So right now I'm just
showing you a very basic app. However,
if I had this wrapped inside of a Docker
container, I can actually expose the
Docker model runner directly inside of
that container. I can have the model
actually wrapped in the container so
that it doesn't take up a massive amount
of space and then I can run the model
directly inside of the containerized
app. Now, I'm not going to explain that
in this video because I know a lot of
you won't find that that useful.
However, I do have a video on YouTube,
I'll link it in the description, that
explains how to do this, where
essentially you can expose an LLM
service to your Docker container in
something like a Docker Compose file
that allows you to specify what model
you want to run on what port, etc.,
etc., and that will allow it to ingest
the AI model in just a much better way,
especially when it comes to deploying
this application out. So, with that
said, that's going to wrap up this
video. I hope that you found this
useful. If you did, make sure to leave a
like, subscribe, and I will see you in
the next one.
[music]