---
title: 'Run Your Own LLM on a Server - Ollama + Gemma 3'
source: 'https://youtube.com/watch?v=DHjazHCUlow'
video_id: 'DHjazHCUlow'
date: 2026-06-18
duration_sec: 0
---

# Run Your Own LLM on a Server - Ollama + Gemma 3

> Source: [Run Your Own LLM on a Server - Ollama + Gemma 3](https://youtube.com/watch?v=DHjazHCUlow)

## Summary

This video provides a step-by-step tutorial on self-hosting a large language model (LLM) using Ollama and Gemma 3 on a VPS. It covers the reasons for self-hosting, the distinction between Ollama and models, server selection, setup, API exposure, and building a simple chat application. The tutorial also discusses tradeoffs between self-hosting and hosted APIs.

### Key Points

- **Reasons for Self-Hosting** [0:24] — Privacy, predictable cost, working on private networks, and version control are the four main reasons to self-host an LLM.
- **Ollama vs. Models** [0:38] — Ollama is the runtime that loads and runs models; models like Gemma, Llama, Mistral are the neural networks.
- **Server vs. Laptop** [1:08] — A server with enough RAM (e.g., 8GB for 4B models) and always-on connectivity is recommended over a laptop.
- **VPS Selection** [1:25] — Hostinger VPS with full root access, NVMe SSD, AMD EPYC processors, and a one-click LLM template is used.
- **Plan Tiers** [2:48] — KVM 2 (2 vCPUs, 8GB RAM) for 3B/4B models; KVM 4 for 7B models.
- **Pulling Gemma 3** [4:44] — The template preloads Ollama; we pull Gemma 3 4B instead of the default Llama 3.
- **API Access** [5:55] — Ollama serves a REST API on port 11434 and an OpenAI-compatible endpoint at /v1/chat/completions.
- **Security** [6:37] — Add authentication (IP whitelisting or reverse proxy) if exposing the API publicly.
- **Building a Chat App** [7:03] — A Node.js/Express backend streams tokens from Ollama to a plain HTML frontend in real time.
- **Tradeoffs** [9:09] — Self-hosting wins for learning, privacy, side projects, and predictable costs; hosted APIs win for frontier models and high throughput.

## Transcript

Can you actually run a real LLM
yourself? No OpenAI key, no Anthropic
key, no per token bill, just a model
running on a server you control. Yes,
and in the next few minutes we're going
to set one up, expose it as an API, and
build a small chat app that talks to it.
There are four real reasons people do
this: privacy, predictable cost, working
on private networks, and version
control. We'll also talk about when
self-hosting is the wrong call, because
there are plenty of those cases, and
pretending otherwise wastes your time.
Quick distinction before we touch
anything. These two things get mixed up
constantly. Ollama is the runtime. It's
the program that loads a model into
memory and handles requests. Think of it
like a JVM. By itself, it doesn't do
anything useful, but it knows how to run
things.
Gemma, Llama, Mistral, Qwen, those are the
models, the actual neural networks,
multi-gigabyte files of weights that
produce text. You can swap models
anytime. Pull a new one, point Ollama at
it, done.
Could you do this on your laptop? Sure.
Should you? Probably not. Laptops sleep,
models eat RAM, and you don't want a 4
billion parameter model warming your CPU
mid-call. So, a server: always on,
reachable, with enough RAM to hold a
model without choking.
So, for this tutorial, I'll be using a
Hostinger VPS, and I want to walk
through why, because the plan choice is
important here. This isn't a generic "any
server works" situation. Memory and CPU
matter here, more than for a regular web
app.
This is Hostinger's VPS hosting page,
their AI managed VPS line.
Before we continue, a few things are
worth calling out.
Full root access, because Ollama runs as
a system service, we need to install it,
expose ports, and adjust firewall rules.
Locked-down hosts won't let you do that.
NVMe SSD storage, that's relevant
because model files are big. Gemma 3 4B
is roughly 3.3 gigs on disk. A 7B model
is around 4 to 5. Loading from a slow
disk means slow cold starts. AMD EPYC
processors, KVM virtualization. KVM
gives you a real virtual machine with
allocated CPU and RAM. Stronger
isolation than the kind of shared
container hosting where someone else's
workload can steal your CPU. The
one-click LLM template, specifically an
Ubuntu image preloaded with Ollama, the
Llama 3 model as a default, and Open Web
UI as a browser chat interface. We'll
use the template, skip the default
model, and pull Gemma 3 ourselves. Still
saves us 5 minutes of setup. About the
plan. There are four KVM tiers. I'm
picking KVM 2, two vCPUs, 8 gigs of RAM,
100 GB NVMe. Gemma 3 4B in its 4-bit
version uses about 3 to 5 gigs of RAM in
practice. The model file is 3.3 gigs on
disk plus context as it grows. OS and
runtime take another 1 to 2. That leaves
a gig or two of headroom on eight.
Tight, but it's enough for a 4B model
and a small app on top. If you want to
step up to 7B models, that's KVM 4
territory. 7B in 4-bit lands in the 5 to
7 gig range, which doesn't fit cleanly
on eight. So, KVM 2 if you're staying
with 3B or 4B, which is where we're
going for this build. KVM 4 if you want
to experiment with 7B.
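
The sizing argument above is simple range arithmetic, and it can be sketched in a few lines. This is only an illustration: the function name `headroomRange` is made up for this example, and the inputs are the rough ranges quoted in the video (3 to 5 GB for Gemma 3 4B in RAM, 5 to 7 GB for a 7B model, 1 to 2 GB for the OS and runtime), not measurements.

```javascript
// Rough RAM headroom estimate. All values are in gigabytes and come from
// the approximate ranges quoted in the video, not from measurements.
function headroomRange(totalGb, modelGb, osGb) {
  return {
    // Worst case: model and OS both at the top of their ranges.
    worst: totalGb - (modelGb.max + osGb.max),
    // Best case: both at the bottom of their ranges.
    best: totalGb - (modelGb.min + osGb.min),
  };
}

const os = { min: 1, max: 2 };

// Gemma 3 4B (4-bit) on an 8 GB plan.
console.log(headroomRange(8, { min: 3, max: 5 }, os)); // { worst: 1, best: 4 }

// A 7B model (4-bit) on the same 8 GB plan.
console.log(headroomRange(8, { min: 5, max: 7 }, os)); // { worst: -1, best: 2 }
```

A negative worst case is what "doesn't fit cleanly" means in practice: the 7B model may load, but a long context or a busy OS can push the box into swap.
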
If you want to follow along, the coupon
LTS10 takes 10% off whichever plan you
pick. Link is hostinger.com/lts10.
Also in the description.
After provisioning, you land in hPanel.
From the VPS section, head into OS and
panel, click operating system, pick
Ubuntu, then select applications, and
pick Ollama, and the template will take
care of VPS setup. hPanel gives you the
VPS overview, free weekly backups,
firewall settings, a browser-based
terminal, meaning no SSH key fiddling if
you don't feel like it. One click and
you're inside the box. That icon is
Kodee, their AI assistant. You can hand
it a task in plain English and it'll
actually run the VPS commands itself. No
copy-paste needed. Worth knowing about,
even if we won't use it today.
Because the template ships with Ollama,
we can just verify it's there. It prints
the version, so we're good. If you're on
a plain Ubuntu install instead, this
one-liner will do the job. Same outcome,
60 seconds longer.
First, let's see what's already there.
The template came with Llama 3 already
pulled.
Fine model, but we're going with Gemma 3
4B for this build. Google's open-weight
model. Smaller, fast on CPU, fits on our
8-gig box. The first pull is the slow
part.
About 3.3 gigs over the wire. I've sped
this up. After it's local, it's local.
You won't pull it again.
Ollama list confirms what's installed.
Both models sitting there ready to go.
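
The same check can be done from code, since Ollama's REST API exposes the installed-model list at `/api/tags`. The sketch below assumes Ollama is listening on its default address (`http://localhost:11434`) and that the model was pulled under the tag `gemma3:4b`; the helper names `hasModel` and `listModels` are invented for this example. It needs Node 18 or newer for the global `fetch`.

```javascript
// Assumption: Ollama is running on the same machine on its default port.
const OLLAMA_URL = "http://localhost:11434";

// Pure helper: does a /api/tags response contain the given model tag?
function hasModel(tagsResponse, name) {
  const models = tagsResponse.models ?? [];
  return models.some((m) => m.name === name);
}

// Fetch the list of locally installed models from the running Ollama server.
async function listModels(baseUrl = OLLAMA_URL) {
  const res = await fetch(`${baseUrl}/api/tags`);
  if (!res.ok) throw new Error(`Ollama returned HTTP ${res.status}`);
  return res.json();
}
```

A typical call would be `listModels().then((tags) => console.log(hasModel(tags, "gemma3:4b")))`, which should print `true` once the pull has finished.
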
To actually talk to it,
the first request takes a few seconds,
the model has to load into RAM. After
that, the next prompts are fast.
There it is. The model we just pulled
replying from our own server. Nothing
external touched.
Now, the cost. What's this taking in
RAM?
With the model loaded, we're using about
4 to 5 gigs of RAM, more than half our
8-gig box. Self-hosting isn't free.
Instead of paying per token, you're
paying a fixed monthly bill for the
server. If your traffic is bursty or small,
hosted APIs are usually cheaper. If it's
steady or you've got privacy
constraints, self-hosting wins.
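
The fixed-bill versus per-token comparison comes down to a break-even volume. The sketch below shows the arithmetic only. The video quotes no prices, so the figures used in the example call (a 10-per-month server and 0.50 per million tokens) are placeholders, and both function names are made up for illustration.

```javascript
// Monthly token volume at which a fixed server bill equals per-token billing.
function breakEvenTokensPerMonth(serverCostPerMonth, pricePerMillionTokens) {
  return (serverCostPerMonth / pricePerMillionTokens) * 1_000_000;
}

// Which option costs less at a given monthly volume, on price alone.
function cheaperOption(tokensPerMonth, serverCostPerMonth, pricePerMillionTokens) {
  const hostedCost = (tokensPerMonth / 1_000_000) * pricePerMillionTokens;
  return hostedCost < serverCostPerMonth ? "hosted" : "self-hosted";
}

// Placeholder prices, not real quotes.
console.log(breakEvenTokensPerMonth(10, 0.5)); // 20000000
console.log(cheaperOption(1_000_000, 10, 0.5)); // hosted
console.log(cheaperOption(50_000_000, 10, 0.5)); // self-hosted
```

This compares price only. It ignores whether a small CPU-bound server can actually serve the break-even volume, and it ignores differences in model quality.
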
Now, we can call it from code. Ollama's
been serving a REST API on port 11434
the whole time. Nothing else to start,
nothing else to configure.
We hit the generate endpoint with a
simple prompt. Standard JSON, model
name, generated text, token counts,
timings, the usual fields you'd want.
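
A minimal sketch of that call from Node 18+, assuming Ollama on `http://localhost:11434` and the model tag `gemma3:4b`. The helper names `generate` and `summarize` are invented here; the response fields (`response`, `prompt_eval_count`, `eval_count`, `eval_duration`) are the ones Ollama's generate endpoint returns, with durations in nanoseconds.

```javascript
// Assumption: Ollama is running on the same machine on its default port.
const OLLAMA_URL = "http://localhost:11434";

// Send one non-streaming prompt to /api/generate and return the JSON reply.
async function generate(prompt, model = "gemma3:4b") {
  const res = await fetch(`${OLLAMA_URL}/api/generate`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, prompt, stream: false }),
  });
  if (!res.ok) throw new Error(`Ollama returned HTTP ${res.status}`);
  return res.json();
}

// Pull out the fields mentioned above. eval_duration is in nanoseconds,
// so divide by 1e9 to get seconds before computing throughput.
function summarize(result) {
  const seconds = result.eval_duration / 1e9;
  return {
    text: result.response,
    promptTokens: result.prompt_eval_count,
    outputTokens: result.eval_count,
    tokensPerSecond: result.eval_count / seconds,
  };
}
```

Usage would look like `generate("Why is the sky blue?").then((r) => console.log(summarize(r)))`.
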
Ollama also exposes an OpenAI compatible
endpoint at /v1/chat/completions.
If you have code using the OpenAI SDK,
just point its base URL at your server.
The basic chat completions endpoint
works. One-line change and you've moved
off OpenAI. Not every OpenAI endpoint is
mirrored, but the everyday ones are
there.
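
To show what the OpenAI-compatible route looks like on the wire, here is a sketch that calls it with plain `fetch` instead of the OpenAI SDK, so it has no dependencies. The base URL and the `gemma3:4b` tag are assumptions, and `buildChatRequest` and `chat` are names made up for this example. The request and response shapes follow the standard chat completions format.

```javascript
// Build the URL and fetch options for an OpenAI-style chat completion.
function buildChatRequest(baseUrl, model, userText) {
  return {
    url: `${baseUrl}/v1/chat/completions`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        model,
        messages: [{ role: "user", content: userText }],
      }),
    },
  };
}

// Send the request and return the assistant's reply text.
async function chat(userText, model = "gemma3:4b", baseUrl = "http://localhost:11434") {
  const { url, init } = buildChatRequest(baseUrl, model, userText);
  const res = await fetch(url, init);
  if (!res.ok) throw new Error(`Ollama returned HTTP ${res.status}`);
  const data = await res.json();
  return data.choices[0].message.content;
}
```

With the OpenAI SDK the equivalent change is the base URL, pointed at the server's `/v1` path. The SDK insists on an API key being set, but Ollama does not check it.
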
If you ever open port 11434 to the
public internet, add authentication.
Ollama doesn't ship with auth by default.
The hPanel firewall section is where
you'd lock things down. Whitelist your
app's IP or put a reverse proxy in front
with an API key.
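
The video doesn't show the proxy itself, so the following is only one possible shape for the API key check such a proxy would run before forwarding a request to Ollama. It assumes the client sends the key as a `Bearer` token in the `Authorization` header; `sameSecret` and `isAuthorized` are hypothetical names.

```javascript
const crypto = require("node:crypto");

// Compare two secrets in constant time. Hashing first gives both buffers
// the same length, which crypto.timingSafeEqual requires.
function sameSecret(a, b) {
  const ha = crypto.createHash("sha256").update(a).digest();
  const hb = crypto.createHash("sha256").update(b).digest();
  return crypto.timingSafeEqual(ha, hb);
}

// headers: a lower-cased header map, as Node's http module provides it.
function isAuthorized(headers, expectedKey) {
  const value = headers["authorization"] ?? "";
  const prefix = "Bearer ";
  if (!value.startsWith(prefix)) return false;
  return sameSecret(value.slice(prefix.length), expectedKey);
}
```

In a proxy handler, a failed check would answer with HTTP 401 and never touch the Ollama port. The expected key would normally come from an environment variable rather than from source code.
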
For this video, we're keeping everything
on the same server. The app and Ollama
will talk over localhost, so we don't
need to expose the port at all.
Time to build something.
Keeping it deliberately simple, a chat
UI that streams responses from our
model. Node and Express on the back,
plain HTML on the front, around 80 lines
total, and it'll teach you the pattern
you can use in anything bigger.
Make the folder, init a package, install
Express. Done.
Now the server. Standard Express setup.
We serve a static front end from a
public folder and accept JSON bodies.
Nothing fancy.
One endpoint, POST /chat. It takes a
prompt from the request body, forwards
it to Ollama with stream: true, and pipes
the stream of tokens back to the browser
as they arrive.
Ollama uses newline-delimited JSON for
streaming. Each line is one chunk of the
response.
We buffer the chunks, split on
newlines, parse each line, and write the
response field back to the browser. The
browser sees tokens appear in real time,
same feel as ChatGPT.
The buffer is required as TCP chunks
don't always end on a JSON line. One
line can get split across two chunks.
Without it, you'd hit JSON.parse errors
on partial lines, maybe one request in
15.
Failures that look random until you spot
the cause.
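
The buffering pattern described above can be sketched as follows. This is a reconstruction from the description, not the code shown in the video: `createNdjsonParser` and `createApp` are names chosen for this example, and the Ollama URL and `gemma3:4b` tag are assumptions. It needs `npm install express` and Node 18+ for the global `fetch`.

```javascript
// Buffers raw text chunks and emits one parsed object per complete line.
function createNdjsonParser(onObject) {
  let buffer = "";
  return {
    push(text) {
      buffer += text;
      const lines = buffer.split("\n");
      buffer = lines.pop(); // the last piece may be an incomplete line
      for (const line of lines) {
        if (line.trim() !== "") onObject(JSON.parse(line));
      }
    },
    flush() {
      // Handle a final line that arrived without a trailing newline.
      if (buffer.trim() !== "") onObject(JSON.parse(buffer));
      buffer = "";
    },
  };
}

// Express app with a single POST /chat route that relays tokens.
function createApp(ollamaUrl = "http://localhost:11434", model = "gemma3:4b") {
  const express = require("express");
  const app = express();
  app.use(express.static("public")); // serves the HTML front end
  app.use(express.json()); // parses JSON request bodies

  app.post("/chat", async (req, res) => {
    try {
      const upstream = await fetch(`${ollamaUrl}/api/generate`, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ model, prompt: req.body.prompt, stream: true }),
      });
      if (!upstream.ok) return res.status(502).end();

      res.setHeader("Content-Type", "text/plain; charset=utf-8");
      const parser = createNdjsonParser((obj) => {
        if (obj.response) res.write(obj.response);
      });
      const decoder = new TextDecoder();
      for await (const chunk of upstream.body) {
        parser.push(decoder.decode(chunk, { stream: true }));
      }
      parser.flush();
      res.end();
    } catch (err) {
      // If streaming already started, all we can do is close the response.
      if (!res.headersSent) res.status(502);
      res.end();
    }
  });
  return app;
}
```

Starting it would be `createApp().listen(3000)`. The parser is kept separate from the route so the split-line case can be exercised without a running model.
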
The front end is very simple. Text area
for the prompt, a div for the output, a
button.
The script does one thing. Fetch to our
chat endpoint, read the response body as
a stream, and append each piece to the
output div as it arrives. That's the
full client.
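
A sketch of that client script, under the assumption that the backend answers `POST /chat` with a plain-text stream. The names `createChunkAppender` and `sendPrompt` are invented for this example, and the output element is whatever div the page uses.

```javascript
// Returns a function that decodes byte chunks and hands the text on.
// The { stream: true } option keeps multi-byte characters intact when a
// character is split across two chunks.
function createChunkAppender(appendText) {
  const decoder = new TextDecoder();
  return (bytes) => appendText(decoder.decode(bytes, { stream: true }));
}

// Browser side: post the prompt, then append pieces as they arrive.
async function sendPrompt(prompt, outputEl) {
  outputEl.textContent = "";
  const res = await fetch("/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt }),
  });
  const reader = res.body.getReader();
  const append = createChunkAppender((text) => {
    outputEl.textContent += text;
  });
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    append(value);
  }
}
```

The button's click handler would call `sendPrompt` with the text area's value and the output div.
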
Start the server, open the VPS IP on
port 3000. I'll ask it for a short poem
about a tired developer at 2:00 a.m. Hit
send and tokens stream back in real
time.
The full loop. Your code, your server,
your model. No external API in the
picture, no token meter anymore.
From here, you'd add chat history,
system prompts, maybe RAG over your own
documents. Same loop underneath, just
more layers on top.
Before anyone uninstalls the OpenAI SDK,
let's talk about the tradeoffs. Pick the
wrong tool for the job and you'll regret
it.
Self-hosting wins for learning, for
privacy-sensitive work, for side
projects and internal tools, for
predictable costs at low to medium
volume, for offline use, and for pinning
a model version so the vendor can't
change it on you mid-project.
Hosted APIs still win for frontier
capability, for high-throughput
production, for zero ops overhead,
and for the largest models you wouldn't
realistically run yourself.
A 4B open model handles plenty of jobs.
The largest closed models still pull
ahead on the hardest ones. Pick the
right tool for the job you're trying to
do.
Different tools for different jobs. Most
real systems end up using both. Hosted
APIs for the hard reasoning, self-hosted
for high volume traffic or anything
privacy sensitive.
That's it.
Pull a model, run it, expose it on a
port, point an app at it. Once you've
done it once, you can do it for any open
weight model. Llama, Mistral, Qwen,
DeepSeek, whatever drops next week.
If you want, spin up your own. The
coupon LTS10 takes 10% off at
hostinger.com. Links in the description.
Thanks for watching. See you in the next
one.