TubeSum ← Transcribe a video

Run Your Own LLM on a Server - Ollama + Gemma 3

Transcribed Jun 18, 2026 Watch on YouTube ↗
Intermediate 5 min read For: Developers and tech enthusiasts with basic server administration knowledge who want to self-host LLMs.
3.6K
Views
151
Likes
9
Comments
14
Dislikes
4.5%
🔥 High Engagement

AI Summary

This video provides a step-by-step tutorial on self-hosting a large language model (LLM) using Ollama and Gemma 3 on a VPS. It covers the reasons for self-hosting, the distinction between Ollama and models, server selection, setup, API exposure, and building a simple chat application. The tutorial also discusses tradeoffs between self-hosting and hosted APIs.

[0:24]
Reasons for Self-Hosting

Privacy, predictable cost, working on private networks, and version control are the four main reasons to self-host an LLM.

[0:38]
Ollama vs. Models

Ollama is the runtime that loads and runs models; models like Gemma, Llama, Mistral are the neural networks.

[1:08]
Server vs. Laptop

A server with enough RAM (e.g., 8GB for 4B models) and always-on connectivity is recommended over a laptop.

[1:25]
VPS Selection

Hostinger VPS with full root access, NVMe SSD, AMD EPYC processors, and a one-click LLM template is used.

[2:48]
Plan Tiers

KVM 2 (2 vCPUs, 8GB RAM) for 3B/4B models; KVM 4 for 7B models.

[4:44]
Pulling Gemma 3

The template preloads Ollama; we pull Gemma 3 4B instead of the default Llama 3.

[5:55]
API Access

Ollama serves a REST API on port 11434 and an OpenAI-compatible endpoint at /v1/chat/completions.

[6:37]
Security

Add authentication (IP whitelisting or reverse proxy) if exposing the API publicly.

[7:03]
Building a Chat App

A Node.js/Express backend streams tokens from Ollama to a plain HTML frontend in real time.

[9:09]
Tradeoffs

Self-hosting wins for learning, privacy, side projects, and predictable costs; hosted APIs win for frontier models and high throughput.

Clickbait Check

95% Legit

"The title accurately reflects the content: the video delivers a complete tutorial on running an LLM on a server, including setup, API exposure, and building a chat app."

Mentioned in this Video

Tutorial Checklist

1 1:25 Provision a VPS with at least 8GB RAM (e.g., Hostinger KVM 2).
2 3:40 Select the Ollama template during OS installation (Ubuntu + Ollama).
3 4:26 Verify Ollama is installed by running 'ollama --version'.
4 4:44 Pull the desired model: 'ollama pull gemma3:4b'.
5 5:15 Test the model via CLI: 'ollama run gemma3:4b'.
6 5:55 Access the model via API: POST to http://<server-ip>:11434/api/generate with JSON body.
7 7:17 Create a Node.js project: 'mkdir chat-app && cd chat-app && npm init -y && npm install express'.
8 7:22 Create server.js with Express, serving static files and a /chat endpoint that streams from Ollama.
9 8:22 Create a public/index.html with a textarea, button, and output div, and client-side JavaScript to fetch and stream the response.
10 8:39 Start the server: 'node server.js' and access it at http://<server-ip>:3000.

Study Flashcards (9)

What is Ollama?

easy Click to reveal answer

Ollama is the runtime that loads a model into memory and handles requests.

0:38

Name two open-weight models mentioned in the video.

easy Click to reveal answer

Gemma, Llama, Mistral, Qwen.

0:54

What are the four real reasons people self-host LLMs?

medium Click to reveal answer

Privacy, predictable cost, working on private networks, and version control.

0:24

What are the key features of the Hostinger VPS plan used in the tutorial?

medium Click to reveal answer

Full root access, NVMe SSD storage, AMD EPYC processors with KVM virtualization, and a one-click LLM template.

1:49

Which Hostinger KVM plan is recommended for Gemma 3 4B, and which for 7B models?

hard Click to reveal answer

KVM 2 (2 vCPUs, 8GB RAM, 100GB NVMe) for 3B or 4B models; KVM 4 for 7B models.

2:48

What port does Ollama's REST API run on, and what is the OpenAI-compatible endpoint?

medium Click to reveal answer

Ollama serves a REST API on port 11434 and an OpenAI-compatible endpoint at /v1/chat/completions.

5:55

How should you secure Ollama's API if exposed to the public internet?

hard Click to reveal answer

Add authentication by whitelisting IPs in the firewall or putting a reverse proxy with an API key in front.

6:37

When does self-hosting an LLM win over hosted APIs?

medium Click to reveal answer

Self-hosting wins for learning, privacy-sensitive work, side projects, internal tools, predictable costs at low to medium volume, offline use, and pinning a model version.

9:17

When do hosted APIs still win over self-hosting?

medium Click to reveal answer

Hosted APIs win for frontier capability, high-throughput production, zero ops overhead, and the largest models.

9:31

💡 Key Takeaways

🔧

Ollama vs. Models Distinction

Clarifies the common confusion between the runtime (Ollama) and the model (e.g., Gemma), which is foundational for understanding self-hosting.

0:38
📊

Server Requirements for LLMs

Provides specific hardware requirements (RAM, CPU, storage) for running LLMs, which is critical for practical deployment.

1:49
🔧

Ollama's OpenAI-Compatible API

Shows how to use Ollama's API with existing OpenAI SDK code, making migration seamless for developers.

5:55
💡

Tradeoffs: Self-Hosting vs. Hosted APIs

Provides a balanced view of when to self-host versus using hosted APIs, helping viewers make informed decisions.

9:17
⚖️

Hybrid Approach

Suggests using both self-hosted and hosted models for different tasks, which is a practical strategy for real-world systems.

9:51

✂️ Creator Tools: Viral Hooks

AI-generated clip ideas for Shorts based on the transcript

Run Your Own LLM Without Paying Per Token

30s

Addresses the cost and privacy pain points of using hosted APIs, sparking curiosity about self-hosting.

▶ Play Clip

Ollama vs Models: The Key Difference

40s

Clears up a common confusion between runtime and model, essential for anyone new to local LLMs.

▶ Play Clip

Why Your VPS Specs Matter for LLMs

53s

Provides actionable hardware advice, helping viewers avoid costly mistakes when choosing a server.

▶ Play Clip

Self-Hosting vs API: The Real Cost

45s

Delivers a practical cost comparison that helps viewers decide between fixed monthly bills and per-token pricing.

▶ Play Clip

When NOT to Self-Host an LLM

53s

Balances the hype with honest tradeoffs, making the advice credible and shareable among developers.

▶ Play Clip

[00:06] Can you actually run a real LLM

[00:09] yourself? No OpenAI key, no Anthropic

[00:12] key, no per token bill, just a model

[00:14] running on a server you control. Yes,

[00:17] and in the next few minutes we're going

[00:19] to set one up, expose it as an API, and

[00:21] build a small chat app that talks to it.

[00:24] There are four real reasons people do

[00:25] this: privacy, predictable cost, working

[00:28] on private networks, and version

[00:30] control. We'll also talk about when

[00:32] self-hosting is the wrong call, because

[00:34] there are plenty of those cases, and

[00:36] pretending otherwise wastes your time.

[00:38] Quick distinction before we touch

[00:39] anything. These two things get mixed up

[00:41] constantly. Ollama is the runtime. It's

[00:45] the program that loads a model into

[00:46] memory and handles requests. Think of it

[00:49] like a JVM. By itself, it doesn't do

[00:51] anything useful, but it knows how to run

[00:53] things.

[00:54] Gemma, Llama, Mistral, QN, those are the

[00:58] models, the actual neural networks,

[01:00] multi-gigabyte files of weights that

[01:02] produce text. You can swap models

[01:05] anytime. Pull a new one, point Ollama at

[01:07] it done.

[01:08] Could you do this on your laptop? Sure.

[01:11] Should you? Probably not. Laptops sleep,

[01:14] models eat RAM, and you don't want a 4

[01:16] billion parameter model warming your CPU

[01:18] mid-call. So, a server, always on,

[01:21] reachable with enough RAM to hold a

[01:23] model without choking.

[01:25] So, for this tutorial, I'll be using a

[01:27] Hostinger VPS, and I want to walk

[01:30] through why, because the plan choice is

[01:32] important here. This isn't a generic any

[01:34] server works situation. Memory and CPU

[01:37] matter here, more than for a regular web

[01:40] app.

[01:40] This is Hostinger's VPS hosting page,

[01:43] their AI managed VPS line.

[01:46] Before we continue, a few things are

[01:47] worth calling out.

[01:49] Full root access, because Ollama runs as

[01:51] a system service, we need to install it,

[01:54] expose ports, and adjust firewall rules.

[01:57] Locked-down hosts won't let you do that.

[01:59] NVMe SSD storage, that's relevant

[02:02] because model files are big. Gemma 3 4B

[02:05] is roughly 3.3 gigs on disk. A 7B model

[02:08] is around 4 to 5. Loading from a slow

[02:11] disk means slow cold starts. AMD EPYC

[02:14] processors, KVM virtualization. KVM

[02:18] gives you a real virtual machine with

[02:20] allocated CPU and RAM. Stronger

[02:22] isolation than the kind of shared

[02:24] container hosting where someone else's

[02:26] workload can steal your CPU. The

[02:28] one-click LLM template, specifically an

[02:31] Ubuntu image preloaded with Ollama, the

[02:33] Llama 3 model as a default, and Open Web

[02:36] UI as a browser chat interface. We'll

[02:38] use the template, skip the default

[02:40] model, and pull Gemma 3 ourselves. Still

[02:43] saves us 5 minutes of setup. About the

[02:45] plan. There are four KVM tiers. I'm

[02:48] picking KVM 2, two vCPUs, 8 gigs of RAM,

[02:51] 100 GB NVMe. Gemma 3 4B in its 4-bit

[02:55] version uses about 3 to 5 gigs of RAM in

[02:57] practice. The model file is 3.3 gigs on

[03:00] disk plus context as it grows. OS and

[03:03] runtime take another 1 to 2. That leaves

[03:05] a gig or two of headroom on eight.

[03:07] Tight, but it's enough for a 4B model

[03:09] and a small app on top. If you want to

[03:11] step up to 7B models, that's KVM 4

[03:14] territory. 7B in 4-bit lands in the 5 to

[03:17] 7 gig range, which doesn't fit cleanly

[03:20] on eight. So, KVM 2 if you're staying

[03:22] with 3B or 4B, which is where we're

[03:24] going for this build. KVM 4 if you want

[03:27] to experiment with 7B.

[03:29] If you want to follow along, the coupon

[03:31] LTS10 takes 10% off whichever plan you

[03:34] pick. Link is hosting.your.com/lts10.

[03:38] Also in the description.

[03:40] After provisioning, you land in Hpanel.

[03:43] From the VPS section, head into OS and

[03:46] panel, click operating system, pick

[03:49] Ubuntu, then select applications, and

[03:53] pick Ollama, and the template will take

[03:55] care of VPS setup. H panel gives you the

[03:58] VPS overview, free weekly backups,

[04:02] firewall settings, a browser-based

[04:04] terminal, meaning no SSH key fiddling if

[04:07] you don't feel like it. One click and

[04:09] you're inside the box. That icon is

[04:12] Cody, their AI assistant. You can hand

[04:15] it a task in plain English and it'll

[04:17] actually run the VPS commands itself. No

[04:20] copy-paste needed. Worth knowing about,

[04:23] even if we won't use it today.

[04:26] Because the template ships with Ollama,

[04:28] we can just verify it's there. It prints

[04:30] the version, so we're good. If you're on

[04:32] a plain Ubuntu install instead, this

[04:34] one-liner will do the job. Same outcome,

[04:37] 60 seconds longer.

[04:44] First, let's see what's already there.

[04:47] The template came with Llama 3 already

[04:49] pulled.

[04:50] Fine model, but we're going with Gemma 3

[04:53] 4B for this build. Google's open-weight

[04:55] model. Smaller, fast on CPU, fits on our

[04:59] 8-gig box. The first pull is the slow

[05:01] part.

[05:02] About 3.3 gigs of the wire, I've sped

[05:05] this up. After it's local, it's local.

[05:08] You won't pull it again.

[05:10] Ollama list confirms what's installed.

[05:12] Both models sitting there ready to go.

[05:15] To actually talk to it,

[05:17] the first request takes a few seconds,

[05:19] the model has to load into RAM. After

[05:21] that, the next prompts are fast.

[05:23] There it is. The model we just pulled

[05:25] replying from our own server. Nothing

[05:28] external touched.

[05:29] Now, the cost. What's this taking in

[05:32] RAM?

[05:33] With the model loaded, we're using about

[05:35] 4 to 5 gigs of RAM, more than half our

[05:37] 8-gig box. Self-hosting isn't free.

[05:40] Instead of paying per token, you're

[05:42] paying a fixed monthly bill for the

[05:44] server. If your traffic is bursty small,

[05:47] hosted APIs are usually cheaper. If it's

[05:51] steady or you've got privacy

[05:52] constraints, self-hosting wins.

[05:55] Now, we can call it from code. Ollama's

[05:58] been serving a REST API on port 11343

[06:01] the whole time. Nothing else to start,

[06:03] nothing else to configure.

[06:05] We hit the generate endpoint with a

[06:07] simple prompt. Standard JSON, model

[06:09] name, generated text, token counts,

[06:12] timings, the usual fields you'd want.

[06:15] Ollama also exposes an OpenAI compatible

[06:17] endpoint at v1/chat/completions.

[06:21] If you have code using the OpenAI SDK,

[06:24] just point its base URL at your server.

[06:27] The basic chat completions endpoint

[06:29] works. One-line change and you've moved

[06:31] off OpenAI. Not every OpenAI endpoint is

[06:34] mirrored, but the everyday ones are

[06:36] there.

[06:37] If you ever open port 11343 to the

[06:40] public internet, add authentication.

[06:42] Ollama doesn't ship with off by default.

[06:45] The HPanel firewall section is where

[06:47] you'd lock things down. Whitelist your

[06:49] app's IP or put a reverse proxy in front

[06:52] with an API key.

[06:53] For this video, we're keeping everything

[06:55] on the same server. The app and Ollama

[06:57] will talk over localhost, so we don't

[06:59] need to expose the port at all.

[07:03] Time to build something.

[07:05] Keeping it deliberately simple, a chat

[07:07] UI that streams responses from our

[07:08] model. Node and Express on the back,

[07:11] plain HTML on the front, around 80 lines

[07:13] total, and it'll teach you the pattern

[07:15] you can use in anything bigger.

[07:17] Make the folder, init a package, install

[07:20] Express. Done.

[07:22] Now the server. Standard Express setup.

[07:25] We serve a static front end from a

[07:27] public folder and accept JSON bodies.

[07:30] Nothing fancy.

[07:31] One endpoint, post chat. It takes a

[07:34] prompt from the request body, forwards

[07:36] it to Ollama with stream true, and pipes

[07:39] the stream of tokens back to the browser

[07:42] as they arrive.

[07:43] Ollama uses newline delimited JSON for

[07:46] streaming. Each line is one chunk of the

[07:49] response.

[07:50] We buffer the chunks, split on new

[07:53] lines, parse each line, and write the

[07:55] response field back to the browser. The

[07:58] browser sees tokens appear in real time,

[08:00] same feel as ChatGPT.

[08:02] The buffer is required as TCP chunks

[08:05] don't always end on a JSON line. One

[08:08] line can get split across two chunks.

[08:11] Without it, you'd hit JSON.parse errors

[08:13] on partial lines, maybe one request in

[08:16] 15.

[08:17] Failures that look random until you spot

[08:19] the cause.

[08:22] The front end is very simple. Text area

[08:24] for the prompt, a div for the output, a

[08:27] button.

[08:28] The script does one thing. Fetch to our

[08:30] chat endpoint, read the response body as

[08:32] a stream, and append each piece to the

[08:35] output div as it arrives. That's the

[08:37] full client.

[08:39] Start the server, open the VPS IP on

[08:42] port 3000. I'll ask it for a short poem

[08:44] about a tired developer at 2:00 a.m. Hit

[08:46] send and tokens stream back in real

[08:48] time.

[08:49] The full loop. Your code, your server,

[08:52] your model. No external API in the

[08:55] picture, no token meter anymore.

[08:58] From here, you'd add chat history,

[09:01] system prompts, maybe RAG over your own

[09:03] documents. Same loop underneath, just

[09:06] more layers on top.

[09:09] Before anyone uninstalls the OpenAI SDK,

[09:12] let's talk about the tradeoffs. Pick the

[09:14] wrong tool for the job and you'll regret

[09:16] it.

[09:17] Self-hosting wins for learning, for

[09:19] privacy-sensitive work, for side

[09:21] projects and internal tools, for

[09:23] predictable costs at low to medium

[09:24] volume, for offline use, and for pinning

[09:27] a model version so the vendor can't

[09:29] change it on you mid-project.

[09:31] Hosted APIs still win for frontier

[09:33] capability, for high-throughput

[09:35] production, for the zero ops overhead

[09:38] and for the largest models you wouldn't

[09:40] realistically run yourself.

[09:42] A 4B open model handles plenty of jobs.

[09:45] The largest closed models still pull

[09:47] ahead on the hardest ones. Pick the

[09:49] right tool for the job you're trying to

[09:50] do.

[09:51] Different tools for different jobs. Most

[09:54] real systems end up using both. Hosted

[09:56] APIs for the hard reasoning, self-hosted

[09:59] for high volume traffic or anything

[10:01] privacy sensitive.

[10:03] That's it.

[10:05] Pull a model, run it, expose it on a

[10:07] port, point an app at it. Once you've

[10:10] done it once, you can do it for any open

[10:12] weight model. Llama, Mistral, Qwen,

[10:15] DeepSeek, whatever drops next week.

[10:18] If you want, spin up your own. The

[10:19] coupon LTS10 takes 10% off at

[10:22] hostingor.com. Links in the description.

[10:25] Thanks for watching. See you in the next

[10:26] one.

⚡ Saved you time reading this? Transcribe any YouTube video for free — no signup needed.