Run Your Own LLM Without Paying Per Token
This video provides a step-by-step tutorial on self-hosting a large language model (LLM) using Ollama and Gemma 3 on a VPS. It covers the reasons for self-hosting, the distinction between Ollama and models, server selection, setup, API exposure, and building a simple chat application. The tutorial also discusses tradeoffs between self-hosting and hosted APIs.
Privacy, predictable cost, working on private networks, and version control are the four main reasons to self-host an LLM.
Ollama is the runtime that loads and runs models; models like Gemma, Llama, Mistral are the neural networks.
A server with enough RAM (e.g., 8GB for 4B models) and always-on connectivity is recommended over a laptop.
Hostinger VPS with full root access, NVMe SSD, AMD EPYC processors, and a one-click LLM template is used.
KVM 2 (2 vCPUs, 8GB RAM) for 3B/4B models; KVM 4 for 7B models.
The template preloads Ollama; we pull Gemma 3 4B instead of the default Llama 3.
Ollama serves a REST API on port 11434 and an OpenAI-compatible endpoint at /v1/chat/completions.
Add authentication (IP whitelisting or reverse proxy) if exposing the API publicly.
A Node.js/Express backend streams tokens from Ollama to a plain HTML frontend in real time.
Self-hosting wins for learning, privacy, side projects, and predictable costs; hosted APIs win for frontier models and high throughput.
"The title accurately reflects the content: the video delivers a complete tutorial on running an LLM on a server, including setup, API exposure, and building a chat app."
What is Ollama?
Ollama is the runtime that loads a model into memory and handles requests.
0:38
Name two open-weight models mentioned in the video.
Gemma, Llama, Mistral, Qwen.
0:54
What are the four real reasons people self-host LLMs?
Privacy, predictable cost, working on private networks, and version control.
0:24
What are the key features of the Hostinger VPS plan used in the tutorial?
Full root access, NVMe SSD storage, AMD EPYC processors with KVM virtualization, and a one-click LLM template.
1:49
Which Hostinger KVM plan is recommended for Gemma 3 4B, and which for 7B models?
KVM 2 (2 vCPUs, 8GB RAM, 100GB NVMe) for 3B or 4B models; KVM 4 for 7B models.
2:48
What port does Ollama's REST API run on, and what is the OpenAI-compatible endpoint?
Ollama serves a REST API on port 11434 and an OpenAI-compatible endpoint at /v1/chat/completions.
5:55
How should you secure Ollama's API if exposed to the public internet?
Add authentication by whitelisting IPs in the firewall or putting a reverse proxy with an API key in front.
6:37
When does self-hosting an LLM win over hosted APIs?
Self-hosting wins for learning, privacy-sensitive work, side projects, internal tools, predictable costs at low to medium volume, offline use, and pinning a model version.
9:17
When do hosted APIs still win over self-hosting?
Hosted APIs win for frontier capability, high-throughput production, zero ops overhead, and the largest models.
9:31
Ollama vs. Models Distinction
Clarifies the common confusion between the runtime (Ollama) and the model (e.g., Gemma), which is foundational for understanding self-hosting.
0:38
Server Requirements for LLMs
Provides specific hardware requirements (RAM, CPU, storage) for running LLMs, which is critical for practical deployment.
1:49
Ollama's OpenAI-Compatible API
Shows how to use Ollama's API with existing OpenAI SDK code, making migration seamless for developers.
5:55
Tradeoffs: Self-Hosting vs. Hosted APIs
Provides a balanced view of when to self-host versus using hosted APIs, helping viewers make informed decisions.
9:17
Hybrid Approach
Suggests using both self-hosted and hosted models for different tasks, which is a practical strategy for real-world systems.
9:51
[00:06] Can you actually run a real LLM
[00:09] yourself? No OpenAI key, no Anthropic
[00:12] key, no per token bill, just a model
[00:14] running on a server you control. Yes,
[00:17] and in the next few minutes we're going
[00:19] to set one up, expose it as an API, and
[00:21] build a small chat app that talks to it.
[00:24] There are four real reasons people do
[00:25] this: privacy, predictable cost, working
[00:28] on private networks, and version
[00:30] control. We'll also talk about when
[00:32] self-hosting is the wrong call, because
[00:34] there are plenty of those cases, and
[00:36] pretending otherwise wastes your time.
[00:38] Quick distinction before we touch
[00:39] anything. These two things get mixed up
[00:41] constantly. Ollama is the runtime. It's
[00:45] the program that loads a model into
[00:46] memory and handles requests. Think of it
[00:49] like a JVM. By itself, it doesn't do
[00:51] anything useful, but it knows how to run
[00:53] things.
[00:54] Gemma, Llama, Mistral, Qwen, those are the
[00:58] models, the actual neural networks,
[01:00] multi-gigabyte files of weights that
[01:02] produce text. You can swap models
[01:05] anytime. Pull a new one, point Ollama at
[01:07] it, done.
[01:08] Could you do this on your laptop? Sure.
[01:11] Should you? Probably not. Laptops sleep,
[01:14] models eat RAM, and you don't want a 4
[01:16] billion parameter model warming your CPU
[01:18] mid-call. So, a server, always on,
[01:21] reachable with enough RAM to hold a
[01:23] model without choking.
[01:25] So, for this tutorial, I'll be using a
[01:27] Hostinger VPS, and I want to walk
[01:30] through why, because the plan choice is
[01:32] important here. This isn't a generic "any
[01:34] server works" situation. Memory and CPU
[01:37] matter here, more than for a regular web
[01:40] app.
[01:40] This is Hostinger's VPS hosting page,
[01:43] their AI managed VPS line.
[01:46] Before we continue, a few things are
[01:47] worth calling out.
[01:49] Full root access, because Ollama runs as
[01:51] a system service, we need to install it,
[01:54] expose ports, and adjust firewall rules.
[01:57] Locked-down hosts won't let you do that.
[01:59] NVMe SSD storage, that's relevant
[02:02] because model files are big. Gemma 3 4B
[02:05] is roughly 3.3 gigs on disk. A 7B model
[02:08] is around 4 to 5. Loading from a slow
[02:11] disk means slow cold starts. AMD EPYC
[02:14] processors, KVM virtualization. KVM
[02:18] gives you a real virtual machine with
[02:20] allocated CPU and RAM. Stronger
[02:22] isolation than the kind of shared
[02:24] container hosting where someone else's
[02:26] workload can steal your CPU. The
[02:28] one-click LLM template, specifically an
[02:31] Ubuntu image preloaded with Ollama, the
[02:33] Llama 3 model as a default, and Open
[02:36] WebUI as a browser chat interface. We'll
[02:38] use the template, skip the default
[02:40] model, and pull Gemma 3 ourselves. Still
[02:43] saves us 5 minutes of setup. About the
[02:45] plan. There are four KVM tiers. I'm
[02:48] picking KVM 2, two vCPUs, 8 gigs of RAM,
[02:51] 100 GB NVMe. Gemma 3 4B in its 4-bit
[02:55] version uses about 3 to 5 gigs of RAM in
[02:57] practice. The model file is 3.3 gigs on
[03:00] disk plus context as it grows. OS and
[03:03] runtime take another 1 to 2. That leaves
[03:05] a gig or two of headroom on eight.
[03:07] Tight, but it's enough for a 4B model
[03:09] and a small app on top. If you want to
[03:11] step up to 7B models, that's KVM 4
[03:14] territory. 7B in 4-bit lands in the 5 to
[03:17] 7 gig range, which doesn't fit cleanly
[03:20] on eight. So, KVM 2 if you're staying
[03:22] with 3B or 4B, which is where we're
[03:24] going for this build. KVM 4 if you want
[03:27] to experiment with 7B.
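The memory budget above is plain addition, so it can be sketched as a quick check. The figures are the upper bounds quoted in the narration; `fitsInRam` is a hypothetical helper name, and real usage will vary with context length and quantization.

```javascript
// Rough RAM budget check using the upper-bound figures quoted in the narration.
// fitsInRam is a hypothetical helper, not part of Ollama or any hosting tool.
function fitsInRam({ totalGb, modelGb, osAndRuntimeGb }) {
  const usedGb = modelGb + osAndRuntimeGb;
  return { usedGb, headroomGb: totalGb - usedGb, fits: usedGb <= totalGb };
}

// Gemma 3 4B (4-bit): up to about 5 GB in use, plus up to about 2 GB for OS and runtime.
const gemma4b = fitsInRam({ totalGb: 8, modelGb: 5, osAndRuntimeGb: 2 });

// A 7B model in 4-bit: up to about 7 GB, same overhead, on the same 8 GB plan.
const sevenB = fitsInRam({ totalGb: 8, modelGb: 7, osAndRuntimeGb: 2 });

console.log(gemma4b); // { usedGb: 7, headroomGb: 1, fits: true }
console.log(sevenB);  // { usedGb: 9, headroomGb: -1, fits: false }
```

With worst-case numbers the 4B model leaves about a gig free on 8 GB, while a 7B model overshoots, which is why the larger plan is suggested for it.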
[03:29] If you want to follow along, the coupon
[03:31] LTS10 takes 10% off whichever plan you
[03:34] pick. Link is hostinger.com/lts10.
[03:38] Also in the description.
[03:40] After provisioning, you land in hPanel.
[03:43] From the VPS section, head into OS and
[03:46] panel, click operating system, pick
[03:49] Ubuntu, then select applications, and
[03:53] pick Ollama, and the template will take
[03:55] care of VPS setup. hPanel gives you the
[03:58] VPS overview, free weekly backups,
[04:02] firewall settings, a browser-based
[04:04] terminal, meaning no SSH key fiddling if
[04:07] you don't feel like it. One click and
[04:09] you're inside the box. That icon is
[04:12] Kodee, their AI assistant. You can hand
[04:15] it a task in plain English and it'll
[04:17] actually run the VPS commands itself. No
[04:20] copy-paste needed. Worth knowing about,
[04:23] even if we won't use it today.
[04:26] Because the template ships with Ollama,
[04:28] we can just verify it's there. It prints
[04:30] the version, so we're good. If you're on
[04:32] a plain Ubuntu install instead, this
[04:34] one-liner will do the job. Same outcome,
[04:37] 60 seconds longer.
[04:44] First, let's see what's already there.
[04:47] The template came with Llama 3 already
[04:49] pulled.
[04:50] Fine model, but we're going with Gemma 3
[04:53] 4B for this build. Google's open-weight
[04:55] model. Smaller, fast on CPU, fits on our
[04:59] 8-gig box. The first pull is the slow
[05:01] part.
[05:02] About 3.3 gigs over the wire. I've sped
[05:05] this up. After it's local, it's local.
[05:08] You won't pull it again.
[05:10] Ollama list confirms what's installed.
[05:12] Both models sitting there ready to go.
[05:15] To actually talk to it,
[05:17] the first request takes a few seconds,
[05:19] the model has to load into RAM. After
[05:21] that, the next prompts are fast.
[05:23] There it is. The model we just pulled
[05:25] replying from our own server. Nothing
[05:28] external touched.
[05:29] Now, the cost. What's this taking in
[05:32] RAM?
[05:33] With the model loaded, we're using about
[05:35] 4 to 5 gigs of RAM, more than half our
[05:37] 8-gig box. Self-hosting isn't free.
[05:40] Instead of paying per token, you're
[05:42] paying a fixed monthly bill for the
[05:44] server. If your traffic is bursty small,
[05:47] hosted APIs are usually cheaper. If it's
[05:51] steady or you've got privacy
[05:52] constraints, self-hosting wins.
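The fixed-bill versus per-token tradeoff comes down to a break-even volume, sketched below. Both prices are made-up placeholders rather than quotes from any provider, and the function names are hypothetical. The sketch also ignores whether a small server could actually sustain the throughput.

```javascript
// Break-even volume between a flat monthly server bill and per-token pricing.
// Both prices below are made-up placeholders, not quotes from any provider.
function breakEvenTokensPerMonth(serverCostPerMonth, apiPricePerMillionTokens) {
  return (serverCostPerMonth / apiPricePerMillionTokens) * 1_000_000;
}

function cheaperOption(tokensPerMonth, serverCostPerMonth, apiPricePerMillionTokens) {
  const apiCost = (tokensPerMonth / 1_000_000) * apiPricePerMillionTokens;
  return apiCost < serverCostPerMonth ? "hosted API" : "self-hosted";
}

const SERVER_COST = 10; // hypothetical flat monthly bill
const API_PRICE = 0.5;  // hypothetical price per million tokens

console.log(breakEvenTokensPerMonth(SERVER_COST, API_PRICE)); // 20000000
console.log(cheaperOption(2_000_000, SERVER_COST, API_PRICE));  // hosted API
console.log(cheaperOption(50_000_000, SERVER_COST, API_PRICE)); // self-hosted
```

Below the break-even volume the hosted API costs less; above it, the flat bill does.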
[05:55] Now, we can call it from code. Ollama's
[05:58] been serving a REST API on port 11434
[06:01] the whole time. Nothing else to start,
[06:03] nothing else to configure.
[06:05] We hit the generate endpoint with a
[06:07] simple prompt. Standard JSON, model
[06:09] name, generated text, token counts,
[06:12] timings, the usual fields you'd want.
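A call to the generate endpoint might look like the sketch below. It assumes Ollama is listening on its default port on the same machine and that the model tag is `gemma3:4b`; the helper names are hypothetical. The reply field names follow Ollama's documented non-streaming response, so check them against the version you run.

```javascript
// Assumptions: Ollama on its default port on this machine; "gemma3:4b" as the model tag.
const OLLAMA_URL = "http://localhost:11434";

// Build the HTTP request for a single, non-streaming completion.
function buildGenerateRequest(prompt, model = "gemma3:4b") {
  return {
    url: `${OLLAMA_URL}/api/generate`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      // stream: false asks for one JSON object instead of a stream of chunks.
      body: JSON.stringify({ model, prompt, stream: false }),
    },
  };
}

// Pick out the fields mentioned in the narration. Ollama reports durations in nanoseconds.
function summarize(reply) {
  return {
    model: reply.model,
    text: reply.response,
    outputTokens: reply.eval_count,
    tokensPerSecond: reply.eval_count / (reply.eval_duration / 1e9),
  };
}

async function generate(prompt) {
  const { url, init } = buildGenerateRequest(prompt);
  const res = await fetch(url, init); // global fetch needs Node 18+
  if (!res.ok) throw new Error(`Ollama returned HTTP ${res.status}`);
  return summarize(await res.json());
}

// Usage (needs a running Ollama server):
// generate("Why is the sky blue?").then(console.log);
```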
[06:15] Ollama also exposes an OpenAI-compatible
[06:17] endpoint at /v1/chat/completions.
[06:21] If you have code using the OpenAI SDK,
[06:24] just point its base URL at your server.
[06:27] The basic chat completions endpoint
[06:29] works. One-line change and you've moved
[06:31] off OpenAI. Not every OpenAI endpoint is
[06:34] mirrored, but the everyday ones are
[06:36] there.
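To show what the OpenAI-compatible route accepts, here is a minimal sketch using plain `fetch` rather than the OpenAI SDK. The host, port, and `gemma3:4b` tag are assumptions, and the helper names are hypothetical. With the official OpenAI Node SDK, the equivalent change should be the `baseURL` option on the client constructor.

```javascript
// Assumptions: Ollama on localhost at its default port; "gemma3:4b" as the model tag.
const BASE_URL = "http://localhost:11434/v1";

// Build an OpenAI-style chat completions request aimed at the local server.
function buildChatRequest(messages, model = "gemma3:4b") {
  return {
    url: `${BASE_URL}/chat/completions`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ model, messages }),
    },
  };
}

async function chat(userText) {
  const { url, init } = buildChatRequest([{ role: "user", content: userText }]);
  const res = await fetch(url, init); // global fetch needs Node 18+
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  const data = await res.json();
  // OpenAI-style response: the reply text is in choices[0].message.content.
  return data.choices[0].message.content;
}

// Usage (needs a running Ollama server):
// chat("Say hello in five words.").then(console.log);
```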
[06:37] If you ever open port 11434 to the
[06:40] public internet, add authentication.
[06:42] Ollama doesn't ship with auth by default.
[06:45] The hPanel firewall section is where
[06:47] you'd lock things down. Whitelist your
[06:49] app's IP or put a reverse proxy in front
[06:52] with an API key.
[06:53] For this video, we're keeping everything
[06:55] on the same server. The app and Ollama
[06:57] will talk over localhost, so we don't
[06:59] need to expose the port at all.
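If the port does get exposed, the API-key idea can be sketched as a small check that a proxy or the app runs before forwarding a request to Ollama. This is an illustration under assumptions, not a feature of Ollama: `isAuthorized` is a hypothetical helper, and the `Bearer` scheme is one common convention.

```javascript
// Hypothetical gatekeeper to run before forwarding a request to Ollama.
// headers is expected in Node's form, where incoming header names are lowercased.
function isAuthorized(headers, expectedKey) {
  const header = headers["authorization"] || "";
  const [scheme, token] = header.split(" ");
  // Reject when no key is configured, so an empty key never matches an empty token.
  return scheme === "Bearer" && Boolean(expectedKey) && token === expectedKey;
}

console.log(isAuthorized({ authorization: "Bearer s3cret" }, "s3cret")); // true
console.log(isAuthorized({ authorization: "Bearer wrong" }, "s3cret"));  // false
console.log(isAuthorized({}, "s3cret"));                                 // false
```

The plain `===` comparison is not constant-time; a production setup would more likely rely on the reverse proxy's own auth features or a timing-safe comparison.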
[07:03] Time to build something.
[07:05] Keeping it deliberately simple, a chat
[07:07] UI that streams responses from our
[07:08] model. Node and Express on the back,
[07:11] plain HTML on the front, around 80 lines
[07:13] total, and it'll teach you the pattern
[07:15] you can use in anything bigger.
[07:17] Make the folder, init a package, install
[07:20] Express. Done.
[07:22] Now the server. Standard Express setup.
[07:25] We serve a static front end from a
[07:27] public folder and accept JSON bodies.
[07:30] Nothing fancy.
[07:31] One endpoint, POST /chat. It takes a
[07:34] prompt from the request body, forwards
[07:36] it to Ollama with stream: true, and pipes
[07:39] the stream of tokens back to the browser
[07:42] as they arrive.
[07:43] Ollama uses newline delimited JSON for
[07:46] streaming. Each line is one chunk of the
[07:49] response.
[07:50] We buffer the chunks, split on
[07:53] newlines, parse each line, and write the
[07:55] response field back to the browser. The
[07:58] browser sees tokens appear in real time,
[08:00] same feel as ChatGPT.
[08:02] The buffer is required as TCP chunks
[08:05] don't always end on a JSON line. One
[08:08] line can get split across two chunks.
[08:11] Without it, you'd hit JSON.parse errors
[08:13] on partial lines, maybe one request in
[08:16] 15.
[08:17] Failures that look random until you spot
[08:19] the cause.
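The buffering step can be sketched as a small line parser that keeps any trailing partial line until the next chunk arrives. `createNdjsonParser` is a hypothetical helper name, and the `response` field mirrors what the narration says is written back to the browser.

```javascript
// Line buffer for newline-delimited JSON. createNdjsonParser is a hypothetical helper name.
function createNdjsonParser(onObject) {
  let buffer = "";
  return function push(text) {
    buffer += text;
    const lines = buffer.split("\n");
    // The last element is "" if the chunk ended on a newline, otherwise a partial
    // line. Either way, hold it back until more data arrives.
    buffer = lines.pop();
    for (const line of lines) {
      if (line.trim() === "") continue; // skip blank lines
      onObject(JSON.parse(line));
    }
  };
}

// Simulate one JSON line split across two network chunks.
const pieces = [];
const push = createNdjsonParser((obj) => pieces.push(obj.response));
push('{"response":"Hel');
push('lo"}\n{"response":" world"}\n');

console.log(pieces.join("")); // Hello world
```

Calling `JSON.parse` directly on the first chunk would throw, which is the intermittent failure described above.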
[08:22] The front end is very simple. Text area
[08:24] for the prompt, a div for the output, a
[08:27] button.
[08:28] The script does one thing. Fetch to our
[08:30] chat endpoint, read the response body as
[08:32] a stream, and append each piece to the
[08:35] output div as it arrives. That's the
[08:37] full client.
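A client along these lines could look like the sketch below. It assumes the backend exposes POST /chat and accepts `{ prompt }` as JSON; the function names and element ids are hypothetical.

```javascript
// Decode one chunk of bytes and append it to the output element.
// { stream: true } lets the decoder hold back an incomplete multi-byte character
// when a chunk boundary falls in the middle of it.
function appendChunk(decoder, outputEl, bytes) {
  outputEl.textContent += decoder.decode(bytes, { stream: true });
}

// Assumes the backend exposes POST /chat and accepts { prompt } as JSON.
async function streamChat(prompt, outputEl) {
  const res = await fetch("/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt }),
  });
  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  outputEl.textContent = "";
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    appendChunk(decoder, outputEl, value);
  }
}

// Browser usage, with hypothetical element ids:
// streamChat(document.querySelector("#prompt").value, document.querySelector("#output"));
```

Reusing one `TextDecoder` for the whole response matters for non-ASCII output, since a character's bytes can arrive in separate chunks.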
[08:39] Start the server, open the VPS IP on
[08:42] port 3000. I'll ask it for a short poem
[08:44] about a tired developer at 2:00 a.m. Hit
[08:46] send and tokens stream back in real
[08:48] time.
[08:49] The full loop. Your code, your server,
[08:52] your model. No external API in the
[08:55] picture, no token meter anymore.
[08:58] From here, you'd add chat history,
[09:01] system prompts, maybe RAG over your own
[09:03] documents. Same loop underneath, just
[09:06] more layers on top.
[09:09] Before anyone uninstalls the OpenAI SDK,
[09:12] let's talk about the tradeoffs. Pick the
[09:14] wrong tool for the job and you'll regret
[09:16] it.
[09:17] Self-hosting wins for learning, for
[09:19] privacy-sensitive work, for side
[09:21] projects and internal tools, for
[09:23] predictable costs at low to medium
[09:24] volume, for offline use, and for pinning
[09:27] a model version so the vendor can't
[09:29] change it on you mid-project.
[09:31] Hosted APIs still win for frontier
[09:33] capability, for high-throughput
[09:35] production, for zero ops overhead,
[09:38] and for the largest models you wouldn't
[09:40] realistically run yourself.
[09:42] A 4B open model handles plenty of jobs.
[09:45] The largest closed models still pull
[09:47] ahead on the hardest ones. Pick the
[09:49] right tool for the job you're trying to
[09:50] do.
[09:51] Different tools for different jobs. Most
[09:54] real systems end up using both. Hosted
[09:56] APIs for the hard reasoning, self-hosted
[09:59] for high-volume traffic or anything
[10:01] privacy-sensitive.
[10:03] That's it.
[10:05] Pull a model, run it, expose it on a
[10:07] port, point an app at it. Once you've
[10:10] done it once, you can do it for any open
[10:12] weight model. Llama, Mistral, Qwen,
[10:15] DeepSeek, whatever drops next week.
[10:18] If you want, spin up your own. The
[10:19] coupon LTS10 takes 10% off at
[10:22] hostinger.com. Links in the description.
[10:25] Thanks for watching. See you in the next
[10:26] one.