[0:06] Can you actually run a real LLM
[0:09] yourself? No OpenAI key, no Anthropic
[0:12] key, no per-token bill, just a model
[0:14] running on a server you control. Yes,
[0:17] and in the next few minutes we're going
[0:19] to set one up, expose it as an API, and
[0:21] build a small chat app that talks to it.
[0:24] There are four real reasons people do
[0:25] this: privacy, predictable cost, working
[0:28] on private networks, and version
[0:30] control. We'll also talk about when
[0:32] self-hosting is the wrong call, because
[0:34] there are plenty of those cases, and
[0:36] pretending otherwise wastes your time.
[0:38] Quick distinction before we touch
[0:39] anything. These two things get mixed up
[0:41] constantly. Ollama is the runtime. It's
[0:45] the program that loads a model into
[0:46] memory and handles requests. Think of it
[0:49] like a JVM. By itself, it doesn't do
[0:51] anything useful, but it knows how to run
[0:53] things.
[0:54] Gemma, Llama, Mistral, Qwen, those are the
[0:58] models, the actual neural networks,
[1:00] multi-gigabyte files of weights that
[1:02] produce text. You can swap models
[1:05] anytime. Pull a new one, point Ollama at
[1:07] it, done.
[1:08] Could you do this on your laptop? Sure.
[1:11] Should you? Probably not. Laptops sleep,
[1:14] models eat RAM, and you don't want a
[1:16] 4-billion-parameter model warming your CPU
[1:18] mid-call. So, a server: always on,
[1:21] reachable, with enough RAM to hold a
[1:23] model without choking.
[1:25] So, for this tutorial, I'll be using a
[1:27] Hostinger VPS, and I want to walk
[1:30] through why, because the plan choice is
[1:32] important here. This isn't a generic "any
[1:34] server works" situation. Memory and CPU
[1:37] matter here, more than for a regular web
[1:40] app.
[1:40] This is Hostinger's VPS hosting page,
[1:43] their AI managed VPS line.
[1:46] Before we continue, a few things are
[1:47] worth calling out.
[1:49] Full root access. Because Ollama runs as
[1:51] a system service, we need to install it,
[1:54] expose ports, and adjust firewall rules.
[1:57] Locked-down hosts won't let you do that.
[1:59] NVMe SSD storage. That's relevant
[2:02] because model files are big. Gemma 3 4B
[2:05] is roughly 3.3 gigs on disk. A 7B model
[2:08] is around 4 to 5. Loading from a slow
[2:11] disk means slow cold starts. AMD EPYC
[2:14] processors, KVM virtualization. KVM
[2:18] gives you a real virtual machine with
[2:20] allocated CPU and RAM. Stronger
[2:22] isolation than the kind of shared
[2:24] container hosting where someone else's
[2:26] workload can steal your CPU. The
[2:28] one-click LLM template, specifically an
[2:31] Ubuntu image preloaded with Ollama, the
[2:33] Llama 3 model as a default, and Open
[2:36] WebUI as a browser chat interface. We'll
[2:38] use the template, skip the default
[2:40] model, and pull Gemma 3 ourselves. Still
[2:43] saves us 5 minutes of setup. About the
[2:45] plan. There are four KVM tiers. I'm
[2:48] picking KVM 2, two vCPUs, 8 gigs of RAM,
[2:51] 100 GB NVMe. Gemma 3 4B in its 4-bit
[2:55] version uses about 3 to 5 gigs of RAM in
[2:57] practice: the 3.3-gig model file plus
[3:00] context as it grows. OS and
[3:03] runtime take another 1 to 2. That leaves
[3:05] a gig or two of headroom on eight.
[3:07] Tight, but it's enough for a 4B model
[3:09] and a small app on top. If you want to
[3:11] step up to 7B models, that's KVM 4
[3:14] territory. 7B in 4-bit lands in the 5 to
[3:17] 7 gig range, which doesn't fit cleanly
[3:20] on eight. So, KVM 2 if you're staying
[3:22] with 3B or 4B, which is where we're
[3:24] going for this build. KVM 4 if you want
[3:27] to experiment with 7B.
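The sizing argument above is plain subtraction, and it can be sketched as a few lines of JavaScript. This is a rough estimate only: the ranges are the ones quoted in the walkthrough, and the function name `headroomRange` is made up for illustration.

```javascript
// Rough RAM budget using the ranges quoted in the walkthrough (all in GB).
// headroomRange is an illustrative helper, not part of Ollama or any tool.
function headroomRange(totalGb, modelGb, systemGb) {
  const [modelMin, modelMax] = modelGb;
  const [sysMin, sysMax] = systemGb;
  return {
    worst: totalGb - modelMax - sysMax, // everything at the high end
    best: totalGb - modelMin - sysMin,  // everything at the low end
  };
}

const system = [1, 2];                            // OS + runtime
const gemma4b = headroomRange(8, [3, 5], system); // 4B, 4-bit, on an 8 GB plan
const sevenB = headroomRange(8, [5, 7], system);  // 7B, 4-bit, on an 8 GB plan

console.log(gemma4b); // { worst: 1, best: 4 }
console.log(sevenB);  // { worst: -1, best: 2 }
```

A negative worst case is what "doesn't fit cleanly" means in practice: at the high end of both ranges, a 7B model and the OS together need more than the 8 gigs available.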
[3:29] If you want to follow along, the coupon
[3:31] LTS10 takes 10% off whichever plan you
[3:34] pick. Link is hostinger.com/lts10.
[3:38] Also in the description.
[3:40] After provisioning, you land in hPanel.
[3:43] From the VPS section, head into OS and
[3:46] panel, click operating system, pick
[3:49] Ubuntu, then select applications, and
[3:53] pick Ollama, and the template will take
[3:55] care of VPS setup. hPanel gives you the
[3:58] VPS overview, free weekly backups,
[4:02] firewall settings, a browser-based
[4:04] terminal, meaning no SSH key fiddling if
[4:07] you don't feel like it. One click and
[4:09] you're inside the box. That icon is
[4:12] Kodee, their AI assistant. You can hand
[4:15] it a task in plain English and it'll
[4:17] actually run the VPS commands itself. No
[4:20] copy-paste needed. Worth knowing about,
[4:23] even if we won't use it today.
[4:26] Because the template ships with Ollama,
[4:28] we can just verify it's there. It prints
[4:30] the version, so we're good. If you're on
[4:32] a plain Ubuntu install instead, this
[4:34] one-liner will do the job. Same outcome,
[4:37] 60 seconds longer.
[4:44] First, let's see what's already there.
[4:47] The template came with Llama 3 already
[4:49] pulled.
[4:50] Fine model, but we're going with Gemma 3
[4:53] 4B for this build. Google's open-weight
[4:55] model. Smaller, fast on CPU, fits on our
[4:59] 8-gig box. The first pull is the slow
[5:01] part.
[5:02] About 3.3 gigs over the wire. I've sped
[5:05] this up. After it's local, it's local.
[5:08] You won't pull it again.
[5:10] Ollama list confirms what's installed.
[5:12] Both models sitting there ready to go.
[5:15] Now let's actually talk to it.
[5:17] The first request takes a few seconds because
[5:19] the model has to load into RAM. After
[5:21] that, the next prompts are fast.
[5:23] There it is. The model we just pulled
[5:25] replying from our own server. Nothing
[5:28] external touched.
[5:29] Now, the cost. What's this taking in
[5:32] RAM?
[5:33] With the model loaded, we're using about
[5:35] 4 to 5 gigs of RAM, more than half our
[5:37] 8-gig box. Self-hosting isn't free.
[5:40] Instead of paying per token, you're
[5:42] paying a fixed monthly bill for the
[5:44] server. If your traffic is bursty or small,
[5:47] hosted APIs are usually cheaper. If it's
[5:51] steady or you've got privacy
[5:52] constraints, self-hosting wins.
[5:55] Now, we can call it from code. Ollama's
[5:58] been serving a REST API on port 11434
[6:01] the whole time. Nothing else to start,
[6:03] nothing else to configure.
[6:05] We hit the generate endpoint with a
[6:07] simple prompt. Standard JSON, model
[6:09] name, generated text, token counts,
[6:12] timings, the usual fields you'd want.
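A call to the generate endpoint might look roughly like the sketch below. It assumes Node 18 or newer (for the built-in `fetch`), Ollama on its default port 11434, and the model tag `gemma3:4b`; the tag and the helper names are assumptions, so substitute whatever `ollama list` shows on your box.

```javascript
// Assumptions: Ollama listens on its default port, and the model tag is "gemma3:4b".
const OLLAMA_URL = "http://localhost:11434";

// Build the fetch arguments for a single non-streaming completion.
function buildGenerateRequest(prompt, model = "gemma3:4b") {
  return {
    url: `${OLLAMA_URL}/api/generate`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ model, prompt, stream: false }),
    },
  };
}

// Send the request and keep the fields mentioned above:
// model name, generated text, token counts, and timing.
async function generate(prompt) {
  const { url, init } = buildGenerateRequest(prompt);
  const res = await fetch(url, init);
  if (!res.ok) throw new Error(`Ollama returned HTTP ${res.status}`);
  const data = await res.json();
  return {
    model: data.model,
    text: data.response,
    promptTokens: data.prompt_eval_count,
    outputTokens: data.eval_count,
    totalNanoseconds: data.total_duration,
  };
}
```

On the server itself, `generate("Why is the sky blue?").then(console.log)` would print the trimmed-down result. Nothing runs until you call it, so the snippet is safe to paste into a file first.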
[6:15] Ollama also exposes an OpenAI-compatible
[6:17] endpoint at /v1/chat/completions.
[6:21] If you have code using the OpenAI SDK,
[6:24] just point its base URL at your server.
[6:27] The basic chat completions endpoint
[6:29] works. One-line change and you've moved
[6:31] off OpenAI. Not every OpenAI endpoint is
[6:34] mirrored, but the everyday ones are
[6:36] there.
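To make the "one-line change" concrete, here is a sketch that talks to the compatible endpoint with plain `fetch` instead of the SDK, so it has no dependencies. The base URL is the only part that is specific to the self-hosted setup. The model tag `gemma3:4b`, the helper names, and the placeholder bearer token are assumptions for illustration.

```javascript
// Build an OpenAI-style chat completion request against any compatible base URL.
function buildChatCompletionRequest(baseUrl, model, userMessage) {
  return {
    url: `${baseUrl}/chat/completions`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        // Placeholder value: OpenAI-style clients expect a key to be present.
        Authorization: "Bearer ollama",
      },
      body: JSON.stringify({
        model,
        messages: [{ role: "user", content: userMessage }],
      }),
    },
  };
}

// Send the request and return the assistant's reply text.
async function chat(baseUrl, model, userMessage) {
  const { url, init } = buildChatCompletionRequest(baseUrl, model, userMessage);
  const res = await fetch(url, init);
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  const data = await res.json();
  return data.choices[0].message.content;
}

// Pointing at the local server is just a different base URL.
const localRequest = buildChatCompletionRequest(
  "http://localhost:11434/v1",
  "gemma3:4b",
  "Say hi in five words."
);
console.log(localRequest.url); // http://localhost:11434/v1/chat/completions
```

With the official OpenAI Node SDK, the equivalent move is passing a `baseURL` option when constructing the client; the request and response shapes stay the same.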
[6:37] If you ever open port 11434 to the
[6:40] public internet, add authentication.
[6:42] Ollama doesn't ship with auth by default.
[6:45] The hPanel firewall section is where
[6:47] you'd lock things down. Whitelist your
[6:49] app's IP or put a reverse proxy in front
[6:52] with an API key.
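The API-key idea can be sketched as a small check that a proxy would run before forwarding anything to Ollama. This is an illustration of the pattern only, not a hardened implementation: the function names and the `Bearer` header convention are assumptions, and a real deployment would more likely rely on the reverse proxy's own auth features.

```javascript
// Compare two strings without returning at the first mismatching character.
function safeEqual(a, b) {
  if (a.length !== b.length) return false;
  let diff = 0;
  for (let i = 0; i < a.length; i++) {
    diff |= a.charCodeAt(i) ^ b.charCodeAt(i);
  }
  return diff === 0;
}

// Accept only "Authorization: Bearer <key>" carrying the expected key.
function isAuthorized(authorizationHeader, expectedKey) {
  if (typeof authorizationHeader !== "string") return false;
  const prefix = "Bearer ";
  if (!authorizationHeader.startsWith(prefix)) return false;
  return safeEqual(authorizationHeader.slice(prefix.length), expectedKey);
}

console.log(isAuthorized("Bearer s3cret", "s3cret")); // true
console.log(isAuthorized("Bearer wrong", "s3cret"));  // false
console.log(isAuthorized(undefined, "s3cret"));       // false
```

A proxy would answer 401 whenever `isAuthorized` returns false and pass the request through to localhost:11434 otherwise. In Node, `crypto.timingSafeEqual` is the sturdier choice for the comparison itself.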
[6:53] For this video, we're keeping everything
[6:55] on the same server. The app and Ollama
[6:57] will talk over localhost, so we don't
[6:59] need to expose the port at all.
[7:03] Time to build something.
[7:05] Keeping it deliberately simple, a chat
[7:07] UI that streams responses from our
[7:08] model. Node and Express on the back,
[7:11] plain HTML on the front, around 80 lines
[7:13] total, and it'll teach you the pattern
[7:15] you can use in anything bigger.
[7:17] Make the folder, init a package, install
[7:20] Express. Done.
[7:22] Now the server. Standard Express setup.
[7:25] We serve a static front end from a
[7:27] public folder and accept JSON bodies.
[7:30] Nothing fancy.
[7:31] One endpoint, POST /chat. It takes a
[7:34] prompt from the request body, forwards
[7:36] it to Ollama with stream set to true, and pipes
[7:39] the stream of tokens back to the browser
[7:42] as they arrive.
[7:43] Ollama uses newline-delimited JSON for
[7:46] streaming. Each line is one chunk of the
[7:49] response.
[7:50] We buffer the chunks, split on
[7:53] newlines, parse each line, and write the
[7:55] response field back to the browser. The
[7:58] browser sees tokens appear in real time,
[8:00] same feel as ChatGPT.
[8:02] The buffer is required because TCP chunks
[8:05] don't always end on a line boundary. One
[8:08] JSON line can get split across two chunks.
[8:11] Without it, you'd hit JSON.parse errors
[8:13] on partial lines, maybe one request in
[8:16] 15.
[8:17] Failures that look random until you spot
[8:19] the cause.
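The buffering step is small enough to show on its own. The sketch below is not the code from the video: `createNdjsonBuffer` is an illustrative name, and the two `push` calls simulate one JSON line arriving split across two chunks.

```javascript
// Accumulate text chunks and emit one parsed object per complete line.
function createNdjsonBuffer(onObject) {
  let buffer = "";
  return {
    push(text) {
      buffer += text;
      const lines = buffer.split("\n");
      buffer = lines.pop(); // last piece is an unfinished line (or "")
      for (const line of lines) {
        if (line.trim() !== "") onObject(JSON.parse(line));
      }
    },
    // Call once the stream ends, in case the final line had no trailing newline.
    flush() {
      if (buffer.trim() !== "") onObject(JSON.parse(buffer));
      buffer = "";
    },
  };
}

// Simulate two network chunks where the second JSON line is cut in half.
const pieces = [];
const parser = createNdjsonBuffer((obj) => pieces.push(obj.response));
parser.push('{"response":"Hel"}\n{"respo');
parser.push('nse":"lo"}\n');
parser.flush();

console.log(pieces.join("")); // Hello
```

Calling `JSON.parse` directly on each incoming chunk would throw on the first `push` here, which is exactly the intermittent failure described above. In the Express handler, `onObject` would be where the `response` field gets written back to the browser.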
[8:22] The front end is very simple. Text area
[8:24] for the prompt, a div for the output, a
[8:27] button.
[8:28] The script does one thing. Fetch to our
[8:30] chat endpoint, read the response body as
[8:32] a stream, and append each piece to the
[8:35] output div as it arrives. That's the
[8:37] full client.
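The client-side read loop can be sketched as follows. It uses only standard web APIs (`fetch`, `ReadableStream`, `TextDecoder`), so the same function runs in a browser and in Node 18 or newer. The `/chat` route matches the walkthrough; the element id `output` and the function names are assumptions.

```javascript
// Read a fetch Response body chunk by chunk and hand each decoded piece to onText.
async function readTextStream(response, onText) {
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // stream: true keeps multi-byte characters intact across chunk boundaries.
    onText(decoder.decode(value, { stream: true }));
  }
  const tail = decoder.decode(); // flush anything the decoder still holds
  if (tail) onText(tail);
}

// Hypothetical browser wiring (the element id "output" is an assumption):
// async function send(prompt) {
//   const output = document.getElementById("output");
//   const res = await fetch("/chat", {
//     method: "POST",
//     headers: { "Content-Type": "application/json" },
//     body: JSON.stringify({ prompt }),
//   });
//   await readTextStream(res, (text) => { output.textContent += text; });
// }
```

Appending to `textContent` on every chunk is what produces the token-by-token effect; nothing waits for the full response.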
[8:39] Start the server, open the VPS IP on
[8:42] port 3000. I'll ask it for a short poem
[8:44] about a tired developer at 2:00 a.m. Hit
[8:46] send and tokens stream back in real
[8:48] time.
[8:49] The full loop. Your code, your server,
[8:52] your model. No external API in the
[8:55] picture, no token meter anymore.
[8:58] From here, you'd add chat history,
[9:01] system prompts, maybe RAG over your own
[9:03] documents. Same loop underneath, just
[9:06] more layers on top.
[9:09] Before anyone uninstalls the OpenAI SDK,
[9:12] let's talk about the tradeoffs. Pick the
[9:14] wrong tool for the job and you'll regret
[9:16] it.
[9:17] Self-hosting wins for learning, for
[9:19] privacy-sensitive work, for side
[9:21] projects and internal tools, for
[9:23] predictable costs at low to medium
[9:24] volume, for offline use, and for pinning
[9:27] a model version so the vendor can't
[9:29] change it on you mid-project.
[9:31] Hosted APIs still win for frontier
[9:33] capability, for high-throughput
[9:35] production, for zero ops overhead,
[9:38] and for the largest models you wouldn't
[9:40] realistically run yourself.
[9:42] A 4B open model handles plenty of jobs.
[9:45] The largest closed models still pull
[9:47] ahead on the hardest ones. Pick the
[9:49] right tool for the job you're trying to
[9:50] do.
[9:51] Different tools for different jobs. Most
[9:54] real systems end up using both. Hosted
[9:56] APIs for the hard reasoning, self-hosted
[9:59] for high-volume traffic or anything
[10:01] privacy-sensitive.
[10:03] That's it.
[10:05] Pull a model, run it, expose it on a
[10:07] port, point an app at it. Once you've
[10:10] done it once, you can do it for any
[10:12] open-weight model. Llama, Mistral, Qwen,
[10:15] DeepSeek, whatever drops next week.
[10:18] If you want, spin up your own. The
[10:19] coupon LTS10 takes 10% off at
[10:22] hostinger.com. Links in the description.
[10:25] Thanks for watching. See you in the next
[10:26] one.