[0:06] Can you actually run a real LLM yourself? No OpenAI key, no Anthropic key, no per-token bill, just a model running on a server you control. Yes, and in the next few minutes we're going to set one up, expose it as an API, and build a small chat app that talks to it.
[0:24] There are four real reasons people do this: privacy, predictable cost, working on private networks, and control over which model version you run. We'll also talk about when self-hosting is the wrong call, because there are plenty of those cases, and pretending otherwise wastes your time.
[0:38] Quick distinction before we touch anything. These two things get mixed up constantly. Ollama is the runtime. It's the program that loads a model into memory and handles requests. Think of it like a JVM. By itself, it doesn't do anything useful, but it knows how to run things.
[0:54] Gemma, Llama, Mistral, Qwen, those are the models, the actual neural networks, multi-gigabyte files of weights that produce text. You can swap models anytime. Pull a new one, point Ollama at it, done.
[1:08] Could you do this on your laptop? Sure. Should you? Probably not. Laptops sleep, models eat RAM, and you don't want a 4-billion-parameter model warming your CPU mid-call. So, a server: always on, reachable, with enough RAM to hold a model without choking.
[1:25] For this tutorial, I'll be using a Hostinger VPS, and I want to walk through why, because the plan choice is important here. This isn't a generic "any server works" situation. Memory and CPU matter more here than for a regular web app.
[1:40] This is Hostinger's VPS hosting page, their AI managed VPS line. Before we continue, a few things are worth calling out.
[1:49] Full root access, because Ollama runs as a system service: we need to install it, expose ports, and adjust firewall rules. Locked-down hosts won't let you do that.
[1:59] NVMe SSD storage. That's relevant because model files are big. Gemma 3 4B is roughly 3.3 gigs on disk. A 7B model is around 4 to 5. Loading from a slow disk means slow cold starts.
[2:11] AMD EPYC processors and KVM virtualization. KVM gives you a real virtual machine with allocated CPU and RAM, which is stronger isolation than the kind of shared container hosting where someone else's workload can steal your CPU.
[2:26] The one-click LLM template, specifically an Ubuntu image preloaded with Ollama, the Llama 3 model as a default, and Open WebUI as a browser chat interface. We'll use the template, skip the default model, and pull Gemma 3 ourselves. Still saves us 5 minutes of setup.
[2:43] About the plan. There are four KVM tiers. I'm picking KVM 2: two vCPUs, 8 gigs of RAM, 100 GB NVMe. Gemma 3 4B in its 4-bit version uses about 3 to 5 gigs of RAM in practice: the model file is 3.3 gigs on disk, plus context as it grows. OS and runtime take another 1 to 2. That leaves a gig or two of headroom on eight. Tight, but it's enough for a 4B model and a small app on top.
[3:09] If you want to step up to 7B models, that's KVM 4 territory. 7B in 4-bit lands in the 5 to 7 gig range of RAM, which doesn't fit cleanly on eight. So, KVM 2 if you're staying with 3B or 4B, which is where we're going for this build. KVM 4 if you want to experiment with 7B.
[3:29] If you want to follow along, the coupon LTS10 takes 10% off whichever plan you pick. Link is hostinger.com/lts10, also in the description.
[3:40] After provisioning, you land in hPanel.
[3:43] From the VPS section, head into OS and panel, click operating system, pick Ubuntu, then select applications and pick Ollama. The template takes care of the VPS setup.
[3:55] hPanel gives you the VPS overview, free weekly backups, firewall settings, and a browser-based terminal, meaning no SSH key fiddling if you don't feel like it. One click and you're inside the box.
[4:09] That icon is Kodee, their AI assistant. You can hand it a task in plain English and it'll actually run the VPS commands itself. No copy-paste needed. Worth knowing about, even if we won't use it today.
[4:26] Because the template ships with Ollama, we can just verify it's there. It prints the version, so we're good. If you're on a plain Ubuntu install instead, this one-liner will do the job. Same outcome, 60 seconds longer.
[4:44] First, let's see what's already there. The template came with Llama 3 already pulled. Fine model, but we're going with Gemma 3 4B for this build: Google's open-weight model. Smaller, fast on CPU, fits on our 8-gig box.
[4:59] The first pull is the slow part. About 3.3 gigs over the wire; I've sped this up. After it's local, it's local. You won't pull it again.
[5:10] ollama list confirms what's installed. Both models sitting there, ready to go.
[5:15] Now, to actually talk to it. The first request takes a few seconds because the model has to load into RAM. After that, the next prompts are fast.
[5:23] There it is. The model we just pulled, replying from our own server. Nothing external touched.
[5:29] Now, the cost. What's this taking in RAM? With the model loaded, we're using about 4 to 5 gigs of RAM, more than half our 8-gig box. Self-hosting isn't free.
[5:40] Instead of paying per token, you're paying a fixed monthly bill for the server. If your traffic is bursty or small, hosted APIs are usually cheaper. If it's steady or you've got privacy constraints, self-hosting wins.
[5:55] Now, we can call it from code. Ollama's been serving a REST API on port 11434 the whole time. Nothing else to start, nothing else to configure.
[6:05] We hit the generate endpoint with a simple prompt. Standard JSON comes back: model name, generated text, token counts, timings, the usual fields you'd want.
[6:15] Ollama also exposes an OpenAI-compatible endpoint at /v1/chat/completions. If you have code using the OpenAI SDK, just point its base URL at your server. The basic chat completions endpoint works. One-line change and you've moved off OpenAI. Not every OpenAI endpoint is mirrored, but the everyday ones are there.
[6:37] If you ever open port 11434 to the public internet, add authentication. Ollama doesn't ship with auth by default. The hPanel firewall section is where you'd lock things down. Whitelist your app's IP or put a reverse proxy in front with an API key.
[6:53] For this video, we're keeping everything on the same server. The app and Ollama will talk over localhost, so we don't need to expose the port at all.
[7:03] Time to build something. Keeping it deliberately simple: a chat UI that streams responses from our model. Node and Express on the back, plain HTML on the front, around 80 lines total, and it'll teach you the pattern you can use in anything bigger.
[7:17] Make the folder, init a package, install Express. Done.
[7:22] Now the server. Standard Express setup. We serve a static front end from a public folder and accept JSON bodies. Nothing fancy.
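As a rough sketch of the call the server is about to make, here is how the request to Ollama's generate endpoint might be assembled in Node. The helper name buildGenerateRequest and the model tag gemma3:4b are assumptions for illustration, not something shown in the video. The port is Ollama's default, 11434, and model, prompt, and stream are fields of its generate API.

```javascript
// Sketch only: the constants and helper name below are assumptions.
const OLLAMA_URL = "http://localhost:11434/api/generate"; // Ollama's default port
const MODEL = "gemma3:4b"; // assumed tag for the Gemma 3 4B model

// Build the arguments for a fetch() call to the generate endpoint.
// stream = true asks Ollama to send newline-delimited JSON chunks.
function buildGenerateRequest(prompt, stream = true) {
  return {
    url: OLLAMA_URL,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ model: MODEL, prompt, stream }),
    },
  };
}

// Usage (Node 18+ ships a global fetch):
//   const { url, init } = buildGenerateRequest("Say hi in five words");
//   const res = await fetch(url, init);
```

Because the app and Ollama share the box, the URL points at localhost and the port never has to be opened to the outside.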
[7:31] One endpoint, POST /chat. It takes a prompt from the request body, forwards it to Ollama with stream set to true, and pipes the stream of tokens back to the browser as they arrive.
[7:43] Ollama uses newline-delimited JSON for streaming. Each line is one chunk of the response. We buffer the chunks, split on newlines, parse each line, and write the response field back to the browser. The browser sees tokens appear in real time, same feel as ChatGPT.
[8:02] The buffer is required because TCP chunks don't always end on a JSON line boundary. One line can get split across two chunks. Without it, you'd hit JSON.parse errors on partial lines, maybe one request in 15. Failures that look random until you spot the cause.
[8:22] The front end is very simple. Text area for the prompt, a div for the output, a button. The script does one thing: fetch to our chat endpoint, read the response body as a stream, and append each piece to the output div as it arrives. That's the full client.
[8:39] Start the server, open the VPS IP on port 3000. I'll ask it for a short poem about a tired developer at 2:00 a.m. Hit send and tokens stream back in real time.
[8:49] The full loop. Your code, your server, your model. No external API in the picture, no token meter anymore.
[8:58] From here, you'd add chat history, system prompts, maybe RAG over your own documents. Same loop underneath, just more layers on top.
[9:09] Before anyone uninstalls the OpenAI SDK, let's talk about the tradeoffs. Pick the wrong tool for the job and you'll regret it.
[9:17] Self-hosting wins for learning, for privacy-sensitive work, for side projects and internal tools, for predictable costs at low to medium volume, for offline use, and for pinning a model version so the vendor can't change it on you mid-project.
[9:31] Hosted APIs still win for frontier capability, for high-throughput production, for zero ops overhead, and for the largest models you wouldn't realistically run yourself.
[9:42] A 4B open model handles plenty of jobs. The largest closed models still pull ahead on the hardest ones. Pick the right tool for the job you're trying to do.
[9:51] Different tools for different jobs. Most real systems end up using both: hosted APIs for the hard reasoning, self-hosted for high-volume traffic or anything privacy-sensitive.
[10:03] That's it. Pull a model, run it, expose it on a port, point an app at it. Once you've done it once, you can do it for any open-weight model. Llama, Mistral, Qwen, DeepSeek, whatever drops next week.
[10:18] If you want, spin up your own. The coupon LTS10 takes 10% off at hostinger.com. Links in the description.
[10:25] Thanks for watching. See you in the next one.
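For reference, the line buffering described in the streaming walkthrough can be sketched like this. It's a minimal version under stated assumptions: createNdjsonParser is a hypothetical helper name, and the sample chunks are made up to show one JSON line split across two reads. The video's actual server code may differ.

```javascript
// Hypothetical helper: turns arbitrary text chunks into parsed NDJSON objects.
function createNdjsonParser(onObject) {
  let buffer = ""; // holds any partial line left over from the previous chunk

  return {
    // Feed one decoded text chunk from the stream.
    push(chunk) {
      buffer += chunk;
      const lines = buffer.split("\n");
      buffer = lines.pop(); // the last piece may be an incomplete line
      for (const line of lines) {
        if (line.trim() !== "") onObject(JSON.parse(line));
      }
    },
    // Call once when the stream ends, in case the final line had no newline.
    flush() {
      if (buffer.trim() !== "") onObject(JSON.parse(buffer));
      buffer = "";
    },
  };
}

// Example: the second JSON line arrives split across two chunks.
const pieces = [];
const parser = createNdjsonParser((obj) => pieces.push(obj.response));
parser.push('{"response":"Hel"}\n{"respo');
parser.push('nse":"lo"}\n');
parser.flush();
console.log(pieces.join("")); // prints "Hello"
```

In an Express handler, you would call push with each decoded chunk from the Ollama response and write obj.response to the browser inside the callback. Decoding with a streaming TextDecoder helps when a multi-byte character straddles two chunks.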