Run Your Own LLM on a Server - Ollama + Gemma 3

Transcribed Jun 18, 2026

Intermediate. 5 min read. For: Developers and tech enthusiasts with basic server administration knowledge who want to self-host LLMs.

Views: 3.6K | Likes: 151 | Comments: 9 | Dislikes: 14 | 4.5% (High Engagement)

AI Summary

This video provides a step-by-step tutorial on self-hosting a large language model (LLM) using Ollama and Gemma 3 on a VPS. It covers the reasons for self-hosting, the distinction between Ollama and models, server selection, setup, API exposure, and building a simple chat application. The tutorial also discusses tradeoffs between self-hosting and hosted APIs.

Chapters

1 Introduction and Motivation 0:06

2 Ollama vs. Models and Server Requirements 0:38

3 Choosing a VPS and Plan 1:25

4 Provisioning and Setting Up the Server 3:40

5 Interacting with the Model via API 5:54

6 Building a Chat Application 7:02

7 Tradeoffs and Conclusion 9:09

[0:24]

Reasons for Self-Hosting

Privacy, predictable cost, the ability to work on private networks, and control over which model version you run are the four main reasons to self-host an LLM.

[0:38]

Ollama vs. Models

Ollama is the runtime that loads and runs models; models such as Gemma, Llama, and Mistral are the neural networks it runs.

[1:08]

Server vs. Laptop

A server with enough RAM (e.g., 8GB for 4B models) and always-on connectivity is recommended over a laptop.

[1:25]

VPS Selection

Hostinger VPS with full root access, NVMe SSD, AMD EPYC processors, and a one-click LLM template is used.

[2:48]

Plan Tiers

KVM 2 (2 vCPUs, 8GB RAM) for 3B/4B models; KVM 4 for 7B models.
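
As a rough illustration of why 8GB is a comfortable fit for 3B/4B models, the weights' footprint can be approximated as parameter count times bytes per weight, plus headroom for the runtime and context. The 4-bit quantization default and the 1.5 GB overhead below are assumptions chosen for illustration, not figures from the video, and the estimate ignores what the OS and your own app need.

```javascript
// Rough RAM estimate for a quantized model.
// bitsPerWeight and overheadGB are illustrative assumptions, not measured values.
function estimateModelRamGB(paramsBillions, bitsPerWeight = 4, overheadGB = 1.5) {
  // 1 billion parameters at 8 bits each is about 1 GB, so scale by bits / 8.
  const weightsGB = paramsBillions * (bitsPerWeight / 8);
  return weightsGB + overheadGB;
}

console.log(estimateModelRamGB(4));    // 4B model at 4-bit → 3.5
console.log(estimateModelRamGB(4, 8)); // 4B model at 8-bit → 5.5
```

Treat the result as a lower bound when picking a plan; actual usage depends on the quantization Ollama ships for the model and on context length.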

[4:44]

Pulling Gemma 3

The template preloads Ollama; we pull Gemma 3 4B instead of the default Llama 3.

[5:55]

API Access

Ollama serves a REST API on port 11434 and an OpenAI-compatible endpoint at /v1/chat/completions.
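
A minimal sketch of calling the OpenAI-compatible endpoint from Node.js 18+ (which provides a global `fetch`). The `OLLAMA_URL` value and the helper names `buildChatRequest` and `chat` are hypothetical; the request and response shapes follow the OpenAI chat completions format.

```javascript
// Hypothetical address; replace localhost with your server's IP if calling remotely.
const OLLAMA_URL = "http://localhost:11434";

// Build the URL and fetch() options for a single-turn chat request.
function buildChatRequest(model, userMessage) {
  return {
    url: `${OLLAMA_URL}/v1/chat/completions`,
    options: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        model,
        messages: [{ role: "user", content: userMessage }],
      }),
    },
  };
}

// Send the request and return the assistant's reply text.
async function chat(model, userMessage) {
  const { url, options } = buildChatRequest(model, userMessage);
  const res = await fetch(url, options);
  if (!res.ok) throw new Error(`Ollama returned HTTP ${res.status}`);
  const data = await res.json();
  return data.choices[0].message.content;
}
```

With the model pulled, `chat("gemma3:4b", "Hello").then(console.log)` should print the reply.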

[6:37]

Security

Restrict access (for example, IP allowlisting or authentication behind a reverse proxy) if exposing the API publicly.
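
One possible way to gate access, assuming an Express app sits in front of Ollama, is a shared bearer token checked before requests are forwarded. This is a sketch, not the approach shown in the video; `hasValidToken` and `requireToken` are hypothetical names, and the plain string comparison is not constant-time.

```javascript
// Return true when the Authorization header carries the expected bearer token.
function hasValidToken(authorizationHeader, expectedToken) {
  if (typeof authorizationHeader !== "string" || !expectedToken) return false;
  const [scheme, token] = authorizationHeader.split(" ");
  return scheme === "Bearer" && token === expectedToken;
}

// Express-style middleware (req, res, next) that rejects unauthenticated calls.
function requireToken(expectedToken) {
  return (req, res, next) => {
    if (hasValidToken(req.headers.authorization, expectedToken)) return next();
    res.status(401).json({ error: "unauthorized" });
  };
}
```

In an Express app this would be mounted with something like `app.use(requireToken(process.env.API_TOKEN))`, where `API_TOKEN` is an assumed environment variable.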

[7:03]

Building a Chat App

A Node.js/Express backend streams tokens from Ollama to a plain HTML frontend in real time.
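
When streaming is enabled, Ollama's native `/api/generate` endpoint returns newline-delimited JSON, with each line carrying a piece of text in its `response` field. A backend that relays tokens therefore has to cope with chunks that end mid-line. The parser below is a minimal sketch of that buffering step; `createNdjsonTokenParser` is a hypothetical helper, not code from the video.

```javascript
// Create a parser that accepts raw text chunks and returns the tokens completed so far.
function createNdjsonTokenParser() {
  let buffer = ""; // holds an incomplete trailing line between chunks
  return function push(chunkText) {
    buffer += chunkText;
    const lines = buffer.split("\n");
    buffer = lines.pop(); // the last piece may be a partial line; keep it for later
    const tokens = [];
    for (const line of lines) {
      if (!line.trim()) continue; // skip blank lines
      const obj = JSON.parse(line);
      if (typeof obj.response === "string") tokens.push(obj.response);
    }
    return tokens;
  };
}

const push = createNdjsonTokenParser();
console.log(push('{"response":"Hel","done":false}\n{"resp')); // [ 'Hel' ]
console.log(push('onse":"lo","done":false}\n'));              // [ 'lo' ]
```

Each returned token can be written straight to the HTTP response so the browser sees text as it is generated.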

[9:09]

Tradeoffs

Self-hosting wins for learning, privacy, side projects, and predictable costs; hosted APIs win for frontier models and high throughput.

Clickbait Check

95% Legit

"The title accurately reflects the content: the video delivers a complete tutorial on running an LLM on a server, including setup, API exposure, and building a chat app."

Mentioned in this Video

Ollama (tool)

Gemma 3 (model)

Llama 3 (model)

Mistral (model)

Qwen (model)

DeepSeek (model)

Hostinger VPS (service)

Open WebUI (tool)

Node.js (tool)

Express (tool)

Tutorial Checklist

1 1:25 Provision a VPS with at least 8GB RAM (e.g., Hostinger KVM 2).

2 3:40 Select the Ollama template during OS installation (Ubuntu + Ollama).

3 4:26 Verify Ollama is installed by running 'ollama --version'.

4 4:44 Pull the desired model: 'ollama pull gemma3:4b'.

5 5:15 Test the model via CLI: 'ollama run gemma3:4b'.

6 5:55 Access the model via API: POST to http://<server-ip>:11434/api/generate with JSON body.

7 7:17 Create a Node.js project: 'mkdir chat-app && cd chat-app && npm init -y && npm install express'.

8 7:22 Create server.js with Express, serving static files and a /chat endpoint that streams from Ollama.

9 8:22 Create a public/index.html with a textarea, button, and output div, and client-side JavaScript to fetch and stream the response.

10 8:39 Start the server: 'node server.js' and access it at http://<server-ip>:3000.
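
The client-side streaming in step 9 can be sketched with the standard Fetch and Streams APIs. This assumes the `/chat` endpoint from step 8 sends plain text as it arrives; `streamText` is a hypothetical helper, and the `output` element and request body in the comment are placeholders.

```javascript
// Read a streamed fetch Response chunk by chunk, calling onText for each decoded piece.
// Returns the full text once the stream ends.
async function streamText(response, onText) {
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let full = "";
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // stream: true keeps multi-byte characters intact across chunk boundaries
    const text = decoder.decode(value, { stream: true });
    full += text;
    onText(text);
  }
  const tail = decoder.decode(); // flush any bytes still buffered in the decoder
  if (tail) {
    full += tail;
    onText(tail);
  }
  return full;
}

// Browser usage (placeholders: the JSON body shape and the "output" element):
// const res = await fetch("/chat", {
//   method: "POST",
//   headers: { "Content-Type": "application/json" },
//   body: JSON.stringify({ prompt: "Hello" }),
// });
// await streamText(res, (t) => { output.textContent += t; });
```

Appending each piece to the output element as it arrives is what makes the reply appear token by token instead of all at once.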

Study Flashcards (9)

What is Ollama?

Difficulty: easy

Ollama is the runtime that loads a model into memory and handles requests.