---
title: 'Want to Run AI Agents Locally? Here is The Bare Minimum Setup/Build'
source: 'https://youtube.com/watch?v=P-Fmo_CCIbY'
video_id: 'P-Fmo_CCIbY'
date: 2026-07-28
duration_sec: 977
---

# Want to Run AI Agents Locally? Here is The Bare Minimum Setup/Build

> Source: [Want to Run AI Agents Locally? Here is The Bare Minimum Setup/Build](https://youtube.com/watch?v=P-Fmo_CCIbY)

## Summary

This video explains that VRAM (graphics card memory) is the most critical factor for running local AI agents, not raw GPU speed. Using a kitchen analogy, the speaker describes how model size and conversation history consume VRAM, and provides three hardware tiers for different budgets, along with software recommendations.

### Key Points

- **VRAM is the most important spec** [0:00] — VRAM (counter space) matters more than GPU speed (chef hand speed) for local AI. If the model doesn't fit in VRAM, performance drops from ~40 tokens/sec to 2-3 tokens/sec.
- **Model size and compression** [1:04] — A 7B model takes ~5GB at 4-bit compression, 14B ~10GB, 32B ~20GB, 70B ~40GB. Conversation history adds to VRAM usage like dirty dishes piling up.
- **Tier 1: Budget build ($1200-1500)** [3:48] — RTX 4060 Ti 16GB VRAM, Ryzen 5, 64GB RAM, 2TB SSD. Runs 7-8B models comfortably, can push 14B with trade-offs.
- **Tier 2: Sweet spot build** [6:48] — Two paths: RTX 4070 Ti Super 16GB (faster) or used RTX 3090 24GB (more VRAM). Runs 32B models well. Mac equivalent: Mac Mini M4 Pro 64GB unified memory.
- **Tier 3: High-end build** [9:16] — RTX 4090 24GB, Ryzen 9, 128GB RAM. Runs 32B models like butter, can experiment with 70B. Mac equivalent: Mac Studio M3 Ultra 96GB.
- **Software: Ollama and LM Studio** [11:59] — Ollama (CLI) and LM Studio (GUI) are the main tools. Model formats: GGUF/MLX for Mac, AWQ for Nvidia. Using the wrong format leaves speed on the table.
- **Local vs Cloud AI** [13:52] — Local AI is not a replacement for cloud frontier models (ChatGPT, Claude). Use local for privacy, cost control, uptime; cloud for heavy lifting. Hybrid setup is best.

### Conclusion

For local AI, prioritize VRAM over GPU speed. Start with a $1200-1500 build (Tier 1) and upgrade later. Use a hybrid approach: local for daily tasks, cloud for heavy lifting.

## Transcript

There's one number on the stats
of your graphics card that
matters more than
everything else combined for
local AI agents.
And it's not the one you think.
Most people build their AI
computer the same way
they would build a gaming PC:
faster processor,
bigger graphics
cards, and just more power.
That's exactly why their setup
runs like garbage.
In my previous business,
I used to overcharge people for
this stuff, and it made
me just sick of it.
So I had to quit.
Now I just show you how to
build it yourself for
completely free.
Let me explain this in the
simplest way I can.
We're going to stay in one
analogy for most of this video.
So stick with me.
Your local AI setup is like a
restaurant kitchen.
The graphics card,
that's the part of your
computer that does
all the heavy math.
Think of it as the chef: how
fast the chef's hands move,
how quickly they
can chop, stir, plate.
But the memory on the graphics
card, called VRAM,
that's the size of
the kitchen counter.
Remember that. And here's the
thing nobody's
really talking about.
The counter size matters more
than hand speed.
Here's what actually happens
when you run AI locally.
The AI model is
basically a giant recipe.
When you see a model labeled
7B, that means 7 billion,
a seven with
nine zeros after it.
That's 7 billion tiny
instructions that
tell the AI how to think.
More instructions, smarter AI,
but also a bigger recipe that
takes up more and
more counter space.
That entire recipe needs to sit
on the counter
while the chef works.
If it fits, the chef works at
full speed, chopping, plating,
no wasted movement at all.
But the second the recipe is
just too big for the counter,
the chef has to keep running to
the back storage room
to grab ingredients.
That storage room is your
computer's regular memory.
We call it RAM instead of VRAM,
and it is just
way, way slower.
We're talking about going from
a smooth 40 words per second
down to maybe two to three
words, which is just unusable.
Now you might be wondering how
do people fit
these massive recipes
on a normal counter?
They use shorthand.
Instead of writing every
instruction in full,
detailed handwriting,
they compress it down. We call
it 4-bit compression,
which is the same recipe just
in a smaller notebook.
You lose maybe a
tiny bit of detail,
but it fits on way less
counter space. At
4-bit compression,
here's a cheat sheet for you.
A seven billion instruction
model takes about five
gigs of counter space.
A 14 billion takes about 10,
32 billion takes about 20,
70 billion takes about 40.
That's just the
recipe sitting there.
The chef hasn't even
started cooking yet.
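
The cheat sheet above comes from simple arithmetic: parameter count times bits per weight, divided by eight, gives the raw size of the weights in bytes. Here is a minimal Python sketch of that calculation, assuming decimal gigabytes. The function name `weight_size_gb` is illustrative. The raw figure comes out lower than the rounded numbers quoted in the video, and the gap is presumably runtime overhead, which this sketch does not model.

```python
def weight_size_gb(params_billion: float, bits_per_weight: float = 4.0) -> float:
    """Raw size of the model weights in decimal gigabytes (no runtime overhead)."""
    total_bits = params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


# Rounded figures quoted in the video for 4-bit models, in GB.
VIDEO_CHEAT_SHEET_GB = {7: 5, 14: 10, 32: 20, 70: 40}

if __name__ == "__main__":
    for size_b, quoted_gb in VIDEO_CHEAT_SHEET_GB.items():
        raw = weight_size_gb(size_b)
        print(f"{size_b}B: raw 4-bit weights {raw:.1f} GB, video quotes about {quoted_gb} GB")
```

Passing `bits_per_weight=16` shows why uncompressed models do not fit on consumer cards: the same 7B model needs 14 GB for the weights alone.
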
The moment you
start a conversation,
the conversation memory starts
growing, right?
So think of it like dirty dishes
piling up on the
counter while the chef cooks.
Longer conversation, there's
obviously more
dishes piling up,
less room for the recipe.
That's why a model that loads
perfectly fine can
still slow to a crawl 20
minutes into a conversation.
The counter ran
out of space.
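
The "dirty dishes" are usually the model's key/value cache, which grows with every token in the conversation. For a standard transformer, each token stores one key and one value vector per layer, so the size is 2 x layers x KV heads x head dimension x bytes per value. The sketch below assumes an illustrative model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values). Those defaults are assumptions for the example, not figures from the video.

```python
def kv_cache_bytes(n_tokens: int, n_layers: int = 32, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """Bytes of key/value cache held for n_tokens of conversation."""
    # Factor of 2: one key vector and one value vector per layer per token.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return n_tokens * per_token


if __name__ == "__main__":
    for tokens in (1_024, 8_192, 32_768):
        gib = kv_cache_bytes(tokens) / 2**30
        print(f"{tokens:>6} tokens -> {gib:.2f} GiB of conversation memory")
```

With these assumed defaults, an 8,192-token conversation holds exactly 1 GiB of cache on top of the weights, which is why a model that loads fine can run out of counter later.
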
So the question here isn't how
fast is my chef?
The real question is supposed
to be like, how
big is my counter?
The VRAM. That one number, VRAM,
dictates almost
everything about your
local AI experience. Okay.
So now you know
the real bottleneck,
but here's where most
people still mess up.
They pick the right counter and
then cheap out on
everything else in the kitchen,
or they just overbuy because
some random
Reddit post told them they
need a $5,000
setup. Neither is true.
Let me walk you through exactly
what to buy at
every budget level.
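
The "how big is my counter" question can be turned into a quick check. This is a rough sketch that uses the video's 4-bit cheat sheet plus an assumed 4 GB allowance for conversation memory. The allowance and the function name `models_that_fit` are assumptions made for illustration.

```python
# Rounded 4-bit sizes quoted in the video, in GB, keyed by billions of parameters.
CHEAT_SHEET_GB = {7: 5, 14: 10, 32: 20, 70: 40}


def models_that_fit(vram_gb: float, headroom_gb: float = 4.0) -> list[int]:
    """Return model sizes (in billions) whose weights plus headroom fit in VRAM.

    headroom_gb is an assumed allowance for conversation memory, not a
    figure from the video.
    """
    return [size for size, weights_gb in CHEAT_SHEET_GB.items()
            if weights_gb + headroom_gb <= vram_gb]


if __name__ == "__main__":
    for vram in (8, 16, 24):
        print(f"{vram} GB of VRAM fits: {models_that_fit(vram)}")
```

Lowering `headroom_gb` lets bigger models pass the check, at the cost of shorter conversations before the counter fills up.
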
So this is where most people
should start and it's way more
affordable than you
may have thought. The core of
this build is just
one graphics card.
The RTX 4060 Ti
with 16 gigs of VRAM,
16 gigs of counter space, not
the eight gig
version. That's
the wrong version.
That's the trap. The eight gig
one fills up the second you
load a real model
plus the conversation.
It just goes haywire, right?
The dishes start
piling up immediately.
You need at least 12 or 16.
Yeah, 12 or 16.
I'm running mine on, I
think mine is 16.
Yes. Around that,
you build a simple desktop: an
AMD Ryzen 5 processor.
That's the brain of the
computer for local
AI. It matters way,
way less than you think, but
you still need a
decent one though.
64 gigs of system RAM. That's
the back storage room.
You want it big enough that if
things spill off the counter,
there's somewhere for them to
go, you know. Now,
a two terabyte SSD.
Think of it like a
pantry where all your model
recipes are stored
before you pull
them onto the counter.
Now there should be a decent
power supply as well.
That's the electrical panel for
the whole kitchen, and a case
with good airflow
to keep it cool. So total cost,
total damage, is about $1,200 to
$1,500. Obviously that's USD
though. What does
this actually run?
It can run 7 and 8
billion instruction
models very comfortably.
That's your Qwen3 8 billion
parameter model that I covered
in the last video.
You may have watched it. So
your DeepSeek
distilled 7 billion,
your Llama 8 billion. These
are not toys.
These models handle real coding
assistance, document
summaries, private chat,
and light, very light agent
workflows as well.
You can push a 14 billion
instruction model on
this build as well,
but you'll feel
some kind of trade-off:
shorter conversations before
the dishes pile up,
maybe slower output.
It's still usable, but you're
bumping against the edge
of the counter,
the end of it, right? Now,
if you're already in the Apple
world deep into the ecosystem,
a MacBook Pro or Mac Mini with
16 gigs of unified
memory gets you into the
same exact tier.
Here's why Apple is a little
different from PC.
On a regular PC,
the graphics card has its
own counter and the
computer has a separate
storage room in the back,
but on a Mac,
there's no storage room.
It's just all one
big freaking counter.
The graphics and the main
computer share the
same pool of memory.
That's what unified means.
So 16 gigs on a Mac is all
usable counter space,
but there's a trade-off. Macs at
this level are a
little bit slower on raw
speed compared to a dedicated
Nvidia graphics
card, but you know,
simplicity is just too
difficult to beat. Ollama, one
download, and you're
running models in minutes.
This is the sweet spot for
anyone doing serious local
work, coding agents,
document analysis, multi-step
AI workflows. There
are two paths here,
I think. So path one, you can
grab an RTX 4070 Ti
Super with 16 gigs
of counter space,
which is much faster chef hands
than the 4060 Ti,
more headroom, better for
agent-style loops where the
model is thinking,
properly executing,
and kind of thinking again. And
there's the other path,
which I think is the move
a lot of experienced local AI
power users
make. You can buy a used
RTX 3090. Wait for it.
It's an older card obviously,
but like I told you,
when it comes to local AI, the
graphics card doesn't really
matter. It's all
VRAM. It's all VRAM.
So it has 24 gigs of VRAM in
this older graphics card.
And 24 gigs of counter space is
just a freaking
different world. At 24,
you can run a 32 billion
instruction model in
shorthand and still have room
left over for a
long conversation.
So models like Qwen3 32B
or DeepSeek R1 distilled
32B, and there are new
models I haven't
tested yet,
but these are the models that
start rivaling cloud quality
for most everyday
use cases. The rest of
the build stays kind
of similar, you know:
Ryzen 7 processor, 64 gigs
of RAM, two terabyte SSD,
bigger the better. But you
know, on the Mac side though,
a Mac Mini M4 Pro with 64 gigs
of unified memory
lands you here too.
That big shared counter means
you can load a 32 billion
instruction model
and still have breathing room for
it. Speed is slower
than the Nvidia cards,
but how slow are we talking?
About 10, 11,
12 words per second. Tokens
per second
is the official term,
and it basically means how many
words the AI spits
out each second.
You know, like when you type
into ChatGPT and it
generates the output,
the speed of the output being
generated, that's
tokens per second.
And usually 10 to 15 feels like
a person typing very fast.
30 plus feels
instant. So at 11 to 12,
you're in a comfortable range,
not blazing fast, but you know,
comfortable, and the Mac runs
quietly, sips power, and just
works smooth like
butter. Now onto the next one.
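
To make the tokens-per-second numbers concrete, here is a small sketch that converts a generation speed into how long you wait for a full answer. The 500-token answer length is an assumption picked for illustration. The speeds are the ones mentioned in the video.

```python
def tokens_per_second(tokens_generated: int, elapsed_seconds: float) -> float:
    """Generation speed: tokens produced divided by wall-clock seconds."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return tokens_generated / elapsed_seconds


def seconds_for_answer(answer_tokens: int, speed_tps: float) -> float:
    """How long an answer of answer_tokens takes at a given speed."""
    if speed_tps <= 0:
        raise ValueError("speed_tps must be positive")
    return answer_tokens / speed_tps


if __name__ == "__main__":
    # Speeds mentioned in the video: spilled out of VRAM, Mac tier 2, "instant", full speed.
    for speed in (2.5, 11, 30, 40):
        wait = seconds_for_answer(500, speed)
        print(f"{speed:>4} tokens/sec -> {wait:.0f} s for a 500-token answer")
```
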
This is only worth it
if your workflow
genuinely demands it.
Don't buy this because it
sounds cool. You gotta
take this seriously.
The centerpiece is an RTX 4090
with 24 gigs of counter space
paired with a Ryzen
9 processor, 128
gigs of system RAM,
which is a massive storage room
for overflow, right?
And a beefy power supply. This
runs 32 billion
instruction models
like butter: fast chef,
big counter,
long conversations,
complex agent chains.
You can probably also
experiment with a 70 billion
instruction model at heavy
compression, but that's something
I haven't done myself,
so I can't really tell you
if it's actually
runnable, right?
I heard it's working,
but you're likely covering
the entire counter
with recipe pages.
You can expect trade-offs
on conversation
length because there's just
barely room for the dishes.
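
"Heavy compression" can be put into numbers by asking how many bits per weight a model can afford on a given card. This is a minimal sketch, assuming decimal gigabytes and an assumed 4 GB reserved for conversation memory. The reserve and the function name are illustrative and do not come from the video.

```python
def max_bits_per_weight(vram_gb: float, params_billion: float,
                        reserved_gb: float = 4.0) -> float:
    """Largest average bits per weight that still fits the weights in VRAM."""
    usable_bytes = (vram_gb - reserved_gb) * 1e9
    if usable_bytes <= 0:
        raise ValueError("nothing left for weights after the reserve")
    return usable_bytes * 8 / (params_billion * 1e9)


if __name__ == "__main__":
    for size_b in (32, 70):
        bits = max_bits_per_weight(24, size_b)
        print(f"{size_b}B on a 24 GB card: up to {bits:.1f} bits per weight")
```

Under these assumptions a 32B model has room for about 5 bits per weight on 24 GB, while a 70B model would need to go down to roughly 2.3 bits, well below the 4-bit shorthand used in the cheat sheet.
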
Now the Apple equivalent is the
Mac Studio with an
M3 Ultra chip and 96
gigs of unified
memory, 96 gigs of counter.
This thing loads multiple
recipes at once:
a reasoning model, an embedding
model, a coding model,
all sitting on the counter
simultaneously, and it idles at
under a hundred
watts. The 4090 desktop will
draw five to 10
times that under load.
One more thing.
You'll see people talking about
RTX 5090 builds with 32 gigs
of VRAM and dual
GPU setups pushing 64 gigs
total. That stuff exists,
but we're talking like a 10K
setup, 10K plus.
So that's, like,
I'd say the highest end of
consumer-grade
graphics cards,
the end, the boss level.
This is for like people who
actually want to train their
own AI model to do
something with it.
Like, I don't know if you guys
have seen a recent PewDiePie video
where he trained his
own AI models for six plus
months to cross the,
I forgot the name of the
benchmarks, but he did, which
was pretty interesting.
Anyway, quick note on the
Raspberry Pi. I love the Pi.
I don't use it anymore, but I
made videos
about it in the past,
but it's not a
local AI daily driver.
A Pi 5 is great for
edge experiments,
like running OpenClaw for
sandbox agent execution, or
computer vision projects. But
if you're trying to
run a real
large language model
for chat or coding,
you need one of the three tiers
that I mentioned above instead
of the Raspberry
Pi. The Pi is the garage
workshop for small projects,
but the tiers above are
the actual kitchen.
If hardware is
half the equation,
here's the software side, and
I'm going to keep
this very tight.
So for getting models running,
two options dominate right now.
Ollama. This is the one that I use.
It's a command line tool,
so there's a bit of a learning
curve, but it's dead simple. You
type one command,
the model downloads and loads
onto your computer, and it's
just running there.
Works on Mac, Windows, and
Linux. You can download it
from their website.
LM Studio, which is the next
one, is the same idea, but with a
visual interface.
What I mean by visual interface
is it's like ChatGPT. There's a
chat window there.
So if the command line
interface kind of
makes you nervous,
you can start here with LM Studio.
Both of these handle model
downloading, graphics card
detection, and serving the
model locally, so your other
tools can talk to it.
But as far as I remember, these
two have different
context window sizes.
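
"Serving the model locally" means other programs can send prompts over HTTP. Below is a hedged sketch of calling Ollama's `/api/generate` endpoint on its default port 11434 using only the Python standard library. The model tag `qwen3:8b` in the usage comment is an assumption (it has to be pulled first), and the helper names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local address


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_local_model(model: str, prompt: str, timeout: float = 120.0) -> dict:
    """Send the prompt and return the parsed JSON reply."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def generation_speed(reply: dict) -> float:
    """Tokens per second from the reply's eval_count and eval_duration (nanoseconds)."""
    return reply["eval_count"] / (reply["eval_duration"] / 1e9)


# Example (needs a running Ollama server with the model already pulled):
# reply = ask_local_model("qwen3:8b", "Explain VRAM in one sentence.")
# print(reply["response"])
# print(f"{generation_speed(reply):.1f} tokens/sec")
```

The reply's timing fields make it easy to check which tokens-per-second range your own machine lands in.
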
Now here's a detail that'll
save you real frustration here.
Model files come in different
packaging formats.
Think of it like how the same
movie can be a
DVD or, you know,
Blu-ray: same content,
different packaging optimized
for different players.
There's GGUF, pronounced "guff". It's
the format that
plays best on Macs.
And AWQ is the one built for Nvidia
graphics cards. If
you're on a Mac,
just grab GGUF or MLX. If you're on
Windows or a Linux server with an Nvidia card,
you can look into AWQ, because
from what I heard, there was a
test that showed
that it gave faster
response times and better
quality output compared to
GGUF on the same card. Most
people don't know this.
They just grab whatever model
file has the most downloads.
And if they're using the wrong
format for their machine,
they're kind of leaving speed
on the table there. So for
agent workflows,
you can plug Ollama into tools
like n8n for automation,
CrewAI for multi-agent setups,
or build custom pipelines.
But that's a whole separate
video I'll cover later.
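
The packaging-format rule of thumb from this section can be written as a small lookup. This sketch only encodes what the video suggests (GGUF or MLX on a Mac, AWQ on Windows or Linux with an Nvidia card). The fallback for any other machine is an assumption of this sketch, and `suggested_format` is a hypothetical helper, not part of any tool.

```python
def suggested_format(os_name: str, has_nvidia_gpu: bool = False) -> str:
    """Map a machine to the packaging format the video suggests trying first."""
    os_key = os_name.strip().lower()
    if os_key in ("mac", "macos", "darwin"):
        return "GGUF or MLX"
    if os_key in ("windows", "linux") and has_nvidia_gpu:
        return "AWQ"
    # The video gives no rule for other machines; GGUF here is this sketch's assumption.
    return "GGUF"


if __name__ == "__main__":
    print(suggested_format("macOS"))
    print(suggested_format("Linux", has_nvidia_gpu=True))
```
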
I want to be straight with you
because I think most
AI content online isn't.
A local AI agent is not a
replacement for cloud AI,
the closed frontier models. I'm
talking about ChatGPT,
Claude, Gemini. Just not yet.
Maybe not
ever for everything.
You know, the biggest, most
powerful reasoning models
still live in the cloud.
They're all U.S.-made. When I
need frontier level
thinking on a complex
problem, I use Claude or GPT.
Just that's the,
that's the truth.
But I now have very high
expectations for the new
DeepSeek 4
coming. I don't know when.
But hopefully it doesn't crash
the stock market
this time. Anyway,
here's what local does better
than anything in the cloud.
That's just the privacy side,
right? Your data
never leaves your machine.
No terms of service, no
training on your prompts,
no training on your secrets,
your private life.
No API logs. Cost control: no
surprise bills,
which is very important.
No token meters running.
You just pay once for the
hardware and every conversation
after that is just
free. Uptime: maybe your
internet goes down.
Your local model does not care.
It just keeps
working. That's how it works.
It just lives in your device
once downloaded.
Think of it like this.
Local AI is your home gym and
cloud AI is like a commercial
gym downtown. Your home gym
can handle maybe
80% of your workflows.
It's always open, always
private. You never
wait for a machine.
But once in a while you need
the heavy equipment downtown.
You might want to do some
deadlifts. That's
fine. You use both.
The smartest setup in 2026 and
probably beyond
is just a hybrid.
Local for the daily work, cloud
for the heavy lifting, and
the builds I just
showed you are exactly where
that local side starts.
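
The hybrid idea can be sketched as a tiny routing rule: keep anything private or offline on the local model, and send only the hard problems to the cloud. The `Task` fields and the decision order below are assumptions made for illustration, not a setup described in the video.

```python
from dataclasses import dataclass


@dataclass
class Task:
    description: str
    contains_private_data: bool = False
    needs_frontier_reasoning: bool = False


def choose_backend(task: Task, internet_available: bool = True) -> str:
    """Return "local" or "cloud" for a task (hypothetical routing rule)."""
    # Privacy and uptime come first: these never leave the machine.
    if task.contains_private_data or not internet_available:
        return "local"
    if task.needs_frontier_reasoning:
        return "cloud"
    # Everyday work defaults to the local model.
    return "local"


if __name__ == "__main__":
    print(choose_backend(Task("summarize my tax documents", contains_private_data=True)))
    print(choose_backend(Task("design a complex migration plan", needs_frontier_reasoning=True)))
```
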
So I know you love summaries. So
here's the summary.
Buy the counter
space, not hand speed.
VRAM is the number that you
need to focus on.
Everything else is just
secondary. Budget the whole
kitchen, not just the chef.
If you're new, you can start at
tier one, the $1,200
to $1,500 build.
It's real, it's capable, and
you can always upgrade the
graphics card later.
Too easy, right? So if this
helped you figure
out your builds,
why don't you drop a comment
with a tier that
you're going for?
I read every single comment.
And if you want the full parts list
with links inside of it, I'll
pin it in the comments.
Thanks for watching. Bye now.