[0:00] There's one number on the stats
[0:01] of your graphics card that
[0:03] matters more than
[0:04] everything else combined for
[0:06] local AI agents.
[0:07] And it's not the one you think.
[0:08] Most people build their AI
[0:10] computer the same way
[0:10] they would build a gaming PC:
[0:12] faster processor,
[0:14] bigger graphics
[0:15] card, and just more power.
[0:16] That's exactly why their setup
[0:17] runs like garbage.
[0:18] In my previous business,
[0:19] I used to overcharge people for
[0:21] this stuff, and it made
[0:22] me just sick of it.
[0:23] So I had to quit.
[0:24] Now I just show you how to
[0:25] build it yourself for
[0:27] completely free.
[0:29] Let me explain this in the
[0:30] simplest way I can.
[0:32] We're going to stay in one
[0:33] analogy for most of this video.
[0:35] So stick with me.
[0:36] Your local AI setup is like a
[0:39] restaurant kitchen.
[0:40] The graphics card,
[0:41] that's the part of your
[0:42] computer that does
[0:42] all the heavy math.
[0:44] Think of it as the chef: how
[0:45] fast the chef's hands move,
[0:47] how quickly they
[0:47] can chop, stir, plate.
[0:49] But the memory on the graphics
[0:50] card, called VRAM,
[0:52] that's the size of
[0:53] the kitchen counter.
[0:54] Remember that. And here's the
[0:55] thing nobody's
[0:56] really talking about.
[0:57] The counter size matters more
[0:59] than hand speed.
[1:01] Here's what actually happens
[1:02] when you run AI locally.
[1:04] The AI model is
[1:05] basically a giant recipe.
[1:07] When you see a model labeled
[1:09] 7B, that means 7 billion,
[1:12] a seven with
[1:12] nine zeros after it.
[1:14] That's 7 billion tiny
[1:16] instructions that
[1:17] tell the AI how to think.
[1:19] More instructions, smarter AI,
[1:22] but also a bigger recipe that
[1:24] takes up more and
[1:25] more counter space.
[1:26] That entire recipe needs to sit
[1:29] on the counter
[1:30] while the chef works.
[1:31] If it fits, the chef works at
[1:33] full speed, chopping, plating,
[1:36] no wasted movement at all.
[1:38] But the second the recipe is
[1:39] just too big for the counter,
[1:41] the chef has to keep running to
[1:43] the back storage room
[1:44] to grab ingredients.
[1:46] That storage room is your
[1:47] computer's regular memory.
[1:49] We call it RAM instead of VRAM.
[1:52] And it is just
[1:52] way, way slower.
[1:54] We're talking about going from
[1:56] a smooth 40 words per second
[1:59] down to maybe two to three
[2:00] words, which is just unusable.
[2:03] Now you might be wondering how
[2:05] do people fit
[2:05] these massive recipes
[2:07] on a normal counter?
[2:08] They use shorthand.
[2:10] Instead of writing every
[2:11] instruction in full
[2:12] detail handwriting,
[2:13] they compress it down. We call
[2:15] it a 4 bit compression,
[2:17] which is the same recipe just
[2:19] in a smaller notebook.
[2:21] You lose maybe a
[2:22] tiny bit of detail,
[2:23] but it fits on way less
[2:25] counter space at
[2:27] four bit compression.
[2:28] Here's a cheat sheet for you.
[2:30] A seven billion instruction
[2:31] model takes about five
[2:32] gigs of counter space.
[2:34] A 14 billion takes about 10,
[2:37] 32 billion takes about 20,
[2:39] 70 billion takes about 40.
[2:41] That's just the
[2:41] recipe sitting there.
[2:43] The chef hasn't even
[2:44] started cooking yet.
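The cheat sheet can be roughly reproduced with back-of-the-envelope arithmetic: parameter count times bits per parameter, divided by eight to get bytes. The sketch below assumes a flat 20% overhead on top of the raw weights (the `overhead` value is my assumption, not a figure from the video), so it lands in the same ballpark as the numbers quoted rather than matching them exactly.

```python
def weight_memory_gb(params_billions: float,
                     bits_per_weight: float = 4.0,
                     overhead: float = 1.2) -> float:
    """Rough counter space (in GB) taken by the model weights alone.

    overhead is an assumed fudge factor for the parts of a quantized
    file that are stored at higher precision; real files vary.
    """
    # 1e9 parameters * bits / 8 bits-per-byte / 1e9 bytes-per-GB
    raw_gb = params_billions * bits_per_weight / 8
    return raw_gb * overhead


for size in (7, 14, 32, 70):
    print(f"{size}B at 4-bit: about {weight_memory_gb(size):.1f} GB")
```

Passing `bits_per_weight=16` shows why uncompressed models are out of reach on consumer cards: the same 7B model would need several times the space.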
[2:46] The moment you
[2:46] start a conversation,
[2:47] the conversation memory starts
[2:49] growing, right?
[2:50] So think of it like dirty dishes
[2:52] piling up on the
[2:53] counter while the chef cooks.
[2:55] Longer conversation, there's
[2:57] obviously more
[2:58] dishes piling up,
[2:59] less room for the recipe.
[3:01] That's why a model that loads
[3:03] perfectly fine can
[3:04] still slow to a crawl 20
[3:06] minutes into a conversation.
[3:07] The counter ran
[3:09] out of space.
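The "dirty dishes" are what is usually called the KV cache: for every token in the conversation, the model keeps a key and a value vector per layer. A minimal sketch of how that grows, assuming illustrative dimensions for an 8B-class model (32 layers, 8 key/value heads, head size 128, 16-bit values). These defaults are assumptions for the example, not numbers read from any particular model file.

```python
def kv_cache_gib(context_tokens: int,
                 n_layers: int = 32,
                 n_kv_heads: int = 8,
                 head_dim: int = 128,
                 bytes_per_value: int = 2) -> float:
    """Approximate conversation memory (GiB) for a given context length."""
    # One key vector and one value vector per layer, per KV head, per token.
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_tokens * per_token_bytes / 1024**3


for tokens in (2048, 8192, 32768):
    print(f"{tokens} tokens: {kv_cache_gib(tokens):.2f} GiB of dishes")
```

The growth is linear in conversation length, which is why a model that loads fine can still run out of room later in a long chat.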
[3:10] So the question here isn't how
[3:12] fast is my chef?
[3:13] The real question is supposed
[3:15] to be like, how
[3:16] big is my counter?
[3:18] The VRAM. That one number, VRAM,
[3:21] dictates almost
[3:21] everything about your
[3:23] local AI experience. Okay.
[3:27] So now you know
[3:28] the real bottleneck,
[3:29] but here's where most
[3:30] people still mess up.
[3:31] They pick the right counter and
[3:33] then cheap out on
[3:35] everything else in the kitchen,
[3:37] or they just overbuy because
[3:38] some random
[3:39] Reddit post told them they
[3:41] need a $5,000
[3:42] setup. Neither is true.
[3:44] Let me walk you through exactly
[3:45] what to buy at
[3:46] every budget level.
[3:48] So this is where most people
[3:50] should start and it's way more
[3:51] affordable than you
[3:53] may have thought. The core of
[3:54] this build is just
[3:55] one graphics card.
[3:57] The RTX 4060 Ti
[3:59] with 16 gigs of VRAM,
[4:02] 16 gigs of counter space, not
[4:05] the eight gig
[4:05] version.
[4:06] That's the wrong version.
[4:07] That's the trap. The eight gig
[4:09] one fills up the second you
[4:10] load a real model
[4:11] plus the conversation.
[4:12] It just goes haywire, right?
[4:14] So dishes start
[4:15] piling up immediately.
[4:16] You need at least 12 or 16.
[4:18] Yeah, 12 or 16.
[4:21] I'm running mine on, uh, I
[4:23] think mine is 16.
[4:24] Yes. Around that,
[4:25] you build a simple desktop: an
[4:27] AMD Ryzen 5 processor.
[4:29] That's the brain of the
[4:30] computer for local
[4:31] AI. It matters way,
[4:33] way less than you think, but
[4:35] you still need a
[4:36] decent one though.
[4:37] 64 gigs of system RAM. That's
[4:39] the back storage room.
[4:40] You want it big enough that if
[4:42] things spill off the counter,
[4:44] there's somewhere for them to
[4:45] go, you know. Now,
[4:46] a two terabyte SSD.
[4:48] Think of it like a
[4:49] pantry where all your model
[4:51] recipes are stored
[4:52] before you pull
[4:53] them onto the counter.
[4:54] Now there should be a decent
[4:56] power supply as well.
[4:58] That's the electrical panel for
[4:59] the whole kitchen and a case
[5:01] with good airflow
[5:02] to keep it cool. So total cost,
[5:05] total damage, is about $1,200 to
[5:08] $1,500. Obviously that's USD
[5:10] though. What does
[5:11] this actually run?
[5:13] It can run 7 and 8
[5:14] billion instruction
[5:15] models very comfortably.
[5:17] That's your Qwen3 8 billion
[5:19] parameter model that I covered
[5:21] in the last video.
[5:22] You may have watched it. Also
[5:23] your DeepSeek
[5:25] distilled 7 billion,
[5:26] your Llama 8 billion. These
[5:28] are, these are not toys.
[5:29] These models handle real coding
[5:30] assistance, document
[5:31] summaries, private chat,
[5:33] and light, very light agent
[5:35] workflows as well.
[5:36] You can push a 14 billion
[5:38] instruction model on
[5:39] this build as well,
[5:41] but you'll feel
[5:42] some kind of trade off.
[5:43] So shorter conversations before
[5:45] the dishes pile up,
[5:46] maybe a slower output.
[5:48] It's still usable, but you're
[5:49] like bumping against the edge
[5:50] of the counter,
[5:52] the end of it, right? Now,
[5:53] if you're already in the Apple
[5:54] world deep into the ecosystem,
[5:56] a MacBook Pro or Mac mini with
[5:58] 16 gigs of unified
[5:59] memory gets you into the
[6:01] same exact tier.
[6:03] Here's why Apple is a little
[6:04] different from PC.
[6:05] On a regular PC,
[6:07] the graphics card has its
[6:09] own counter and the
[6:11] computer has a separate
[6:12] storage room in the back,
[6:13] but on a Mac,
[6:15] there's no storage room.
[6:16] It's just all one
[6:18] big freaking counter.
[6:19] The graphics and the main
[6:20] computer share the
[6:22] same pool of memory.
[6:24] That's what unified means.
[6:26] So 16 gigs on a Mac is all
[6:28] usable counter space,
[6:30] but there's a trade off. Macs at
[6:32] this level are a
[6:34] little bit slower on raw
[6:35] speed compared to a dedicated
[6:37] Nvidia graphics
[6:38] card, but you know,
[6:40] the simplicity is just too
[6:41] difficult to beat. Ollama, one
[6:44] download, and you're
[6:45] running models in minutes.
[6:48] This is the sweet spot for
[6:50] anyone doing serious local
[6:52] work, coding agents,
[6:54] document analysis, multi-step
[6:55] AI workflows. There
[6:57] are two paths here,
[6:57] I think. So path one, you can
[6:59] grab an RTX 4070 Ti
[7:02] Super with 16 gigs
[7:03] of counter space,
[7:05] which is much faster chef hands
[7:07] than the 4060 Ti,
[7:09] more headroom, better for
[7:11] agent style loops where the
[7:12] model is thinking,
[7:13] properly executing,
[7:15] and kind of thinking again. And
[7:17] there's the other path,
[7:18] which I think is the move
[7:20] a lot of experienced local AI
[7:22] power users
[7:24] make. You can buy a used
[7:26] RTX 3090, wait for it.
[7:29] It's an older card obviously,
[7:31] but like I told you,
[7:32] when it comes to local AI, the
[7:34] graphics card doesn't really
[7:35] matter. It's all
[7:36] VRAM. It's all VRAM.
[7:37] So it has 24 gigs of VRAM in
[7:40] this older graphics card.
[7:42] So 24 gigs of counter space is
[7:44] just a freaking
[7:45] different world. At 24,
[7:46] you can run a 32 billion
[7:48] instruction model in
[7:49] shorthand and still have room
[7:51] left over for a
[7:52] long conversation.
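To see how the tiers fall out of the counter-space idea, here is a small sketch that uses the 4-bit cheat sheet from earlier in the video and checks which model size still fits once you leave room for the conversation. The `conversation_gb` and `reserve_gb` defaults are hypothetical allowances I picked for illustration, not measured values.

```python
# 4-bit weight sizes in GB, as quoted in the video's cheat sheet.
CHEAT_SHEET_GB = {7: 5, 14: 10, 32: 20, 70: 40}


def largest_model_that_fits(vram_gb: float,
                            conversation_gb: float = 2.0,
                            reserve_gb: float = 1.5):
    """Return the biggest model size (billions) that fits, or None.

    conversation_gb: assumed room for the growing conversation memory.
    reserve_gb: assumed room for the runtime and the display.
    """
    best = None
    for params_b, weights_gb in sorted(CHEAT_SHEET_GB.items()):
        if weights_gb + conversation_gb + reserve_gb <= vram_gb:
            best = params_b
    return best


for vram in (8, 16, 24, 96):
    print(f"{vram} GB counter -> {largest_model_that_fits(vram)}B")
```

Under these assumptions an 8 GB card fits nothing comfortably, 16 GB tops out at the 14B size, and 24 GB is where the 32B size first fits, which lines up with the tiers described here.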
[7:54] So models like Qwen3 32B
[7:56] or DeepSeek R1 distilled
[7:58] 32B, and there are new
[8:00] models. I haven't
[8:01] tested those yet,
[8:01] but these are the models that
[8:03] start rivaling cloud quality
[8:04] for most everyday
[8:06] use cases. Rest of
[8:07] the build stays kind
[8:08] of similar, you know,
[8:09] Ryzen 7 processor, 64 gigs
[8:11] of RAM, two terabyte SSD,
[8:13] bigger the better. But you
[8:15] know, on the Mac side though,
[8:16] a Mac mini M4 Pro with 64 gigs
[8:19] of unified memory
[8:20] lands you here too.
[8:21] That big shared counter means
[8:23] you can load a 32 billion
[8:24] instruction model
[8:25] and still have breathing room for
[8:28] it. Speed is slower
[8:29] than the Nvidia cards,
[8:30] but how slow are we talking?
[8:32] About 10 to 11,
[8:36] 12 words per second. We call it
[8:38] tokens per second,
[8:39] that's the official term.
[8:40] And it basically means how many
[8:42] words the AI spits
[8:44] out each second.
[8:45] You know, like when you type
[8:46] into ChatGPT and it
[8:48] generates the output,
[8:49] the speed of the output being
[8:51] generated, that's
[8:52] tokens per second.
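One way to make tokens per second concrete is to turn it into waiting time for a typical answer. The sketch below is plain division; the 300-token answer length is a hypothetical example, and the three rates are the ones mentioned in the video (about 3 when the model spills into system RAM, about 12 on the Mac, about 40 when everything fits in VRAM).

```python
def seconds_to_generate(response_tokens: int, tokens_per_second: float) -> float:
    """How long you wait for a full answer at a given generation speed."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return response_tokens / tokens_per_second


answer_length = 300  # hypothetical answer of a few paragraphs
for rate in (3, 12, 40):
    wait = seconds_to_generate(answer_length, rate)
    print(f"{rate} tokens/s: {wait:.1f} s for {answer_length} tokens")
```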
[8:53] And usually 10 to 15 feels like
[8:55] a person typing very fast.
[8:57] 30 plus feels nearly
[8:59] instant. So at 11 to 12,
[9:02] you're in a comfortable range,
[9:04] not blazing fast, but you know,
[9:06] comfortable. And the Mac runs
[9:08] quietly, sips power, and just
[9:11] works, smooth like
[9:12] butter. Now onto the next one.
[9:16] This is only worth it
[9:17] if your workflow
[9:17] genuinely demands it.
[9:19] Don't buy this because it
[9:20] sounds cool. You gotta, you
[9:22] gotta take this seriously.
[9:24] The centerpiece is an RTX 4090
[9:26] with 24 gigs of counter space,
[9:28] paired with a Ryzen
[9:29] 9 processor, 128
[9:31] gigs of system RAM,
[9:33] which is a massive storage room
[9:35] for overflow, right?
[9:36] And a beefy power supply. This
[9:38] runs 32 billion
[9:39] instruction models
[9:40] like butter: fast chef,
[9:42] big counter,
[9:43] long conversations,
[9:45] complex agent chains.
[9:47] You can probably also
[9:49] experiment with a 70 billion
[9:50] instruction model at heavy
[9:52] compression, but that's something
[9:53] I haven't done myself.
[9:55] So I can't really tell you
[9:57] if it's actually
[9:58] runnable, right?
[9:59] I heard it's working,
[10:01] but it's likely you're covering
[10:02] the entire counter
[10:03] with recipe pages.
[10:05] You can expect trade-offs
[10:07] on conversation
[10:07] length because there's just
[10:09] barely room for the dishes.
[10:11] Now the Apple equivalent is the
[10:13] Mac Studio with an
[10:14] M3 Ultra chip and 96
[10:17] gigs of unified
[10:18] memory, 96 gigs of counter.
[10:21] This thing loads multiple
[10:23] recipes at once.
[10:25] A reasoning model, an embedding
[10:27] model, a coding model,
[10:28] all sitting on the counter
[10:29] simultaneously, and it idles at
[10:31] under a hundred
[10:32] watts. The 4090 desktop will
[10:34] draw five to 10
[10:36] times that under load.
[10:37] One more thing.
[10:38] You'll see people talking about
[10:40] RTX 5090 builds with 32 gigs
[10:43] of VRAM and dual
[10:44] GPU setups pushing 64 gigs
[10:46] total. That stuff exists,
[10:49] but we're talking like a 10K
[10:50] setup, 10K plus.
[10:51] So that's like,
[10:53] I'd say that's the highest end of
[10:55] consumer grade
[10:56] graphics cards,
[10:57] the end, the boss level.
[10:59] This is for like people who
[11:01] actually want to train their
[11:02] own AI model to do
[11:05] something with it.
[11:06] Like I don't know if you guys
[11:07] have seen a recent PewDiePie video
[11:10] where he trained his
[11:11] own AI models for six plus
[11:13] months to cross the,
[11:15] I forgot the name of the
[11:16] benchmarks, but he did, which
[11:18] was pretty interesting.
[11:21] Anyway, quick note on a
[11:23] Raspberry Pi. I love the Pi.
[11:25] I don't use it anymore, but I
[11:26] made videos
[11:26] about it in the past,
[11:28] but it's not a
[11:29] local AI daily driver.
[11:30] A Pi 5 is great for like
[11:33] edge experiments,
[11:35] like running OpenClaw for
[11:36] sandbox agent execution,
[11:38] computer vision projects. But
[11:40] if you're trying to
[11:41] run a real
[11:42] large language model
[11:43] for chat or coding,
[11:46] you need one of the three tiers
[11:47] that I mentioned above instead
[11:49] of the Raspberry
[11:49] Pi. The Pi is the garage
[11:52] workshop for small projects,
[11:54] but the tiers above are for the
[11:56] actual kitchen.
[11:59] If hardware is
[12:00] half the equation,
[12:01] here's the software side, and
[12:03] I'm going to keep
[12:03] this very tight.
[12:04] So for getting models running,
[12:06] two options dominate right now.
[12:08] Ollama. This is the one that I use.
[12:10] It's a command line tool,
[12:11] so there's a bit of a learning
[12:13] curve, but it's dead simple. You
[12:15] type one command,
[12:16] the model downloads and loads
[12:17] onto your computer and it's
[12:19] just running there.
[12:20] Works on Mac, Windows and
[12:21] Linux. You can also download it
[12:22] off of their website.
[12:23] LM Studio, which is the next
[12:25] one, is the same idea, but with a
[12:26] visual interface.
[12:28] What I mean by visual interface
[12:29] is like ChatGPT. There's a
[12:30] chat window there.
[12:32] So if the command line
[12:33] interface kind of
[12:34] makes you nervous,
[12:36] you can start here with LM Studio.
[12:38] Both of these handle model
[12:39] downloading, graphics card
[12:40] detection, and serving the
[12:41] model locally, so your other
[12:43] tools can talk to it.
[12:44] But as far as I remember, these
[12:45] two have different sizes of
[12:47] context window.
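"Serving the model locally" means other programs can send it prompts over HTTP on your own machine. Here is a minimal sketch against Ollama's local API using only the Python standard library. It assumes Ollama is running on its default port and that a model has already been pulled; the model tag `qwen3:8b` in the usage comment is an assumption, so swap in whatever tag you actually have.

```python
import json
import urllib.request

# Ollama's default local endpoint for one-shot text generation.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Prepare a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def tokens_per_second(reply: dict) -> float:
    """Generation speed from the stats Ollama includes in its reply.

    eval_count is the number of generated tokens and eval_duration
    is the time spent generating them, in nanoseconds.
    """
    return reply["eval_count"] / (reply["eval_duration"] / 1e9)


def ask(model: str, prompt: str) -> dict:
    """Send the prompt and return the parsed JSON reply."""
    with urllib.request.urlopen(build_request(model, prompt)) as resp:
        return json.loads(resp.read())


# Usage (needs a running Ollama server; the model tag is an assumption):
# reply = ask("qwen3:8b", "Explain VRAM in one sentence.")
# print(reply["response"])
# print(f"{tokens_per_second(reply):.1f} tokens/s")
```

Nothing here leaves your machine: the request goes to `localhost`, which is the whole point of running the model locally.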
[12:48] Now here's a detail that'll
[12:49] save you real frustration here.
[12:51] Model files come in different
[12:52] packaging formats.
[12:53] Think of it like how the same
[12:55] movie can be a
[12:56] DVD or, you know,
[12:58] Blu-ray, same content,
[13:00] different packaging optimized
[13:01] for just different players.
[13:03] There's GGUF, it's "guff". It's
[13:05] the format that
[13:06] plays best on Macs.
[13:08] And AWQ is the one built for Nvidia
[13:10] graphics cards. If
[13:11] you're on a Mac,
[13:12] just grab GGUF or MLX. If you're on
[13:14] Windows or using a Linux server with an Nvidia card,
[13:16] you can look into AWQ, because
[13:17] I heard that there was a
[13:19] test that showed
[13:20] that it gave faster
[13:22] response time and better
[13:23] quality output compared to
[13:25] GGUF on the same card. So most
[13:27] people don't know this.
[13:28] They just grab whatever model
[13:30] file has the most downloads.
[13:32] And if they're using the wrong
[13:33] format for their machine,
[13:35] they're kind of leaving speed
[13:36] on the table there. So for
[13:38] agent workflows,
[13:39] you can plug Ollama into tools
[13:41] like n8n for automation,
[13:44] CrewAI for multi-agent setups,
[13:46] or build custom pipelines.
[13:47] But that's a whole separate
[13:48] video I'll cover later.
[13:52] I want to be straight with you
[13:53] because I think most
[13:54] AI content online isn't.
[13:57] A local AI agent is not a
[13:59] replacement for cloud AI,
[14:01] the closed frontier models. I'm
[14:03] talking about ChatGPT,
[14:05] Claude, Gemini. Just not yet.
[14:07] Maybe not even, not
[14:08] ever for everything.
[14:10] You know, the biggest, most
[14:11] powerful reasoning models
[14:13] still live in the cloud.
[14:14] They're all U.S. made. When I
[14:16] need frontier level
[14:17] thinking on a complex
[14:19] problem, I use Claude or GPT.
[14:21] Just that's the,
[14:22] that's the truth.
[14:23] But I now have very high
[14:26] expectations for a new
[14:28] DeepSeek 4
[14:29] coming. I don't know when.
[14:31] But hopefully it doesn't crash
[14:32] the stock market
[14:33] this time. Anyway,
[14:35] here's what local does better
[14:36] than anything in the cloud.
[14:38] That's just the privacy side,
[14:40] right? Your data
[14:40] never leaves your machine.
[14:42] No terms of service, no
[14:43] training on your prompts,
[14:45] no training on your secrets,
[14:47] your private life.
[14:49] No API logs. Cost control: no
[14:51] surprise bills,
[14:53] which is very important.
[14:54] No token meters running.
[14:56] You just pay once for the
[14:57] hardware and every conversation
[14:59] after that is just
[14:59] free. Uptime: maybe your
[15:01] internet goes down.
[15:02] Your local model does not care.
[15:04] It just keeps
[15:05] working. That's how it works.
[15:06] It just lives in your device
[15:08] once downloaded.
[15:09] Think of it like this.
[15:10] Local AI is your home gym and
[15:13] cloud AI is like a commercial
[15:15] gym downtown. Your home gym
[15:17] can handle maybe
[15:17] 80% of your workflows.
[15:19] It's always open, always
[15:20] private. You never
[15:21] wait for a machine.
[15:23] But once in a while you need
[15:24] the heavy equipment downtown.
[15:26] You might want to do some
[15:27] deadlifts. That's
[15:28] fine. You use both.
[15:29] The smartest setup in 2026 and
[15:31] probably beyond
[15:33] is just a hybrid.
[15:34] Local for the daily work, cloud
[15:36] for the heavy lifting, and
[15:37] the builds I just
[15:38] showed you are exactly where
[15:40] that local side starts.
[15:42] So I know you love a summary. So
[15:43] here's the summary.
[15:45] Buy the counter
[15:45] space, not hand speed.
[15:47] VRAM is the number that you
[15:49] need to focus on.
[15:50] Everything else is just
[15:50] secondary. Budget the whole
[15:52] kitchen, not just the chef.
[15:53] If you're new, you can start at
[15:55] tier one, the $1,200
[15:56] to $1,500 build.
[15:58] It's real, it's capable, and
[15:59] you can always upgrade the
[16:00] graphics card later.
[16:02] Too easy, right? So if this
[16:03] helped you figure
[16:03] out your build,
[16:05] why don't you drop a comment
[16:06] with the tier that
[16:07] you're going for?
[16:07] I read every single comment.
[16:09] And if you want the full parts list
[16:10] with links inside of it, I'll
[16:12] pin it in the comments.
[16:14] Thanks for watching. Bye now.