[0:00] There's one number on the spec sheet [0:01] of your graphics card that [0:03] matters more than [0:04] everything else combined for [0:06] local AI agents. [0:07] And it's not the one you think. [0:08] Most people build their AI [0:10] computer the same way [0:10] they would build a gaming PC: [0:12] faster processor, [0:14] bigger graphics [0:15] card, and just more power. [0:16] That's exactly why their setup [0:17] runs like garbage. [0:18] In my previous business, [0:19] I used to overcharge people for [0:21] this stuff, and it made [0:22] me sick of it. [0:23] So I had to quit. [0:24] Now I just show you how to [0:25] build it yourself for [0:27] completely free. [0:29] Let me explain this in the [0:30] simplest way I can. [0:32] We're going to stay in one [0:33] analogy for most of this video, [0:35] so stick with me. [0:36] Your local AI setup is like a [0:39] restaurant kitchen. [0:40] The graphics card, [0:41] that's the part of your [0:42] computer that does [0:42] all the heavy math. [0:44] Think of it as the chef: how [0:45] fast the chef's hands move, [0:47] how quickly they [0:47] can chop, stir, plate. [0:49] But the memory on the graphics [0:50] card, called VRAM, [0:52] that's the size of [0:53] the kitchen counter. [0:54] Remember that. And here's the [0:55] thing nobody's [0:56] really talking about: [0:57] the counter size matters more [0:59] than hand speed. [1:01] Here's what actually happens [1:02] when you run AI locally. [1:04] The AI model is [1:05] basically a giant recipe. [1:07] When you see a model labeled [1:09] 7B, that means 7 billion, [1:12] a seven with [1:12] nine zeros after it. [1:14] That's 7 billion tiny [1:16] instructions that [1:17] tell the AI how to think. [1:19] More instructions, smarter AI, [1:22] but also a bigger recipe that [1:24] takes up more and [1:25] more counter space. [1:26] That entire recipe needs to sit [1:29] on the counter [1:30] while the chef works.
[1:31] If it fits, the chef works at [1:33] full speed, chopping, plating, [1:36] no wasted movement at all. [1:38] But the second the recipe is [1:39] just too big for the counter, [1:41] the chef has to keep running to [1:43] the back storage room [1:44] to grab ingredients. [1:46] That storage room is your [1:47] computer's regular memory. [1:49] We call it RAM instead of VRAM. [1:52] And it is just [1:52] way, way slower. [1:54] We're talking about going from [1:56] a smooth 40 words per second [1:59] down to maybe two to three [2:00] words, which is just unusable. [2:03] Now you might be wondering, how [2:05] do people fit [2:05] these massive recipes [2:07] on a normal counter? [2:08] They use shorthand. [2:10] Instead of writing every [2:11] instruction in full, [2:12] detailed handwriting, [2:13] they compress it down. We call [2:15] it 4-bit compression, [2:17] which is the same recipe, just [2:19] in a smaller notebook. [2:21] You lose maybe a [2:22] tiny bit of detail, [2:23] but it fits on way less [2:25] counter space. At [2:27] 4-bit compression, [2:28] here's a cheat sheet for you. [2:30] A 7 billion instruction [2:31] model takes about five [2:32] gigs of counter space. [2:34] A 14 billion takes about 10, [2:37] 32 billion takes about 20, [2:39] 70 billion takes about 40. [2:41] That's just the [2:41] recipe sitting there. [2:43] The chef hasn't even [2:44] started cooking yet. [2:46] The moment you [2:46] start a conversation, [2:47] the conversation memory starts [2:49] growing, right? [2:50] So think of it like dirty dishes [2:52] piling up on the [2:53] counter while the chef cooks. [2:55] Longer conversation, there's [2:57] obviously more [2:58] dishes piling up, [2:59] less room for the recipe. [3:01] That's why a model that loads [3:03] perfectly fine can [3:04] still slow to a crawl 20 [3:06] minutes into a conversation. [3:07] The counter ran [3:09] out of space.
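For anyone who wants to check the cheat sheet arithmetic, here is a rough sketch in Python. It is not from the video. It assumes each weight costs about 4.8 bits once the shorthand's bookkeeping data is included (an assumed figure), and it treats the "dirty dishes" as a key/value cache that grows with every token in the conversation. The function names and the example model shape (32 layers, 8 key/value heads, head size 128) are illustrative assumptions.

```python
def weights_gb(n_params: float, bits_per_weight: float = 4.8) -> float:
    """Approximate size of the compressed model weights in GB (1 GB = 1e9 bytes).

    bits_per_weight=4.8 is an assumed effective cost for "4-bit" formats,
    which also store a little scaling data alongside the weights.
    """
    return n_params * bits_per_weight / 8 / 1e9


def kv_cache_gb(n_tokens: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_value: int = 2) -> float:
    """Approximate conversation memory (key/value cache) in GB.

    Each token stores one key and one value vector per layer,
    hence the leading factor of 2.
    """
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens
    return total_bytes / 1e9


# Recipe size for the four model sizes on the cheat sheet.
for billions in (7, 14, 32, 70):
    print(f"{billions}B recipe: ~{weights_gb(billions * 1e9):.1f} GB")

# Dishes piling up as the conversation gets longer (hypothetical model shape).
for tokens in (2_048, 8_192, 32_768):
    dishes = kv_cache_gb(tokens, n_layers=32, n_kv_heads=8, head_dim=128)
    print(f"{tokens} tokens of conversation: ~{dishes:.2f} GB")
```

Under these assumptions the recipe sizes land in the same ballpark as the cheat sheet, and the conversation memory grows in a straight line with the number of tokens, which is why a model that loads fine can still run out of counter later.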
[3:10] So the question here isn't how [3:12] fast is my chef? [3:13] The real question is supposed [3:15] to be, how [3:16] big is my counter? [3:18] The VRAM. That one number, VRAM, [3:21] dictates almost [3:21] everything about your [3:23] local AI experience. Okay. [3:27] So now you know [3:28] the real bottleneck, [3:29] but here's where most [3:30] people still mess up. [3:31] They pick the right counter and [3:33] then just cheap out. [3:34] They cheap out on [3:35] everything else in the kitchen, [3:37] or they just overbuy because [3:38] some random [3:39] Reddit post told them they [3:41] need a $5,000 [3:42] setup. Neither is true. [3:44] Let me walk you through exactly [3:45] what to buy at [3:46] every budget level. [3:48] So this is where most people [3:50] should start, and it's way more [3:51] affordable than you [3:53] may have thought. The core of [3:54] this build is just [3:55] one graphics card: [3:57] the RTX 4060 Ti [3:59] with 16 gigs of VRAM, [4:02] 16 gigs of counter space, not [4:05] the eight-gig [4:05] version. [4:06] That's the wrong version. [4:07] That's the trap. The eight-gig [4:09] one fills up the second you [4:10] load a real model [4:11] plus the conversation. [4:12] It just goes haywire, right? [4:14] The dishes start [4:15] piling up immediately. [4:16] You need at least 12 or 16. [4:18] Yeah, 12 or 16. [4:21] I'm running mine on, uh, I [4:23] think mine is 16. [4:24] Yes. Around that, [4:25] you build a simple desktop: an [4:27] AMD Ryzen 5 processor. [4:29] That's the brain of the [4:30] computer. For local [4:31] AI, it matters way, [4:33] way less than you think, but [4:35] you still need a [4:36] decent one. [4:37] 64 gigs of system RAM. That's [4:39] the back storage room. [4:40] You want it big enough that if [4:42] things spill off the counter, [4:44] there's somewhere for them to [4:45] go. Now, [4:46] a two terabyte SSD.
[4:48] That's your, think of it like a [4:49] pantry where all your model [4:51] recipes are stored [4:52] before you pull [4:53] them onto the counter. [4:54] There should be a decent [4:56] power supply as well. [4:58] That's the electrical panel for [4:59] the whole kitchen, and a case [5:01] with good airflow [5:02] to keep it cool. So total cost, [5:05] total damage, is about $1,200 to [5:08] $1,500. Obviously that's USD. [5:10] What does [5:11] this actually run? [5:13] It can run 7 and 8 [5:14] billion instruction [5:15] models very comfortably. [5:17] That's your Qwen3 8 billion [5:19] parameter model that I covered [5:21] in the last video. [5:22] You may have watched it. Also [5:23] your DeepSeek [5:25] distilled 7 billion, [5:26] your Llama 8 billion. These [5:28] are not toys. [5:29] These models handle real coding [5:30] assistance, document [5:31] summaries, private chat, [5:33] and light, very light agent [5:35] workflows as well. [5:36] You can push a 14 billion [5:38] instruction model on [5:39] this build as well, [5:41] but you'll feel [5:42] some kind of trade-off: [5:43] shorter conversations before [5:45] the dishes pile up, [5:46] maybe slower output. [5:48] It's still usable, but you're [5:49] bumping against the edge [5:50] of the counter, [5:52] the end of it, right? Now, [5:53] if you're already in the Apple [5:54] world, deep into the ecosystem, [5:56] a MacBook Pro or Mac mini with [5:58] 16 gigs of unified [5:59] memory gets you into the [6:01] same exact tier. [6:03] Here's why Apple is a little [6:04] different from PC. [6:05] On a regular PC, [6:07] the graphics card has its [6:09] own counter and the [6:11] computer has a separate [6:12] storage room in the back. [6:13] But on a Mac, [6:15] there's no storage room. [6:16] It's just all one [6:18] big freaking counter. [6:19] The graphics and the main [6:20] computer share the [6:22] same pool of memory. [6:24] That's what unified means.
[6:26] So 16 gigs on a Mac is all [6:28] usable counter space, [6:30] but there's a trade-off. Macs at [6:32] this level are a [6:34] little bit slower on raw [6:35] speed compared to a dedicated [6:37] Nvidia graphics [6:38] card, but you know, [6:40] the simplicity is just too [6:41] difficult to beat: Ollama, one [6:44] download, and you're [6:45] running models in minutes. [6:48] This is the sweet spot for [6:50] anyone doing serious local [6:52] work: coding agents, [6:54] document analysis, multi-step [6:55] AI workflows. There [6:57] are two paths here, [6:57] I think. Path one, you can [6:59] grab an RTX 4070 Ti [7:02] Super with 16 gigs [7:03] of counter space, [7:05] which has much faster chef hands [7:07] than the 4060 Ti, [7:09] more headroom, better for [7:11] agent-style loops where the [7:12] model is thinking, [7:13] executing, [7:15] and thinking again. And [7:17] there's the other path, [7:18] which I think is the move [7:20] a lot of experienced local AI [7:22] power users [7:24] make. You can buy a used [7:26] RTX 3090, wait for it. [7:29] It's an older card obviously, [7:31] but like I told you, [7:32] when it comes to local AI, the [7:34] graphics card's speed doesn't really [7:35] matter as much. It's all [7:36] VRAM. It's all VRAM. [7:37] It has 24 gigs of VRAM in [7:40] this older graphics card. [7:42] 24 gigs of counter space is [7:44] just a freaking [7:45] different world. At 24, [7:46] you can run a 32 billion [7:48] instruction model in [7:49] shorthand and still have room [7:51] left over for a [7:52] long conversation. [7:54] So models like Qwen3 32B [7:56] or DeepSeek R1 distilled [7:58] 32B, and there are new [8:00] models I haven't [8:01] tested yet, [8:01] but these are the models that [8:03] start rivaling cloud quality [8:04] for most everyday [8:06] use cases.
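The counter-size reasoning above can be turned into a quick lookup. This is only a sketch: the model sizes are the 4-bit cheat sheet numbers quoted earlier in the video, while the 4 GB of headroom reserved for the conversation and the function name are assumptions made for illustration.

```python
from typing import Optional

# Cheat sheet from the video: model size in billions -> counter space in GB
# at 4-bit compression.
CHEAT_SHEET_GB = {7: 5, 14: 10, 32: 20, 70: 40}


def largest_model_that_fits(vram_gb: float, headroom_gb: float = 4.0) -> Optional[int]:
    """Return the largest cheat-sheet model (in billions) that fits in VRAM.

    headroom_gb is an assumed amount of counter space kept free for the
    conversation memory; it is not a figure from the video.
    """
    usable = vram_gb - headroom_gb
    fitting = [size for size, gb in CHEAT_SHEET_GB.items() if gb <= usable]
    return max(fitting) if fitting else None


for vram in (8, 16, 24):
    print(f"{vram} GB card -> largest comfortable model: {largest_model_that_fits(vram)}")
```

With that assumed headroom, an 8 GB card has no comfortable option, a 16 GB card tops out around the 14 billion range, and a 24 GB card reaches the 32 billion range, which lines up with how the tiers are described here.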
[8:07] Rest of the build stays kind [8:08] of similar, you know, [8:09] Ryzen 7 processor, 64 gigs [8:11] of RAM, two terabyte SSD, [8:13] bigger the better. [8:15] On the Mac side though, [8:16] a Mac mini M4 Pro with 64 gigs [8:19] of unified memory [8:20] lands you here too. [8:21] That big shared counter means [8:23] you can load a 32 billion [8:24] instruction model [8:25] and still have breathing room. [8:28] Speed is slower [8:29] than the Nvidia cards, [8:30] but how slow are we talking? [8:32] About 10 to [8:36] 12 words per second. [8:38] Tokens per second [8:39] is the official term, [8:40] and it basically means how many [8:42] words the AI spits [8:44] out each second. [8:45] You know, like when you type [8:46] into ChatGPT and it [8:48] generates the output, [8:49] the speed of the output being [8:51] generated, that's [8:52] tokens per second. [8:53] Usually 10 to 15 feels like [8:55] a person typing very fast. [8:57] 30 plus feels [8:59] instant. So at 11 to 12, [9:02] you're at a comfortable range, [9:04] not blazing fast, but you know, [9:06] comfortable, and the Mac runs [9:08] quietly, sips power, and just [9:11] works, smooth like [9:12] butter. Now onto the next one. [9:16] This is only worth it [9:17] if your workflow [9:17] genuinely demands it. [9:19] Don't buy this because it [9:20] sounds cool. You gotta [9:22] take this seriously. [9:24] The centerpiece is an RTX 4090 [9:26] with 24 gigs of counter space, [9:28] paired with a Ryzen [9:29] 9 processor, 128 [9:31] gigs of system RAM, [9:33] which is a massive storage room [9:35] for overflow, right? [9:36] And a beefy power supply. This [9:38] runs 32 billion [9:39] instruction models [9:40] like butter: fast chef, [9:42] big counter, [9:43] long conversations, [9:45] complex agent chains.
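If you want to measure tokens per second on your own build rather than guess, here is a small sketch. It assumes you are using Ollama and reads the `eval_count` and `eval_duration` fields (duration in nanoseconds) that its generate endpoint reports when a response finishes. The helper names and the speed labels are illustrative; the 10 and 30 thresholds are loosely taken from the ranges mentioned above.

```python
def tokens_per_second(response: dict) -> float:
    """Compute generation speed from an Ollama-style response dict.

    Assumes 'eval_count' is the number of generated tokens and
    'eval_duration' is the generation time in nanoseconds.
    """
    seconds = response["eval_duration"] / 1e9
    return response["eval_count"] / seconds


def describe_speed(tps: float) -> str:
    """Rough labels based on the ranges in the video (thresholds are approximate)."""
    if tps >= 30:
        return "instant"
    if tps >= 10:
        return "comfortable"
    return "slow"


# Hypothetical response: 120 tokens generated in 10 seconds.
sample = {"eval_count": 120, "eval_duration": 10_000_000_000}
speed = tokens_per_second(sample)
print(f"{speed:.1f} tokens/s, {describe_speed(speed)}")
```

In practice the dictionary would come from a non-streaming request to a locally running Ollama server (by default at `http://localhost:11434/api/generate`). Note that a token is usually a piece of a word, so the words-per-second figure is a little lower than the token figure.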
[9:47] You can probably also [9:49] experiment with a 70 billion [9:50] instruction model at heavy [9:52] compression, but that's something [9:53] I haven't done myself, [9:55] so I can't really tell you [9:57] if it's actually [9:58] runnable, right? [9:59] I heard it's working, [10:01] but it's likely you're covering [10:02] the entire counter [10:03] with recipe pages. [10:05] You can expect trade-offs [10:07] on conversation [10:07] length because there's just [10:09] barely room for the dishes. [10:11] Now the Apple equivalent is the [10:13] Mac Studio with an [10:14] M3 Ultra chip and 96 [10:17] gigs of unified [10:18] memory, 96 gigs of counter. [10:21] This thing loads multiple [10:23] recipes at once: [10:25] a reasoning model, an embedding [10:27] model, a coding model, [10:28] all sitting on the counter [10:29] simultaneously, and it idles at [10:31] under a hundred [10:32] watts. The 4090 desktop will [10:34] draw five to 10 [10:36] times that under load. [10:37] One more thing. [10:38] You'll see people talking about [10:40] RTX 5090 builds with 32 gigs [10:43] of VRAM and dual [10:44] GPU setups pushing 64 gigs [10:46] total. That stuff exists, [10:49] but we're talking like a $10K [10:50] setup, $10K plus. [10:51] So that's like, [10:53] I'd say that's the highest end of [10:55] consumer-grade [10:56] graphics cards, [10:57] the end, the boss level. [10:59] This is for people who [11:01] actually want to train their [11:02] own AI model to do [11:05] something with it. [11:06] I don't know if you guys have [11:07] seen a recent PewDiePie video [11:10] where he trained his [11:11] own AI models for six plus [11:13] months to cross the, [11:15] I forgot the name of the [11:16] benchmark, but he did, which [11:18] was pretty interesting. [11:21] Anyway, quick note on the [11:23] Raspberry Pi. I love the Pi.
[11:25] I don't use it anymore, but I [11:26] made videos [11:26] about it in the past. [11:28] But it's not a [11:29] local AI daily driver. [11:30] A Pi 5 is great for [11:33] edge experiments, [11:35] like running OpenClaw for [11:36] sandbox agent execution, or [11:38] computer vision projects. But [11:40] if you're trying to [11:41] run a real [11:42] large language model [11:43] for chat or coding, [11:46] you need one of the three tiers [11:47] that I mentioned above instead [11:49] of the Raspberry [11:49] Pi. The Pi is the garage [11:52] workshop for small projects, [11:54] but the tiers above are [11:56] the actual kitchen. [11:59] Hardware is [12:00] half the equation. [12:01] Here's the software side, and [12:03] I'm going to keep [12:03] this very tight. [12:04] For getting models running, [12:06] two options dominate right now. [12:08] Ollama. This is the one that I use. [12:10] It's a command line tool, [12:11] so there's a bit of a learning [12:13] curve, but it's dead simple. You [12:15] type one command, [12:16] the model downloads and loads [12:17] onto your computer, and it's [12:19] just running there. [12:20] Works on Mac, Windows, and [12:21] Linux. You can download it [12:22] off of their website. [12:23] LM Studio, which is the next [12:25] one, is the same idea but with a [12:26] visual interface. [12:28] What I mean by visual interface [12:29] is like ChatGPT: there's a [12:30] chat window there. [12:32] So if the command line [12:33] interface kind of [12:34] makes you nervous, [12:36] you can start here with LM Studio. [12:38] Both of these handle model [12:39] downloading, graphics card [12:40] detection, and serving the [12:41] model locally so your other [12:43] tools can talk to it. [12:44] But as far as I remember, these [12:45] two have different [12:47] context window sizes. [12:48] Now here's a detail that'll [12:49] save you real frustration. [12:51] Model files come in different [12:52] packaging formats.
[12:53] Think of it like how the same [12:55] movie can be a [12:56] DVD or, you know, [12:58] a Blu-ray: same content, [13:00] different packaging optimized [13:01] for different players. [13:03] There's GGUF, say it "guff". It's [13:05] the format that [13:06] plays best on Macs. [13:08] AWQ is the one built for Nvidia [13:10] graphics cards. If [13:11] you're on a Mac, [13:12] just grab GGUF or MLX. If you're on [13:14] Windows or using a Linux server with an Nvidia card, [13:16] you can look into AWQ, because [13:17] from what I heard, there was a [13:19] test that showed [13:20] it gave faster [13:22] response times and better [13:23] quality output compared to [13:25] GGUF on the same card. Most [13:27] people don't know this. [13:28] They just grab whatever model [13:30] file has the most downloads. [13:32] And if they're using the wrong [13:33] format for their machine, [13:35] they're kind of leaving speed [13:36] on the table. For [13:38] agent workflows, [13:39] you can plug Ollama into tools [13:41] like n8n for automation, [13:44] CrewAI for multi-agent setups, [13:46] or build custom pipelines. [13:47] But that's a whole separate [13:48] video I'll cover later. [13:52] I want to be straight with you [13:53] because I think most [13:54] AI content online isn't. [13:57] A local AI agent is not a [13:59] replacement for cloud AI, the [14:01] closed frontier models. I'm [14:03] talking about ChatGPT, [14:05] Claude, Gemini. Just not yet. [14:07] Maybe not [14:08] ever for everything. [14:10] You know, the biggest, most [14:11] powerful reasoning models [14:13] still live in the cloud. [14:14] They're all U.S.-made. When I [14:16] need frontier-level [14:17] thinking on a complex [14:19] problem, I use Claude or GPT. [14:21] That's just [14:22] the truth. [14:23] But I now have very high [14:26] expectations for the new [14:28] DeepSeek 4 [14:29] coming. I don't know when.
[14:31] But hopefully it doesn't crash [14:32] the stock market [14:33] this time. Anyway, [14:35] here's what local does better [14:36] than anything in the cloud. [14:38] There's the privacy side, [14:40] right? Your data [14:40] never leaves your machine. [14:42] No terms of service, no [14:43] training on your prompts, [14:45] no training on your secrets, [14:47] your private life. [14:49] No API logs. Cost control: no [14:51] surprise bills, [14:53] which is very important. [14:54] No token meters running. [14:56] You just pay once for the [14:57] hardware and every conversation [14:59] after that is just [14:59] free. Uptime: maybe your [15:01] internet goes down. [15:02] Your local model does not care. [15:04] It just keeps [15:05] working. That's how it works. [15:06] It just lives on your device [15:08] once downloaded. [15:09] Think of it like this. [15:10] Local AI is your home gym and [15:13] cloud AI is like a commercial [15:15] gym downtown. Your home gym [15:17] can handle maybe [15:17] 80% of your workflows. [15:19] It's always open, always [15:20] private. You never [15:21] wait for a machine. [15:23] But once in a while you need [15:24] the heavy equipment downtown. [15:26] You might want to do some [15:27] deadlifts. That's [15:28] fine. You use both. [15:29] The smartest setup in 2026 and [15:31] probably beyond [15:33] is just a hybrid: [15:34] local for the daily work, cloud [15:36] for the heavy lifting. And [15:37] the builds I just [15:38] showed you are exactly where [15:40] that local side starts. [15:42] So I know you love a summary, so [15:43] here's the summary. [15:45] Buy the counter [15:45] space, not hand speed. [15:47] VRAM is the number that you [15:49] need to focus on. [15:50] Everything else is just [15:50] secondary. Budget the whole [15:52] kitchen, not just the chef. [15:53] If you're new, you can start at [15:55] tier one, the $1,200 [15:56] to $1,500 build.
[15:58] It's real, it's capable, and [15:59] you can always upgrade the [16:00] graphics card later. [16:02] Too easy, right? So if this [16:03] helped you figure [16:03] out your build, [16:05] why don't you drop a comment [16:06] with the tier that [16:07] you're going for? [16:07] I read every single comment. [16:09] And if you want the full parts list [16:10] with links inside of it, I'll [16:12] pin it in the comments. [16:14] Thanks for watching. Bye now.