Running 80B model on 16GB GPU?
Shocking premise challenges common belief that huge models need expensive hardware.
This video tests running the 80-billion-parameter Qwen 3 Coder Next model on a 16GB RTX 5060 Ti GPU using a 3-bit quantized version. The model performs surprisingly well on moderate tasks but struggles with high complexity, compared against Gemini 3.1 Pro.
Testing 80B model on 16GB GPU with 3-bit quantized version to fit VRAM.
Alibaba's Qwen 3 Coder Next has 80B parameters but only 3B active, enabling consumer GPU use.
50K context length, half model offloaded to GPU, 1GB free VRAM, flash attention enabled.
Model created a working 3D audio visualizer in one shot within 10 minutes.
Model handled most of a complex prompt but missed some details like blur layer and animations.
Model struggled with a very detailed concept, producing broken horizontal scrolling and layout issues.
Model wrote a playable arcade spaceship shooter in 6 minutes, easily updated with sprites.
Running an 80B model on a 16GB GPU is surprisingly practical for moderate tasks, but high complexity still favors cloud models like Gemini Pro.
"Title accurately reflects the experiment's surprising success, though 'surprisingly good' is slightly subjective."
What is the parameter count of Qwen 3 Coder Next?
80 billion parameters.
0:24
How many active parameters does Qwen 3 Coder Next have?
3 billion active parameters.
0:35
What quantization was used to fit the model on 16GB VRAM?
3-bit Imatrix quantization.
1:11
What context length was used in the experiment?
50K context length.
1:29
How long did the first test (3D audio visualizer) take?
Around 10 minutes.
2:07
What was the main issue with the high complexity test?
Horizontal scrolling was totally broken.
4:03
How long did the Python game test take?
Around 6 minutes.
4:57
Impressive One-Shot Result
Demonstrates that a heavily quantized 80B model can produce a working 3D visualizer on consumer hardware.
2:20
Comparison with Gemini Pro
Shows local model performance is not far from cloud model for moderate tasks.
3:12
Practicality Verdict
Confirms that running large models locally is now practical for many tasks.
5:18
[00:00] In the previous video, we tested the limit
[00:02] of an 8 gig GPU by running a 30 billion
[00:04] model on it. In this video, we're going
[00:06] to test it even further by running an 80
[00:09] billion model on a 16 gig card and then
[00:11] compare the result with Gemini 3.1 Pro.
[00:14] Will it work? Is it practical? We'll
[00:16] find out in this video.
[00:22] A couple of weeks ago, Alibaba released a new
[00:24] coder model, Qwen 3 Coder Next. It has 80
[00:27] billion parameters, yet scores really
[00:30] well, almost on par with models much
[00:32] larger than itself. Qwen 3 Coder Next
[00:35] also has only 3 billion active
[00:37] parameters, which should allow us to run
[00:39] it on a consumer graphics card at a
[00:40] reasonably fast speed. And today, we're
[00:42] going to see if we can fit this model
[00:44] into an RTX 5060 Ti 16 gig. So, here is
[00:48] my portable AI PC: 32 gigs of RAM and
[00:51] 16 gigs of VRAM. Naturally, we
[00:54] have to use a quantized version in order
[00:56] to run it. Looking at its size, at
[00:58] first, the 4-bit quant version seems
[01:01] like it will fit. However, we'd be left
[01:03] without any spare system memory to use
[01:06] and very limited context, so it's a bit
[01:08] impractical. The 3-bit Imatrix version,
[01:11] at 33 gigs in size, should give us
[01:13] plenty of breathing room with almost the
[01:15] same intelligence as the Q3_K_M version. And
[01:18] if you use the Unsloth quantized version, they
[01:20] even did some benchmarks showing that
[01:22] their 3-bit Imatrix quant performed
[01:24] better than non-Unsloth Q4 and almost
[01:26] on par with the full precision model.
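As a rough sanity check on the sizes mentioned here, the sketch below works out the average bits per weight implied by a 33 GB file for 80 billion parameters, and how much of the machine's combined 48 GB (32 GB RAM plus 16 GB VRAM) that leaves. The helper names are made up for illustration, the 4.8 bits-per-weight figure for a 4-bit quant is an assumed ballpark rather than a number from the video, and GB is treated as 10^9 bytes. Not all of the combined memory is usable in practice, since the operating system and other programs need some of it.

```python
# Back-of-the-envelope sizing for a quantized model (illustrative helper names).

def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


def implied_bits_per_weight(params_billion: float, size_gb: float) -> float:
    """Average bits per weight implied by a given file size."""
    return size_gb * 8 / params_billion


PARAMS_B = 80              # total parameters in billions (from the video)
RAM_GB, VRAM_GB = 32, 16   # the test machine (from the video)
Q3_SIZE_GB = 33            # size of the 3-bit Imatrix quant (from the video)
ASSUMED_Q4_BPW = 4.8       # ASSUMPTION: ballpark for a 4-bit quant, not from the video

total_gb = RAM_GB + VRAM_GB
q3_bpw = implied_bits_per_weight(PARAMS_B, Q3_SIZE_GB)
q4_size_gb = model_size_gb(PARAMS_B, ASSUMED_Q4_BPW)

print(f"3-bit quant: about {q3_bpw:.1f} bits/weight, "
      f"{total_gb - Q3_SIZE_GB} GB of {total_gb} GB left over")
print(f"4-bit quant (assumed): about {q4_size_gb:.0f} GB of {total_gb} GB")
```

Under these assumptions the 4-bit file would take up roughly all of the combined memory, which is consistent with the remark that it leaves no spare system memory and very little room for context.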
[01:29] For the settings, I'm using a 50K context
[01:31] length and offloading around half of the
[01:33] model to the GPU, leaving 1 gig of free
[01:36] VRAM. For the system, I use flash
[01:39] attention but did not use KV cache
[01:41] quantization, since the model itself is
[01:44] quantized enough as it is already. I
[01:46] want to avoid any other potential
[01:47] quality degradation if possible. So for the
[01:50] first test, I'll ask it to create a 3D
[01:52] audio visualizer web app using Three.js
[01:55] particles that react to sound, using a
[01:57] blank web project as a base.
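The offload setting described above can be checked with simple arithmetic. This is only a sketch: `offload_split` is a hypothetical helper, and `gpu_overhead_gb` (the KV cache and compute buffers that also live in VRAM) is a placeholder value, since the video does not give that breakdown. With 16 GB of VRAM and 1 GB kept free, at most 15 GB of the 33 GB model can sit on the GPU, which is in line with "around half"; the rest has to stay in system RAM.

```python
# Rough split of model weights between GPU and system RAM (illustrative only).

def offload_split(model_gb, vram_gb, vram_free_gb, gpu_overhead_gb):
    """Return (gpu_weights_gb, cpu_weights_gb) for a partial offload.

    gpu_overhead_gb is VRAM used by things other than weights
    (KV cache, compute buffers). It is a placeholder here.
    """
    gpu_budget = vram_gb - vram_free_gb - gpu_overhead_gb
    gpu_weights = max(0, min(model_gb, gpu_budget))
    return gpu_weights, model_gb - gpu_weights


MODEL_GB = 33      # 3-bit quant size (from the video)
VRAM_GB = 16       # RTX 5060 Ti (from the video)
VRAM_FREE_GB = 1   # headroom left free (from the video)

# Upper bound: pretend nothing but weights is stored in VRAM.
gpu_max, cpu_min = offload_split(MODEL_GB, VRAM_GB, VRAM_FREE_GB, 0)
print(f"at most {gpu_max} GB on GPU ({gpu_max / MODEL_GB:.0%}), "
      f"at least {cpu_min} GB in system RAM")

# ASSUMPTION: 2 GB of VRAM goes to KV cache and buffers (hypothetical figure).
gpu_w, cpu_w = offload_split(MODEL_GB, VRAM_GB, VRAM_FREE_GB, 2)
print(f"with 2 GB overhead: {gpu_w} GB on GPU, {cpu_w} GB in system RAM")
```

Either way, the portion left in system RAM fits within the machine's 32 GB, which is why the partial offload is workable at all.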
[02:00] Despite being an 80 billion model
[02:02] running on a 16 gig card, the token
[02:04] generation speed is pretty fast. The
[02:07] model completed the task in around 10
[02:08] minutes and here is the result.
[02:20] It works beautifully. I'm really amazed
[02:23] I can get this result in one shot from
[02:25] a local AI running on a 16 gig card. We can
[02:28] definitely polish it further, but for
[02:30] the first try, this is a very good
[02:32] result. For the next test, we're going
[02:34] for a specific prompt. I found this good
[02:37] prompt example on X that someone used to
[02:39] test Gemini 3.1 Pro. So, let's use it to
[02:42] test Qwen 3 Coder Next as well.
[02:49] The prompt again took only about 10
[02:51] minutes and the result is pretty good
[02:52] for the first attempt. The model handled
[02:54] most of the prompt correctly, with
[02:56] some exceptions. For example, it
[02:58] missed the blur layer for the
[02:59] glassmorphism navbar, some of the animations
[03:01] are not working, and there are minor visual bugs.
[03:12] However, if we compare the result with
[03:14] Gemini 3.1 Pro, even though Gemini
[03:16] handled the animation a lot better, the Qwen
[03:18] 3 Coder Next result doesn't really look
[03:20] that far apart.
[03:30] Moving on to the next test, we'll add
[03:32] more complexity by giving the model a very
[03:35] detailed concept with Three.js particles,
[03:38] horizontal scrolling, several animations,
[03:41] and different section layouts. This time
[03:43] the prompt took around 30 minutes to
[03:45] complete and here is the result. This
[03:48] time we start to see the limit of the
[03:50] heavily quantized 80 billion model with
[03:53] a complex concept prompt. The first
[03:55] section, navbar and menu look quite okay
[03:58] at first, but the other sections are
[04:00] quite bad, especially the horizontal
[04:03] scrolling, which is totally broken.
[04:15] Gemini handled this test much better.
[04:47] For the final test, we're switching from web
[04:50] development to Python. The prompt is to
[04:52] write a simple arcade spaceship shooter
[04:55] game. The model took around 6 minutes to
[04:57] complete the task. And here is the
[04:59] result. The game is playable on the
[05:02] first attempt. And since I didn't
[05:04] provide sprite images, the model used
[05:06] simple rectangles for the spaceships
[05:07] instead. So I provided them and told the
[05:10] model to update the code. Here is the
[05:12] final result.
[05:18] So running an 80 billion model on a 16
[05:20] gig 5060 Ti isn't just a fun experiment,
[05:24] it's surprisingly practical. Even at
[05:26] 3-bit quantization, Qwen 3 Coder Next
[05:29] proved it can single-shot various tasks
[05:31] of moderate difficulty with impressive
[05:33] speed and accuracy. It still hits a wall
[05:36] on high complexity tasks, which is where
[05:38] cloud models like Gemini Pro still hold
[05:41] the crown. But for a local model on a
[05:43] consumer GPU, this is a massive leap
[05:45] forward. Let me know down in the
[05:47] comments what you want me to test next.
[05:49] And if you found this video helpful,
[05:50] like and subscribe for more. Thanks for
[05:53] watching. See you in the next one.