I ran 80B model on 16GB GPU - It's surprisingly good! (Qwen 3 Coder Next Review)

0h 05m video Published Feb 25, 2026 Transcribed Jul 28, 2026 Red Stapler

Intermediate 2 min read For: AI enthusiasts and developers interested in running large language models on consumer GPUs.

AI Trust Score 85/100

✅ Highly Legit

"Title accurately reflects the experiment's surprising success, though 'surprisingly good' is slightly subjective."

AI Summary

This video tests running the 80-billion-parameter Qwen 3 Coder Next model on a 16GB RTX 5060 Ti GPU using a 3-bit quantized version. Compared against Gemini 3.1 Pro, the model performs surprisingly well on moderate tasks but struggles with high complexity.

Chapters

1 Introduction and Setup 0:00
2 First Test: 3D Audio Visualizer 1:50
3 Second Test: Complex Web UI 2:34
4 Third Test: High Complexity 3:30
5 Final Test: Python Game 4:47
6 Conclusion 5:18

[0:00]

Experiment Setup

Testing an 80B model on a 16GB GPU, using a 3-bit quantized version (about 33 GB) so it fits across 16 GB of VRAM and 32 GB of system RAM.

[0:22]

Model Introduction

Alibaba's Qwen 3 Coder Next has 80B parameters but only 3B active, enabling consumer GPU use.
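
As a rough illustration of these figures, the Python sketch below works out what share of the parameters is active per token and how many bits are stored per weight on average. It assumes the 33 GB file size quoted in the transcript for the 3-bit Imatrix version and treats 1 GB as 10^9 bytes; the function names are made up for this example.

```python
# Figures quoted in the video. Treating 1 GB as 10**9 bytes is an assumption.
TOTAL_PARAMS = 80e9    # total parameters
ACTIVE_PARAMS = 3e9    # parameters active per token
QUANT_FILE_GB = 33     # size of the 3-bit Imatrix version


def active_fraction(active_params, total_params):
    """Share of the model's parameters that are active per token."""
    return active_params / total_params


def bits_per_weight(file_gb, total_params):
    """Average bits stored per parameter, derived from the file size."""
    return file_gb * 1e9 * 8 / total_params


print(f"Active share: {active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS):.2%}")
print(f"Bits per weight: {bits_per_weight(QUANT_FILE_GB, TOTAL_PARAMS):.2f}")
```

With these inputs, about 3.75% of the parameters are active per token and the file averages about 3.3 bits per weight.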

[1:29]

Configuration

50K context length, around half of the model offloaded to GPU, about 1 GB of VRAM left free, flash attention enabled, no KV cache quantization.
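
The split between VRAM and system RAM can be sketched with simple arithmetic. The Python below is a minimal sketch, not the tool used in the video: it takes the 33 GB model size, 16 GB of VRAM, 32 GB of RAM, and 1 GB of free VRAM from the transcript, and it ignores the KV cache and compute buffers, which also need VRAM. The function name `split_budget` and its parameters are hypothetical.

```python
def split_budget(model_gb, vram_gb, ram_gb, vram_reserve_gb):
    """Upper bound on how much of the model's weights can sit in VRAM.

    Ignores the KV cache and compute buffers, so the real GPU share
    would be lower than the value returned here.
    """
    gpu_weights_gb = min(model_gb, vram_gb - vram_reserve_gb)
    cpu_weights_gb = model_gb - gpu_weights_gb
    return {
        "gpu_weights_gb": gpu_weights_gb,
        "cpu_weights_gb": cpu_weights_gb,
        "gpu_share": gpu_weights_gb / model_gb,
        "ram_left_gb": ram_gb - cpu_weights_gb,
    }


# Numbers quoted in the video.
budget = split_budget(model_gb=33, vram_gb=16, ram_gb=32, vram_reserve_gb=1)
for name, value in budget.items():
    print(f"{name}: {value:.2f}")
```

With these inputs, at most 15 GB of weights (about 45% of the file) can sit in VRAM, which is consistent with the "around half" in the video, and the remaining 18 GB leaves about 14 GB of system RAM for everything else.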

[2:00]

First Test: 3D Audio Visualizer

Model created a working 3D audio visualizer in one shot in around 10 minutes.

[2:49]

Second Test: Complex Web UI

Model handled most of a complex prompt but missed some details, such as the blur layer for the glassmorphism navbar and some animations.

[3:30]

Third Test: High Complexity

Model struggled with a very detailed concept, producing broken horizontal scrolling and layout issues.

[4:47]

Final Test: Python Game

Model wrote a playable arcade spaceship shooter in 6 minutes, easily updated with sprites.

Running an 80B model on a 16GB GPU is surprisingly practical for moderate tasks, but high complexity still favors cloud models like Gemini Pro.

Mentioned in this Video

Qwen 3 Coder Next

tool

Gemini 3.1 Pro

tool

RTX 5060 Ti 16GB

tool

Study Flashcards (7)

What is the parameter count of Qwen 3 Coder Next?

easy

80 billion parameters.

0:24

How many active parameters does Qwen 3 Coder Next have?

easy

3 billion active parameters.

0:35

What quantization was used to fit the model on 16GB VRAM?

medium

3-bit Imatrix quantization.

1:11

What context length was used in the experiment?

medium

50K context length.

1:29

How long did the first test (3D audio visualizer) take?

easy

Around 10 minutes.

2:07

What was the main issue with the high complexity test?

medium

Horizontal scrolling was totally broken.

4:03

How long did the Python game test take?

easy

Around 6 minutes.

4:57

💡 Key Takeaways

💡

Impressive One-Shot Result

Demonstrates that a heavily quantized 80B model can produce a working 3D visualizer on consumer hardware.

2:20

📊

Comparison with Gemini Pro

Shows local model performance is not far from cloud model for moderate tasks.

3:12

⚖️

Practicality Verdict

Confirms that running large models locally is now practical for many tasks.

5:18

Full Transcript

[00:00] In previous video, we tested the limit

[00:02] of 8 gigs GPU by running a 30 billion

[00:04] models on it. In this video, we're going

[00:06] to test it even further by running a 80

[00:09] billion model on a 16 gigs card and then

[00:11] compare the result with Gemini 3.1 Pro.

[00:14] Will it works? Is it practical? We'll

[00:16] find out in this video.

[00:22] Couple weeks ago, Alibaba released a new

[00:24] coder model, Qwen 3 Coder Next, an 80

[00:27] billion parameters, yet score really

[00:30] well, almost on par with models much

[00:32] larger than its size. Qwen 3 Coder Next

[00:35] also has only 3 billion active

[00:37] parameters, which should allow us to run

[00:39] it on consumer graphics card with

[00:40] reasonably fast speed. And today, we're

[00:42] going to see if we can fit this model

[00:44] into a RTX 5060 Ti 16 gigs. So, here is

[00:48] my portable AI PC. 32 gigs of RAM and

[00:51] 16 gigs of VRAM. It's natural that we

[00:54] have to use a quantized version in order

[00:56] to run it. Looking at its size, at

[00:58] first, the 4-bit quant version seems

[01:01] like it will fit. However, we'll be left

[01:03] without any spare system memory to use

[01:06] and very limited context, so it's a bit

[01:08] impractical. The 3-bit Imatrix version

[01:11] with 33 gigs in size should give us

[01:13] plenty of breathing room with almost the

[01:15] same intelligence with Q3_K_M version. And

[01:18] if you use Unsloth quantized version, they

[01:20] even did some benchmark showing that

[01:22] their 3-bit Imatrix quant performed

[01:24] better than non-Unsloth Q4 and almost

[01:26] on par with the full precision model.

[01:29] For the setting, I'm using 50K context

[01:31] length and offload around half of the

[01:33] model to GPU, leaving 1 gigs of free

[01:36] VRAM. For the systems, I use flash

[01:39] attention but did not use KV cache

[01:41] quantization since the model itself was

[01:44] quantized enough as it is already. So I

[01:46] want to avoid any other potential

[01:47] quality degradation if possible. So the

[01:50] first test I'll ask it to create a 3D

[01:52] audio visualizer web app using Three.js

[01:55] particle to react to the sound using a

[01:57] blank web project as a base.

[02:00] Despite being an 80 billion model

[02:02] running on 16 gigs card, the token

[02:04] generation speed is pretty fast. The

[02:07] model completed the task in around 10

[02:08] minutes and here is the result.

[02:20] It works beautifully. I'm really amazed

[02:23] I can get this result in one shot from

[02:25] local AI running on 16 gigs card. We can

[02:28] definitely polish it further, but for

[02:30] the first try, this is a very good

[02:32] result. So, the next test, we're going

[02:34] for a specific prompt. I found this good

[02:37] prompt example on X that he used it to

[02:39] test Gemini 3.1 Pro. So, let's use it to

[02:42] test with Qwen 3 Next as well.

[02:49] The prompt again took only about 10

[02:51] minutes and the result is pretty good

[02:52] for the first attempt. The model handled

[02:54] most of the prompt correctly with the

[02:56] exception of some. For example, it

[02:58] missed the blur layer for the

[02:59] glassmorphism navbar. Some of the animation

[03:01] are not working and minor visual bug.

[03:12] However, if we compare the result with

[03:14] Gemini 3.1 Pro, even though Gemini

[03:16] handled the animation a lot better, Qwen

[03:18] 3 Coder Next result doesn't really look

[03:20] that far apart.

[03:30] So moving on to the next test, we'll add

[03:32] more complexity by giving the model very

[03:35] detailed concept such as Three.js particles,

[03:38] horizontal scrolling, several animation,

[03:41] and different section layout. This time

[03:43] the prompt took around 30 minutes to

[03:45] complete and here is the result. This

[03:48] time we start to see the limit of the

[03:50] heavily quantized 80 billion model with

[03:53] complex concept prompt. The first

[03:55] section, navbar and menu look quite okay

[03:58] at first, but the other sections are

[04:00] quite bad, especially the horizontal

[04:03] scrolling, which is totally broken.

[04:15] Gemini handled this test much better.

[04:47] The final test. We're switching from web

[04:50] development to Python. The prompt is to

[04:52] write a simple arcade spaceship shooter

[04:55] game. The model took around 6 minutes to

[04:57] complete the task. And here is the

[04:59] result. The game is playable at the

[05:02] first attempt. And since I didn't

[05:04] provide the sprite image, the model used

[05:06] simple rectangle for the spaceship

[05:07] instead. So I provided them and told the

[05:10] model to update the code. Here is the

[05:12] final result.

[05:18] So running an 80 billion model on a 16

[05:20] gig 5060 Ti isn't just a fun experiment,

[05:24] it's surprisingly practical. Even at

[05:26] 3-bit quantized, Qwen 3 Coder Next

[05:29] proved it can single-shot various tasks

[05:31] at moderate difficulty with impressive

[05:33] speed and accuracy. It still hits a wall

[05:36] on high complexity task which is where

[05:38] cloud model like Gemini Pro still hold

[05:41] the crown. But for local model on a

[05:43] consumer GPU, this is a massive leap

[05:45] forward. Let me know down in the

[05:47] comments what you want me to test next.

[05:49] And if you found this video helpful,

[05:50] like and subscribe for more. Thanks for

[05:53] watching. See you in the next one.

Topics #large language models #consumer gpu #qwen 3 coder #ai coding