DeepSeek’s New AI Is A Game Changer

Transcribed Jun 28, 2026

Intermediate · 4 min read · For: AI researchers, machine learning engineers, and tech enthusiasts interested in efficient visual reasoning techniques.

319.2K Views · 11.1K Likes · 565 Comments · 102 Dislikes · 3.7% 📈 Moderate

AI Summary

This video explores a new DeepSeek paper that introduces a technique allowing AI to 'point' at objects in images while reasoning, similar to how humans use their fingers. The approach needs about 90% fewer visual tokens than most frontier models while matching or beating almost all of them on benchmarks. The model is trained with a policy distillation objective, in which a student model learns from multiple expert models specialized in different visual tasks.

Chapters

1 The Problem: Describing Images Is Inefficient 0:00
2 The Solution: Pointing Like a Human 1:10
3 Performance: 90% Fewer Tokens, Matches Frontier Models 2:26
4 Validation: No Rigged Benchmarks 3:10
5 How It Works: Policy Distillation 4:10
6 Limitations and Future Potential 5:18

[0:00]

The Problem with Text-Based Visual Reasoning

Traditional AI describes images with words, which is error-prone and computationally expensive. The new technique uses visual primitives like bounding boxes and points, enabling more accurate and faster reasoning.

[1:10]

Pointing Like a Human

The AI can point at objects while thinking, like a human using a finger to count. This makes it more accurate and faster, and it needs about 90% fewer visual tokens than most frontier models.
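As a rough illustration of why pointing makes counting easier, the sketch below represents each pointed-at object as a labeled coordinate and counts the distinct locations. The `Point` class, the normalized coordinates, and the sample trace are all assumptions made up for this example; the video does not describe the paper's actual output format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    """Hypothetical visual primitive: one pointed-at location with a label."""
    x: float  # horizontal position, normalized to [0, 1] (assumed convention)
    y: float  # vertical position, normalized to [0, 1] (assumed convention)
    label: str


def count_by_pointing(points, label):
    """Count distinct pointed-at locations that carry the given label."""
    # A set removes locations that were pointed at more than once.
    seen = {(p.x, p.y) for p in points if p.label == label}
    return len(seen)


# Made-up reasoning trace: three people, one pointed at twice, plus a dog.
trace = [
    Point(0.12, 0.30, "person"),
    Point(0.25, 0.31, "person"),
    Point(0.40, 0.55, "person"),
    Point(0.25, 0.31, "person"),  # repeated point, counted only once
    Point(0.70, 0.80, "dog"),
]

print(count_by_pointing(trace, "person"))  # 3
```

The count falls out of the list of points directly, so there is no long verbal description to get wrong.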

[1:38]

Topological Reasoning and Transparency

The technique enables topological reasoning, e.g., tracing a path through a maze or identifying where a crown connects to an octopus, with a transparent thought process.
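One way to see why a visual trace is easier to audit than a text answer: if the model's path through a maze is a list of points, anyone can check it mechanically. The sketch below assumes a grid maze encoded as strings (`#` for walls, `.` for open cells) and a path given as `(row, column)` cells; both encodings are hypothetical and not taken from the paper.

```python
def is_valid_trace(maze, path, start, end):
    """Check that a traced path runs from start to end through open cells."""
    if not path or path[0] != start or path[-1] != end:
        return False
    for row, col in path:
        inside = 0 <= row < len(maze) and 0 <= col < len(maze[0])
        if not inside or maze[row][col] == "#":
            return False  # left the grid or walked into a wall
    for (r1, c1), (r2, c2) in zip(path, path[1:]):
        if abs(r1 - r2) + abs(c1 - c2) != 1:
            return False  # consecutive points must be adjacent cells
    return True


# Made-up 3x3 maze: '#' is a wall, '.' is an open cell.
maze = [
    "..#",
    "#..",
    "##.",
]
good_path = [(0, 0), (0, 1), (1, 1), (1, 2), (2, 2)]
wall_path = [(0, 0), (1, 0), (1, 1), (1, 2), (2, 2)]  # crosses the wall at (1, 0)

print(is_valid_trace(maze, good_path, (0, 0), (2, 2)))  # True
print(is_valid_trace(maze, wall_path, (0, 0), (2, 2)))  # False
```

When a trace fails such a check, the failing step shows where the reasoning went wrong.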

[3:02]

Performance: Matches Frontier Models

The free system matches or beats almost all frontier models on benchmarks, with in-house benchmarks excluded to avoid bias.
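To make the point about excluding in-house benchmarks concrete, the sketch below averages scores with and without a benchmark that a lab designed itself. The benchmark names and scores are placeholders invented for this example, not results from the paper.

```python
from statistics import mean


def public_average(scores, in_house):
    """Average benchmark scores, skipping any benchmark listed as in-house."""
    kept = [score for name, score in scores.items() if name not in in_house]
    if not kept:
        raise ValueError("no public benchmarks left to average")
    return mean(kept)


# Placeholder scores, NOT results from the paper.
scores = {"public_a": 70.0, "public_b": 80.0, "in_house_x": 99.0}

print(mean(scores.values()))                   # 83.0 with the in-house benchmark
print(public_average(scores, {"in_house_x"}))  # 75.0 without it
```

In this made-up case the self-made benchmark lifts the average by eight points, which is why leaving it out gives a fairer comparison.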

[4:23]

How It Works: Policy Distillation

The method uses policy distillation: training a student model by learning from multiple expert models, each specialized in different visual tasks (e.g., bounding boxes, maze tracing).
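The student-and-teachers idea can be sketched as a loss that measures how far the student's output distribution is from each expert's, averaged over tasks. This is a minimal sketch under stated assumptions, not the paper's objective: the forward KL direction, the per-task averaging, the task names, and the logits are all hypothetical.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same outcomes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(student_logits, teacher_logits):
    """How far the student's distribution is from one teacher's.

    Forward KL(teacher || student) is an assumption made for this sketch;
    other distillation variants use the reverse direction.
    """
    return kl_divergence(softmax(teacher_logits), softmax(student_logits))


def multi_teacher_loss(student_by_task, teacher_by_task):
    """Average the per-task loss, one expert teacher per task."""
    losses = [
        distillation_loss(student_by_task[task], teacher_by_task[task])
        for task in teacher_by_task
    ]
    return sum(losses) / len(losses)


# Made-up logits over three candidate actions for two hypothetical tasks.
teachers = {"boxes": [2.0, 0.5, 0.1], "maze": [0.1, 0.3, 2.5]}
student = {"boxes": [1.0, 1.0, 1.0], "maze": [0.1, 0.3, 2.5]}

loss = multi_teacher_loss(student, teachers)
print(f"average distillation loss: {loss:.4f}")
```

Here the student already agrees with the maze teacher, so only the boxes task contributes to the loss. Training would adjust the student to shrink that remaining gap.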

[5:48]

Limitations

Limitations include needing a word cue to initiate 'pointy thinking', struggles with thin structures like blades of grass, and limited generalization to completely new scenarios.

Clickbait Check

85% Legit

"The title accurately reflects the video's content: DeepSeek's new technique is indeed a game changer for visual reasoning, matching frontier models with far fewer tokens."

Mentioned in this Video

Lambda GPU Cloud (service)

DeepSeek (tool)

Dr. Károly Zsolnai-Fehér (person)

Study Flashcards (5)

What is the key innovation of DeepSeek's new technique?

easy

It allows AI to point at objects in images while reasoning, like a human using a finger to count.

1:10

How many fewer visual tokens does the new technique require compared to frontier models?

medium

About 90% fewer visual tokens than most frontier models.

2:43

How does the new technique's accuracy compare to billion-dollar frontier models?

medium

It matches or beats almost everything, despite being free and open.

3:02

What training method does the paper use to combine expert models?

hard

Policy distillation—training a student model by learning from multiple expert models specialized in different visual tasks.

4:23

What are three limitations of the new technique?

hard

It needs a word cue to initiate 'pointy thinking', struggles with thin structures like blades of grass, and topological reasoning doesn't generalize well to completely new scenarios.

5:48

💡 Key Takeaways

🔧

Pointing Like a Human

Introduces a novel approach where AI uses visual primitives (bounding boxes, points) instead of text descriptions, improving accuracy and speed.

1:10

📊

90% Fewer Visual Tokens

Demonstrates a massive reduction in computational cost while maintaining or improving performance.

2:43

💡

Matches Frontier Models

A free, open technique competes with billion-dollar systems, highlighting the democratization of AI.

3:02

⚖️

Less Is More

Challenges the assumption that higher resolution is always better; cutting tokens by 90% still beats frontier models.

5:37

✂️ Creator Tools: Viral Hooks

AI-generated clip ideas for Shorts based on the transcript

AI Learns to Point Like a Human

50s

The contrast between flawed AI description and human-like pointing is relatable and visually compelling, showcasing a breakthrough in AI reasoning.

Free AI Beats Billion-Dollar Models

50s

The shocking claim that a free system matches or outperforms expensive frontier models triggers curiosity and debate, perfect for viral engagement.

AI Benchmarks Exposed: No Rigging

50s

The revelation about benchmark manipulation and the paper's honest approach adds controversy and educational value, appealing to skeptics.

Less is More: 90% Fewer Tokens

50s

The counterintuitive insight that reducing visual tokens improves performance sparks interest and challenges common assumptions about AI scaling.

AI's Hidden Flaws: Thin Structures

50s

The humorous and relatable limitation about counting blades of grass humanizes the AI and adds authenticity, encouraging discussion.

Full Transcript

[00:00] Hmm, why does this DeepSeek work exist?

[00:02] I mean, it adds vision capabilities to

[00:05] the DeepSeek AI system, but that's not

[00:07] new. A lot of other AI systems have

[00:10] vision capabilities. You just drop an

[00:12] image here and it works. Even video and

[00:15] even for open models. So, why do we need

[00:18] this paper? Well, they did something

[00:21] incredible here and it is an absolute

[00:24] game changer. Why? You see, if you ask a

[00:27] previous technique to count the number

[00:29] of people in this photo, it will think

[00:32] something like this. Okay, there are

[00:34] people on the upper left and a bunch of

[00:36] stripy guys in two rows. That is kind of

[00:40] three rows. Some of them are standing,

[00:42] some of them are sitting.

[00:44] Ah, it's just so confusing to just count

[00:46] them up using only words. Two problems

[00:49] with this one. One, this is prone to

[00:51] error. Two, you have to think a lot.

[00:55] Just describing stuff. Why? What would

[00:58] we, humans, do? Of course, we would use

[01:01] our finger and would point at the image.

[01:04] One, two, three, and so on.

[01:07] Done. Don't describe images like a poet.

[01:10] Point like a human. Now, that is exactly

[01:13] what this new technique does. It allows

[01:16] an AI system to point at things while

[01:19] thinking and it is absolutely brilliant.

[01:22] This makes it more accurate and it also

[01:25] makes it faster. In a world where

[01:27] hardware and tokens cost a fortune, it

[01:30] is fantastic to have something that

[01:33] gives us results faster and cheaper.

[01:35] But, it turns out thinking with visual

[01:38] primitives has even more advantages. It

[01:41] can also do topological reasoning. For

[01:43] instance, if you give it a maze with a

[01:46] start and end point, you not only get a

[01:49] correct answer to your questions, but

[01:51] you can also trace back the whole

[01:54] thought process visually.

[01:56] I love that. Also, here you can ask

[01:59] where the crown connects and look.

[02:02] To the octopus. Yeah, it answers

[02:05] correctly, but you can also see how it

[02:08] came to that conclusion. Now, make no

[02:11] mistake. These are simple examples. I'll

[02:14] show you in a moment if it is as good as

[02:16] these billion-dollar frontier models.

[02:19] Also, if something goes wrong, this will

[02:21] make it easier to find mistakes and fix

[02:24] them to create an even better model.

[02:26] This puts us one step closer to AI

[02:29] systems we can actually understand that

[02:32] do not just give us a soup of numbers.

[02:34] So good. So, how good is it? Well, hold

[02:38] on to your papers, fellow scholars, and

[02:41] I dropped my papers here. Look, it needs

[02:43] about 90% fewer visual tokens than most

[02:47] frontier models. Now, wait, wait, wait.

[02:50] It doesn't matter how little you think

[02:52] if you just say three as an answer

[02:55] without thinking. Thinking time doesn't

[02:57] matter if it is incorrect. So, how

[03:00] accurate is it?

[03:02] Are you kidding me? This free system

[03:04] matches or beats almost everything. And

[03:08] once again, we are talking about this,

[03:10] which is free, going up against

[03:12] billion-dollar systems here. Wow. Now,

[03:16] we are fellow scholars here, so at this

[03:18] point we ask, are these results real?

[03:21] You know, benchmarks are being gamed

[03:23] left and right. Now, here is what many

[03:26] people missed. Average over seven

[03:29] benchmarks, but in-house benchmarks

[03:31] excluded.

[03:33] That is the key. They did not rig their

[03:35] own benchmarks. You know why? Well,

[03:38] everyone loves it because it's one of

[03:40] the oldest tricks in the book. If you

[03:42] are not performing well, just create a

[03:45] new benchmark that fits you. Let's make

[03:47] a you-ness benchmark. You will always be

[03:50] world first in being you. And this is

[03:53] not the case here. Amazing. This is free

[03:56] and open research. So, this technique

[03:58] can potentially be added to many

[04:00] existing models, including free ones.

[04:03] This paper does not have a model

[04:05] attached that I know of. It describes

[04:07] the concept of how to do it in detail.

[04:10] It's a blueprint, if you will. More

[04:12] intelligence for all of us for free.

[04:16] The world needs more papers like this.

[04:18] Love it. But, this all sounds like

[04:21] magic. How did they do this? Well, look,

[04:23] this is their own policy distillation

[04:26] objective. We need exactly this. You

[04:29] see, normally, we have a bunch of expert

[04:32] AI models. Now, at the risk of

[04:34] simplifying things, imagine that one of

[04:37] these guys is great at boxes. Nobody

[04:40] does boxes better than this guy. The

[04:42] other one is great at tracing mazes with

[04:45] points. But, that's not what we want.

[04:48] What we want is one AI that can do all

[04:51] of these things. And that is where this

[04:53] comes into play. We train a student

[04:56] model that learns from all of these

[04:58] teachers. It says what it would try to

[05:01] do, then the teachers say, "Okay, here's

[05:04] what I would have done." Do this enough

[05:06] and the student will be pretty good at

[05:09] all of these different kinds of visual

[05:11] thinking. This is why they used the name

[05:13] distilling the knowledge of a bunch of

[05:15] expert teachers into a student. So,

[05:19] where does this put us? Okay, so here's

[05:21] what I think. Dear fellow scholars, this

[05:23] is Two Minute Papers with Dr. Károly

[05:25] Zsolnai-Fehér. You know, we always

[05:27] thought that we would make AI systems

[05:29] smarter by giving it higher resolution

[05:32] images to train on. More pixels, more

[05:34] smarts. It turns out not true.

[05:37] Sometimes, that's not what we need at

[05:39] all. DeepSeek just cut down those

[05:42] visual tokens by 90% and still beat

[05:45] frontier models. Less is more. Now, is

[05:48] this perfect? All problems solved? No.

[05:52] Limitations. One, the AI does not

[05:55] automatically do this kind of pointy

[05:57] thinking. It needs a word as a cue for

[06:00] this kind of thinking. Two, bounding

[06:03] boxes are nice for people, but if you

[06:05] are counting blades of grass or strands

[06:08] of hair, now, in this case, not having

[06:11] those in very high resolution is a

[06:13] problem.

[06:15] >> [laughter]

[06:15] >> Yep, once again, the Two Minute Papers

[06:18] special, thin structures. Every time,

[06:22] man. It's so painful. And three, this

[06:25] kind of topological reasoning does not

[06:27] generalize as well as we'd like. It

[06:30] might not be as robust when you show it

[06:32] something completely new. So, careful

[06:35] with the misleading media headlines,

[06:37] careful with the hype everywhere. There

[06:39] is still plenty to improve here. But, I

[06:42] feel that this might be a breakthrough.

[06:44] And that makes it

[06:46] maybe the third one this month in AI

[06:48] research. What a time to be alive. Also,

[06:52] with large AI companies going to IPO,

[06:55] they are about to become ventures that

[06:57] look to maximize their profits. More

[06:59] money needed every quarter. So, it's

[07:02] going to become more and more crucial to

[07:05] own your own AI systems with free open

[07:07] weights models. And this one makes them

[07:10] better.

[07:11] Love it. Here you see me running the

[07:13] full DeepSeek AI model through Lambda

[07:16] GPU Cloud. 671

[07:20] billion parameters running super fast

[07:22] and super reliably. This is insane. I

[07:26] love it and I use it on a regular basis.

[07:29] Lambda provides you with powerful Nvidia

[07:32] GPUs to run your own chatbots and

[07:35] experiments. Seriously, try it out now

[07:37] at lambda.ai/papers

[07:40] or click the link in the description.