[0:00] Hmm, why does this DeepSeek work exist?
[0:02] I mean, it adds vision capabilities to
[0:05] the DeepSeek AI system, but that's not
[0:07] new. A lot of other AI systems have
[0:10] vision capabilities. You just drop an
[0:12] image here and it works. Even video and
[0:15] even for open models. So, why do we need
[0:18] this paper? Well, they did something
[0:21] incredible here and it is an absolute
[0:24] game changer. Why? You see, if you ask a
[0:27] previous technique to count the number
[0:29] of people in this photo, it will think
[0:32] something like this. Okay, there are
[0:34] people on the upper left and a bunch of
[0:36] stripy guys in two rows. That is kind of
[0:40] three rows. Some of them are standing,
[0:42] some of them are sitting.
[0:44] Ah, it's just so confusing to just count
[0:46] them up using only words. Two problems
[0:49] with this one. One, this is prone to
[0:51] error. Two, you have to think a lot.
[0:55] Just describing stuff. Why? What would
[0:58] we, humans, do? Of course, we would use
[1:01] our finger and would point at the image.
[1:04] One, two, three, and so on.
[1:07] Done. Don't describe images like a poet.
[1:10] Point like a human. Now, that is exactly
[1:13] what this new technique does. It allows
[1:16] an AI system to point at things while
[1:19] thinking and it is absolutely brilliant.
[1:22] This makes it more accurate and it also
[1:25] makes it faster. In a world where
[1:27] hardware and tokens cost a fortune, it
[1:30] is fantastic to have something that
[1:33] gives us results faster and cheaper.
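To make the "point, don't describe" idea concrete, here is a minimal sketch of counting by pointing. It is an illustration only, not the paper's implementation: the `Point` type, the normalized coordinates, the `min_dist` threshold, and the example trace are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    """A pointed-at image location in normalized [0, 1] coordinates (assumed format)."""
    x: float
    y: float
    label: str


def count_by_pointing(points, label, min_dist=0.02):
    """Count distinct points with the given label, merging near-duplicates."""
    kept = []
    for p in points:
        if p.label != label:
            continue
        # Skip a point that lands on top of one we already counted.
        if any((p.x - q.x) ** 2 + (p.y - q.y) ** 2 < min_dist ** 2 for q in kept):
            continue
        kept.append(p)
    return len(kept)


# Hypothetical points a model might emit while "thinking" about a photo.
trace = [
    Point(0.10, 0.20, "person"),
    Point(0.30, 0.55, "person"),
    Point(0.30, 0.55, "person"),  # same spot pointed at twice, counted once
    Point(0.70, 0.60, "person"),
    Point(0.50, 0.90, "ball"),
]

print(count_by_pointing(trace, "person"))  # prints 3
```

The count falls out of a short list of coordinates instead of a long verbal description, which is the intuition behind both the accuracy and the token savings.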
[1:35] But, it turns out thinking with visual
[1:38] primitives has even more advantages. It
[1:41] can also do topological reasoning. For
[1:43] instance, if you give it a maze with a
[1:46] start and end point, you not only get a
[1:49] correct answer to your questions, but
[1:51] you can also trace back the whole
[1:54] thought process visually.
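The value of a visual trace is that anyone can check it step by step. The sketch below uses a tiny hand-made grid maze and plain breadth-first search as a stand-in for the model; the paper's model does not necessarily search this way. The maze layout and the function names are assumptions for illustration. The part that matters is `is_valid_trace`, which audits a path of points without trusting whoever produced it.

```python
from collections import deque

# Hypothetical maze: S = start, E = end, # = wall, . = free cell.
MAZE = [
    "S.#",
    ".##",
    "..E",
]


def find(maze, ch):
    """Return the (row, col) of the first cell containing ch."""
    for r, row in enumerate(maze):
        c = row.find(ch)
        if c != -1:
            return (r, c)
    raise ValueError(f"{ch!r} not found in maze")


def trace_path(maze):
    """Breadth-first search that returns the path as a list of (row, col) points."""
    start, goal = find(maze, "S"), find(maze, "E")
    parent = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:  # walk the parent links back to the start
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < len(maze) and 0 <= nc < len(maze[0])
            if inside and maze[nr][nc] != "#" and (nr, nc) not in parent:
                parent[(nr, nc)] = cell
                queue.append((nr, nc))
    return None  # no route from S to E


def is_valid_trace(maze, path):
    """Audit a trace: right endpoints, no walls, only single-cell moves."""
    if not path or path[0] != find(maze, "S") or path[-1] != find(maze, "E"):
        return False
    if any(maze[r][c] == "#" for r, c in path):
        return False
    return all(abs(r1 - r2) + abs(c1 - c2) == 1
               for (r1, c1), (r2, c2) in zip(path, path[1:]))


path = trace_path(MAZE)
print(path)
print(is_valid_trace(MAZE, path))
```

A wrong answer shows up as a trace that crosses a wall or skips a cell, which is much easier to spot than an error buried in a paragraph of text.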
[1:56] I love that. Also, here you can ask
[1:59] where the crown connects and look.
[2:02] To the octopus. Yeah, it answers
[2:05] correctly, but you can also see how it
[2:08] came to that conclusion. Now, make no
[2:11] mistake. These are simple examples. I'll
[2:14] show you in a moment if it is as good as
[2:16] these billion-dollar frontier models.
[2:19] Also, if something goes wrong, this will
[2:21] make it easier to find mistakes and fix
[2:24] them to create an even better model.
[2:26] This puts us one step closer to AI
[2:29] systems we can actually understand that
[2:32] do not just give us a soup of numbers.
[2:34] So good. So, how good is it? Well, hold
[2:38] on to your papers, fellow scholars, and
[2:41] I dropped my papers here. Look, it needs
[2:43] about 90% fewer visual tokens than most
[2:47] frontier models. Now, wait, wait, wait.
[2:50] It doesn't matter how little you think
[2:52] if you just say three as an answer
[2:55] without thinking. Thinking time doesn't
[2:57] matter if it is incorrect. So, how
[3:00] accurate is it?
[3:02] Are you kidding me? This free system
[3:04] matches or beats almost everything. And
[3:08] once again, we are talking about this,
[3:10] which is free, going up against
[3:12] billion-dollar systems here. Wow. Now,
[3:16] we are fellow scholars here, so at this
[3:18] point we ask, are these results real?
[3:21] You know, benchmarks are being gamed
[3:23] left and right. Now, here is what many
[3:26] people missed. Average over seven
[3:29] benchmarks, but in-house benchmarks
[3:31] excluded.
[3:33] That is the key. They did not rig their
[3:35] own benchmarks. You know why? Well,
[3:38] everyone loves it because it's one of
[3:40] the oldest tricks in the book. If you
[3:42] are not performing well, just create a
[3:45] new benchmark that fits you. Let's make
[3:47] a "you-ness" benchmark. You will always be
[3:50] world first in being you. And this is
[3:53] not the case here. Amazing. This is free
[3:56] and open research. So, this technique
[3:58] can potentially be added to many
[4:00] existing models, including free ones.
[4:03] This paper does not have a model
[4:05] attached that I know of. It describes
[4:07] the concept of how to do it in detail.
[4:10] It's a blueprint, if you will. More
[4:12] intelligence for all of us for free.
[4:16] The world needs more papers like this.
[4:18] Love it. But, this all sounds like
[4:21] magic. How did they do this? Well, look,
[4:23] this is their on-policy distillation
[4:26] objective. We need exactly this. You
[4:29] see, normally, we have a bunch of expert
[4:32] AI models. Now, at the risk of
[4:34] simplifying things, imagine that one of
[4:37] these guys is great at boxes. Nobody
[4:40] does boxes better than this guy. The
[4:42] other one is great at tracing mazes with
[4:45] points. But, that's not what we want.
[4:48] What we want is one AI that can do all
[4:51] of these things. And that is where this
[4:53] comes into play. We train a student
[4:56] model that learns from all of these
[4:58] teachers. It says what it would try to
[5:01] do, then the teachers say, "Okay, here's
[5:04] what I would have done." Do this enough
[5:06] and the student will be pretty good at
[5:09] all of these different kinds of visual
[5:11] thinking. This is why they use the name
[5:13] distillation: distilling the knowledge of a bunch of
[5:15] expert teachers into a student. So,
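The student-proposes, teacher-corrects loop described above can be sketched as a loss function. This is a rough sketch under stated assumptions, not the paper's objective: it assumes a per-token reverse KL divergence between the student's and the teacher's next-token distributions, evaluated on an answer the student itself sampled, which is one common way such a distillation loss is written. The function names and the toy probabilities are hypothetical.

```python
import math


def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) at one token position (assumed loss form)."""
    return sum(
        s * (math.log(s + eps) - math.log(t + eps))
        for s, t in zip(student_probs, teacher_probs)
        if s > 0
    )


def distillation_loss(positions):
    """Mean reverse KL over the token positions of a student-sampled answer.

    positions: list of (student_probs, teacher_probs) pairs, one per token,
    where the teacher is the expert for the task at hand (boxes, mazes, ...).
    """
    return sum(reverse_kl(s, t) for s, t in positions) / len(positions)


# Toy vocabulary of two tokens; numbers are made up for illustration.
agree = [([0.5, 0.5], [0.5, 0.5])]
disagree = [([0.9, 0.1], [0.5, 0.5])]

print(distillation_loss(agree))     # zero when student and teacher agree
print(distillation_loss(disagree))  # positive when they disagree
```

Minimizing this loss pulls the student toward what each teacher would have done on the student's own attempts, so one model ends up absorbing several experts' skills.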
[5:19] where does this put us? Okay, so here's
[5:21] what I think. Dear fellow scholars, this
[5:23] is Two Minute Papers with Dr. Károly
[5:25] Zsolnai-Fehér. You know, we always
[5:27] thought that we would make AI systems
[5:29] smarter by giving them higher-resolution
[5:32] images to train on. More pixels, more
[5:34] smarts. It turns out not true.
[5:37] Sometimes, that's not what we need at
[5:39] all. DeepSeek just cut down those
[5:42] visual tokens by 90% and still beat
[5:45] frontier models. Less is more. Now, is
[5:48] this perfect? All problems solved? No.
[5:52] Limitations. One, the AI does not
[5:55] automatically do this kind of pointy
[5:57] thinking. It needs a word as a cue for
[6:00] this kind of thinking. Two, bounding
[6:03] boxes are nice for people, but if you
[6:05] are counting blades of grass or strands
[6:08] of hair, now, in this case, not having
[6:11] those in very high resolution is a
[6:13] problem.
[6:15] >> [laughter]
[6:15] >> Yep, once again, the Two Minute Papers
[6:18] special, thin structures. Every time,
[6:22] man. It's so painful. And three, this
[6:25] kind of topological reasoning does not
[6:27] generalize as well as we'd like. It
[6:30] might not be as robust when you show it
[6:32] something completely new. So, careful
[6:35] with the misleading media headlines,
[6:37] careful with the hype everywhere. There
[6:39] is still plenty to improve here. But, I
[6:42] feel that this might be a breakthrough.
[6:44] And that makes it
[6:46] maybe the third one this month in AI
[6:48] research. What a time to be alive. Also,
[6:52] with large AI companies going to IPO,
[6:55] they are about to become ventures that
[6:57] look to maximize their profits. More
[6:59] money needed every quarter. So, it's
[7:02] going to become more and more crucial to
[7:05] own your own AI systems with free
[7:07] open-weights models. And this one makes them
[7:10] better.
[7:11] Love it. Here you see me running the
[7:13] full DeepSeek AI model through Lambda
[7:16] GPU Cloud. 671
[7:20] billion parameters running super fast
[7:22] and super reliably. This is insane. I
[7:26] love it and I use it on a regular basis.
[7:29] Lambda provides you with powerful Nvidia
[7:32] GPUs to run your own chatbots and
[7:35] experiments. Seriously, try it out now
[7:37] at lambda.ai/papers
[7:40] or click the link in the description.