[0:00] Hmm, 30 billion parameters in a new open [0:04] free AI model where images, video, and [0:07] audio all work. Hmm, [clears throat] [0:09] why? [0:10] There are a bunch of other free systems [0:12] around in this area like the amazing [0:15] Gemma 4. So, what does this do better [0:18] than those? Two words, throughput and [0:21] cost efficiency. Okay, what does that [0:23] mean in practice? Now, hold on to your [0:26] papers, fellow scholars, because it [0:28] processes almost 10 hours of video per [0:31] hour. Whoo, that is nearly 10 times real [0:35] time. That is insanely quick. Wow, [0:38] almost three times faster than [0:41] Qwen3-Omni. And when processing documents, it [0:44] gets up to seven times faster. To run it [0:47] locally, you'll want something like this [0:49] or a beefy desktop GPU. We're talking [0:52] about 25 gigs of video memory, not [0:55] something you run on your phone. And to [0:58] run it in the cloud, I use Lambda. Okay, [1:01] so how did they do that? Where's the [1:03] magic sauce? Well, it does five things [1:06] really well and one thing not so well. [1:08] Dear fellow scholars, this is Two Minute [1:10] Papers with Dr. Károly Zsolnai-Fehér. [1:13] Well, one, Mamba layers scale linearly [1:16] with context length instead of [1:18] quadratically. What does that mean? [1:21] Well, it means you can throw everything you've [1:23] got at it. The more documents you have, [1:26] the longer video or audio you have, the [1:28] bigger the advantage this one has. So, [1:31] if you're running something online that [1:33] processes those on a mass scale, [1:36] this is going to be incredible. Two, [1:39] when audio comes in, this side converts [1:41] raw audio waves into tokens, but [1:44] differently than elsewhere. Normally, [1:47] you have a speech recognition model [1:49] here. Those are often huge and expensive [1:52] and strip away all emotion and tone from [1:55] the input. 
But this one keeps all this [1:58] data and still does the job well. So [2:01] much cheaper than running a whole [2:03] separate model like Whisper on top. [2:05] Three, when you give it an image or [2:08] video, many previous generation [2:10] techniques smash it into a different [2:12] aspect ratio. This one keeps it. Then, [2:15] oh, look at this. Convolutions in 3D. [2:21] Now we're talking. Many other techniques [2:23] look at the video frame by frame. It [2:26] takes tons and tons of computation to [2:28] finish these videos. Here, the 3D [2:31] convolution looks at blocks of frames. [2:34] It looks at a package of frames at the [2:36] same time, and thus it can compress it a [2:39] great deal. Faster, cheaper. Four, now [2:43] that's really interesting, somewhat [2:45] unexpected. You would expect a huge [2:48] standalone CLIP model here. These [2:50] essentially predict what text would [2:53] match the image well. You need that [2:55] here, too. But, here's the trick. Not [2:57] one standalone CLIP model. Nope, this [3:00] one distills down three models. One for [3:03] matching images to text, one for fine [3:06] details, and one for object [3:08] segmentation. Now, all three of these [3:10] are smashed down into one small encoder [3:14] neural network. Once again, super [3:16] efficient. Five, efficient video [3:19] sampling. This is a good one. At this [3:21] point, we have thrown, let's say, a [3:23] video with 300 images into the neural [3:26] network. That's still a lot of data, but [3:28] it turns out not all frames are [3:31] completely unique. Many of them share [3:33] the same background, for instance. And [3:36] this one finally throws away this [3:38] duplicate information. [3:40] And it makes it, [3:42] you guessed it right, even cheaper and [3:44] more efficient. Okay, scholarly [3:47] question. So, what is the license [3:49] attached to it? What I would love to see is [3:52] Apache 2.0, which is highly permissive, [3:55] and I don't see it here. 
It has its own [3:58] license. That's usually not great news, [4:00] but in this case, it's better than I [4:02] thought. Derivative works and commercial [4:05] use are fine. On the other hand, it needs [4:07] a bit of attribution and is a little [4:09] stricter on patent grants. If Apache 2.0 [4:13] were a 10 out of 10, this is a seven out [4:16] of 10, in my opinion. And we don't shy [4:19] away from talking about limitations [4:21] here. So, anything else? Oh, yes. [4:24] If you're doing pure text reasoning or [4:27] pure coding, I would probably look [4:29] elsewhere. It is not the number one [4:32] smartest open model. No. But, if you [4:35] need multimodal input, like audio or [4:37] video, processed super fast and super [4:40] cheap, this is the one. [4:42] So, we now have free and open AI models [4:45] that we can own and run ourselves, [4:48] which is only going to get more and more [4:50] important in the future. And since we [4:53] have so many models, they are starting [4:56] to specialize. They are becoming good in [4:58] different directions. So, better models [5:01] and more value for us fellow scholars, [5:04] for free. [5:05] Sign me up for that. Hugely appreciated. [5:08] What a time to be alive. Here you see me [5:11] running the full DeepSeek AI model [5:14] through Lambda GPU Cloud. 671 [5:18] billion parameters, running super fast [5:21] and super reliably. This is insane. I [5:25] love it and I use it on a regular basis. [5:28] Lambda provides you with powerful Nvidia [5:30] GPUs to run your own chatbots and [5:33] experiments. Seriously, try it out now [5:36] at lambda.ai/papers [5:39] or click the link in the description.
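Going back to point five, efficient video sampling: the idea of throwing away duplicate information between frames can be sketched in a few lines of Python. This is only a minimal illustration and not the model's actual implementation. The patch representation, the names `patch_distance` and `prune_static_patches`, and the `threshold` value are all assumptions made up for this example.

```python
from typing import List, Tuple

Patch = List[float]   # hypothetical: one image patch as a flat list of pixel values
Frame = List[Patch]   # hypothetical: one frame as a list of patches


def patch_distance(a: Patch, b: Patch) -> float:
    """Mean absolute difference between two patches of equal length."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def prune_static_patches(frames: List[Frame], threshold: float = 0.05) -> List[Tuple[int, int]]:
    """Return the (frame_index, patch_index) pairs worth keeping.

    Every patch of the first frame is kept. In later frames, a patch is kept
    only if it differs from the same patch in the previous frame by more than
    `threshold` (an assumed value, chosen for illustration).
    """
    kept = []
    for t, frame in enumerate(frames):
        for p, patch in enumerate(frame):
            if t == 0 or patch_distance(patch, frames[t - 1][p]) > threshold:
                kept.append((t, p))
    return kept


# Three tiny frames with two patches each: patch 0 is a static background,
# patch 1 changes once between the first and second frame.
example_frames = [
    [[0.2, 0.2], [0.0, 0.0]],
    [[0.2, 0.2], [0.5, 0.5]],
    [[0.2, 0.2], [0.5, 0.5]],
]
kept = prune_static_patches(example_frames)
print(kept)                                   # which patches survive
print(len(kept), "of", 3 * 2, "patches kept")
```

In this toy case, the static background patch is stored once and the moving patch only when it changes, so fewer patches go on to the rest of the network. A real system would likely compare learned features rather than raw pixel values.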