[0:00] Hmm, 30 billion parameters in a new open
[0:04] free AI model where images, video, and
[0:07] audio all work. Hmm,
[0:09] why?
[0:10] There are a bunch of other free systems
[0:12] around in this area like the amazing
[0:15] Gemma 4. So, what does this do better
[0:18] than those? Two words, throughput and
[0:21] cost efficiency. Okay, what does that
[0:23] mean in practice? Now, hold on to your
[0:26] papers, fellow scholars, because it
[0:28] processes almost 10 hours of video per
[0:31] hour. Whoo, that is nearly 10 times real
[0:35] time. That is insanely quick. Wow,
[0:38] almost three times faster than Qwen3
[0:41] Omni. And when processing documents, it
[0:44] gets up to seven times faster. To run it
[0:47] locally, you'll want something like this
[0:49] or a beefy desktop GPU. We're talking
[0:52] about 25 gigs of video memory, not
[0:55] something you run on your phone. And to
[0:58] run it in the cloud, I use Lambda. Okay,
[1:01] so how did they do that? Where's the
[1:03] magic sauce? Well, it does five things
[1:06] really well and one thing not so well.
[1:08] Dear fellow scholars, this is Two Minute
[1:10] Papers with Dr. Károly Zsolnai-Fehér.
[1:13] Well, one, Mamba layers scale linearly
[1:16] with context length instead of
[1:18] quadratically. What does that mean?
[1:21] Well, it means you throw everything you
[1:23] got at it. The more documents you have,
[1:26] the longer video or audio you have, the
[1:28] bigger the advantage this one has. So,
[1:31] if you're running something online that
[1:33] processes those on a mass scale,
[1:36] this is going to be incredible. Two,
[1:39] when audio comes in, this side converts
[1:41] raw audio waves into tokens, but
[1:44] differently than elsewhere. Normally,
[1:47] you have a speech recognition model
[1:49] here. Those are often huge and expensive
[1:52] and strip away all emotion and tone from
[1:55] the input. But this one keeps all these
[1:58] data and still does the job well. So
[2:01] much cheaper than running a whole
[2:03] separate model like Whisper on top.
[2:05] Three, when you give it an image or
[2:08] video, many previous generation
[2:10] techniques smash it into a different
[2:12] aspect ratio. This one keeps it. Then,
[2:15] oh, look at this. Convolutions in 3D.
[2:21] Now we're talking. Many other techniques
[2:23] look at the video frame by frame. It
[2:26] takes tons and tons of computation to
[2:28] finish these videos. Here, the 3D
[2:31] convolution looks at blocks of frames.
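To make the block-of-frames idea concrete, here is a minimal sketch that only counts tokens. It assumes a non-overlapping 3D kernel whose stride equals its size; the function name, the 16-pixel patch, the 224x224 resolution, and the 2-frame block are all illustrative assumptions, not figures from the video or the model.

```python
# Hypothetical token-count sketch: frame-by-frame patching vs. 3D blocks.
# All sizes below are made-up illustration values, not the model's settings.

def tokens_per_video(frames, height, width, patch=16, frames_per_block=1):
    """Count tokens when each token covers a patch x patch area
    across frames_per_block consecutive frames (stride == kernel size)."""
    blocks = -(-frames // frames_per_block)  # ceiling division
    return blocks * (height // patch) * (width // patch)

if __name__ == "__main__":
    per_frame = tokens_per_video(300, 224, 224, patch=16, frames_per_block=1)
    blocked = tokens_per_video(300, 224, 224, patch=16, frames_per_block=2)
    print(per_frame, blocked)
```

Under these assumed sizes, grouping two frames per block halves the number of tokens the rest of the network has to process.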
[2:34] It looks at a package of frames at the
[2:36] same time, and thus it can compress it a
[2:39] great deal. Faster, cheaper. Four, now
[2:43] that's really interesting, somewhat
[2:45] unexpected. You would expect a huge
[2:48] standalone CLIP model here. These
[2:50] essentially predict what text would
[2:53] match the image well. You need that
[2:55] here, too. But, here's the trick. Not
[2:57] one standalone CLIP model. Nope, this
[3:00] one distills down three models. One for
[3:03] matching images to text, one for fine
[3:06] details, and one for object
[3:08] segmentation. Now, all three of these
[3:10] are smashed down into one small encoder
[3:14] neural network. Once again, super
[3:16] efficient. Five, efficient video
[3:19] sampling. This is a good one. At this
[3:21] point, we have thrown, let's say, a
[3:23] video with 300 images into the neural
[3:26] network. That's still a lot of data, but
[3:28] it turns out not all frames are
[3:31] completely unique. Many of them share
[3:33] the same background, for instance. And
[3:36] this one finally throws away this
[3:38] duplicate information.
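One way to picture dropping duplicate information is the sketch below. It is not the model's actual method: it assumes tiny grayscale frames stored as nested lists, compares each patch with the same patch in the previous frame, and keeps it only if it changed enough. The function names, the patch size, and the threshold are hypothetical.

```python
# Hypothetical sketch of pruning patches that repeat from frame to frame.
# Frames are small grayscale grids (lists of lists); threshold is an assumption.

def split_patches(frame, size):
    """Cut a 2D frame into non-overlapping size x size patches, row-major."""
    patches = []
    for top in range(0, len(frame), size):
        for left in range(0, len(frame[0]), size):
            patches.append([row[left:left + size] for row in frame[top:top + size]])
    return patches

def mean_abs_diff(a, b):
    """Average absolute pixel difference between two equally sized patches."""
    total = sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    count = sum(len(row) for row in a)
    return total / count

def keep_changed_patches(frames, size=2, threshold=0.05):
    """Return (frame_index, patch_index) pairs that differ from the previous frame.
    Every patch of the first frame is kept."""
    kept = []
    previous = None
    for f_idx, frame in enumerate(frames):
        patches = split_patches(frame, size)
        for p_idx, patch in enumerate(patches):
            if previous is None or mean_abs_diff(patch, previous[p_idx]) > threshold:
                kept.append((f_idx, p_idx))
        previous = patches
    return kept

if __name__ == "__main__":
    static = [[0.0] * 4 for _ in range(4)]
    moved = [row[:] for row in static]
    moved[0][0] = 1.0  # only the top-left patch changes in the last frame
    print(keep_changed_patches([static, static, moved]))
```

In this toy input, three 4x4 frames give 12 patches, and only 5 survive: the four from the first frame plus the one that changed.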
[3:40] And it makes it,
[3:42] you guessed it right, even cheaper and
[3:44] more efficient. Okay, scholarly
[3:47] question. So, what is the license
[3:49] attached to it? What I would love to see
[3:52] is Apache 2.0, which is highly permissive,
[3:55] and I don't see it here. It has its own
[3:58] license. That's usually not great news,
[4:00] but in this case, it's better than I
[4:02] thought. Derivative works and commercial
[4:05] use are fine. On the other hand, it needs
[4:07] a bit of attribution and is a little
[4:09] stricter on patent grants. If Apache 2.0
[4:13] were a 10 out of 10, this is a seven out
[4:16] of 10, in my opinion. And we don't shy
[4:19] away from talking about limitations
[4:21] here. So, anything else? Oh, yes.
[4:24] If you're doing pure text reasoning or
[4:27] pure coding, I would probably look
[4:29] elsewhere. It is not the number one
[4:32] smartest open model. No. But, if you
[4:35] need multimodal input, like audio or
[4:37] video, processed super fast and super
[4:40] cheap, this is the one.
[4:42] So, we now have free and open AI models
[4:45] that we can own and run ourselves,
[4:48] which is only going to get more and more
[4:50] important in the future. And since we
[4:53] have so many models, they are starting
[4:56] to specialize. They are becoming good in
[4:58] different directions. So, better models
[5:01] and more value for us, fellow scholars,
[5:04] for free.
[5:05] Sign me up for that. Hugely appreciated.
[5:08] What a time to be alive. Here you see me
[5:11] running the full DeepSeek AI model
[5:14] through Lambda GPU Cloud. 671
[5:18] billion parameters, running super fast
[5:21] and super reliably. This is insane. I
[5:25] love it and I use it on a regular basis.
[5:28] Lambda provides you with powerful Nvidia
[5:30] GPUs to run your own chatbots and
[5:33] experiments. Seriously, try it out now
[5:36] at lambda.ai/papers
[5:39] or click the link in the description.