10 Hours of Video Processed Per Hour!
The insane speed improvement (10x real-time) creates a jaw-dropping 'wow' moment that viewers love to share.
A new open, free AI model with 30 billion parameters processes images, video, and audio with unmatched throughput and cost efficiency. It achieves nearly 10x real-time video processing, up to 7x faster document processing, and is optimized for low-cost multimodal inference.
A new free AI model handling images, video, and audio, with 30 billion parameters.
Compared to models like Gemma 4, this model excels in throughput and cost efficiency.
Processes almost 10 hours of video per hour, which is nearly 10x real-time and almost 3x faster than Qwen 3 Omni.
When processing documents, the model gets up to 7 times faster than competitors.
To run it locally, you need a beefy desktop GPU with about 25 GB of video memory. For the cloud, the presenter uses Lambda.
Memory layers scale linearly with context length instead of quadratically, a huge advantage for long documents and videos.
Converts raw audio waves into tokens without stripping emotion/tone, cheaper than separate models like Whisper.
Processes blocks of frames simultaneously (3D convolution) instead of frame-by-frame, compressing computation.
Three models (image-to-text, fine details, object segmentation) distilled into one small encoder network.
Removes duplicate information between frames (e.g., same background) to reduce data and computation.
Custom license allows commercial use and derivative works with attribution, stricter on patents. Not Apache 2.0.
For pure text reasoning or coding, look elsewhere. Excels in multimodal input (audio, video) at speed and cost.
This new open AI model offers a breakthrough in multimodal efficiency—processing video, audio, and documents at remarkable speeds while being cost-effective to run, though it lags in pure text tasks.
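To make the linear-versus-quadratic point concrete, here is a minimal Python sketch. It is a toy cost model, not a measurement of this model: the function names and the unit costs are assumptions made up for illustration. It charges one unit of work per token for a linearly scaling layer and one unit per pair of tokens for standard full attention.

```python
def linear_cost(num_tokens: int) -> int:
    """Toy cost of a layer that does a fixed amount of work per token."""
    return num_tokens


def quadratic_cost(num_tokens: int) -> int:
    """Toy cost of full self-attention: every token is compared with every token."""
    return num_tokens * num_tokens


def advantage(num_tokens: int) -> float:
    """How many times cheaper the linear layer is in this toy model."""
    return quadratic_cost(num_tokens) / linear_cost(num_tokens)


if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000):
        print(f"{n:>7} tokens: linear={linear_cost(n):>7}  "
              f"quadratic={quadratic_cost(n):>12}  ratio={advantage(n):,.0f}x")
```

In this toy model, a 10 times longer input costs 10 times more for the linear layer and 100 times more for the quadratic one, which is why the gap keeps widening as documents and videos get longer.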
"Title is accurate—the model excels in throughput and cost efficiency for multimodal data, though it's not the smartest for text/code."
What is the key advantage of this new AI model compared to others like Gemma 4?
Throughput and cost efficiency.
0:10
How many times faster is video processing compared to real-time?
Nearly 10 times real-time (10 hours of video per hour).
0:28
How does memory layer scaling differ from previous approaches?
Memory layers scale linearly with context length instead of quadratically.
1:13
How is raw audio processed differently?
Raw audio waves are converted directly into tokens, preserving emotion and tone, without needing a separate speech recognition model like Whisper.
1:39
What technique does the model use for video processing that compresses computation?
3D convolutions that look at blocks of frames simultaneously instead of frame-by-frame.
2:21
How are CLIP models handled in this architecture?
Three models (image-to-text, fine details, object segmentation) are distilled into one small encoder network.
2:57
What is the license rating compared to Apache 2.0 (10/10)?
7 out of 10. Allows commercial use and derivative works with attribution, but stricter on patents.
4:05
When should you NOT use this model?
For pure text reasoning or pure coding, where it is not the smartest open model.
4:24
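The 3D convolution answer above (2:21) can be sketched in a few lines of plain Python. This is only an illustration of looking at blocks of frames at once: the block size, the single grayscale channel, and the fixed averaging weights are assumptions, since the source does not give the model's actual kernel, and a real network would learn its weights.

```python
from typing import List

Video = List[List[List[float]]]  # indexed as [frame][row][column], one channel


def conv3d_blocks(video: Video, kt: int, kh: int, kw: int) -> Video:
    """Strided 3D convolution with an averaging kernel.

    The kernel covers kt frames, kh rows and kw columns, and the stride equals
    the kernel size, so each output value summarizes one non-overlapping block.
    Trailing frames, rows or columns that do not fill a whole block are dropped.
    """
    frames, rows, cols = len(video), len(video[0]), len(video[0][0])
    weight = 1.0 / (kt * kh * kw)  # fixed averaging weights (assumption)
    out: Video = []
    for t in range(0, frames - kt + 1, kt):
        plane = []
        for y in range(0, rows - kh + 1, kh):
            line = []
            for x in range(0, cols - kw + 1, kw):
                total = 0.0
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            total += video[t + dt][y + dy][x + dx]
                line.append(total * weight)
            plane.append(line)
        out.append(plane)
    return out


def count_values(video: Video) -> int:
    """Total number of values across all frames."""
    return sum(len(row) for frame in video for row in frame)


if __name__ == "__main__":
    # 8 frames of 4x4 pixels; every pixel holds its frame index.
    clip = [[[float(t)] * 4 for _ in range(4)] for t in range(8)]
    packed = conv3d_blocks(clip, kt=2, kh=2, kw=2)
    print(count_values(clip), "->", count_values(packed))
```

With a 2x2x2 block, every output value stands in for 8 input values, so the later layers see an 8 times smaller input than they would if each frame were passed through pixel by pixel.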
10x Real-Time Video Processing
Shows a dramatic speed improvement (almost 10 hours per hour) for video analysis, a key differentiator.
0:28
Linear Memory Scaling
Explains the underlying breakthrough—linear vs. quadratic scaling—that enables handling large contexts cheaply.
1:13
3D Convolutions for Video Speed
Demonstrates an elegant, efficient approach to video processing that compresses computation significantly.
2:21
Efficient Video Sampling Removes Duplicates
Highlights a practical optimization: discarding redundant inter-frame data reduces cost and increases speed.
3:40
Multimodal Excellence, Not Best for Text
Clarifies the model's specific niche—multimodal speed and cost—and its limitations for pure text tasks.
4:24
[00:00] Hmm, 30 billion parameters in a new open
[00:04] free AI model where images, video, and
[00:07] audio all work. Hmm, [clears throat]
[00:09] why?
[00:10] There are a bunch of other free systems
[00:12] around in this area like the amazing
[00:15] Gemma 4. So, what does this do better
[00:18] than those? Two words, throughput and
[00:21] cost efficiency. Okay, what does that
[00:23] mean in practice? Now, hold on to your
[00:26] papers, fellow scholars, because it
[00:28] processes almost 10 hours of video per
[00:31] hour. Whoo, that is nearly 10 times real
[00:35] time. That is insanely quick. Wow,
[00:38] almost three times faster than Qwen 3
[00:41] Omni. And when processing documents, it
[00:44] gets up to seven times faster. To run it
[00:47] locally, you'll want something like this
[00:49] or a beefy desktop GPU. We're talking
[00:52] about 25 gigs of video memory, not
[00:55] something you run on your phone. And to
[00:58] run it in the cloud, I use Lambda. Okay,
[01:01] so how did they do that? Where's the
[01:03] magic sauce? Well, it does five things
[01:06] really well and one thing not so well.
[01:08] Dear fellow scholars, this is Two Minute
[01:10] Papers with Dr. Károly Zsolnai-Fehér.
[01:13] Well, one, memory layers scale linearly
[01:16] with context length instead of
[01:18] quadratically. What does that mean?
[01:21] Well, it means you throw everything you
[01:23] got at it. The more documents you have,
[01:26] the longer video or audio you have, the
[01:28] bigger the advantage this one has. So,
[01:31] if you're running something online that
[01:33] processes those on a mass scale,
[01:36] this is going to be incredible. Two,
[01:39] when audio comes in, this side converts
[01:41] raw audio waves into tokens, but
[01:44] differently than elsewhere. Normally,
[01:47] you have a speech recognition model
[01:49] here. Those are often huge and expensive
[01:52] and strip away all emotion and tone from
[01:55] the input. But this one keeps all these
[01:58] data and still does the job well. So
[02:01] much cheaper than running a whole
[02:03] separate model like Whisper on top.
[02:05] Three, when you give it an image or
[02:08] video, many previous generation
[02:10] techniques smash it into a different
[02:12] aspect ratio. This one keeps it. Then,
[02:15] oh, look at this. Convolutions in 3D.
[02:21] Now we're talking. Many other techniques
[02:23] look at the video frame by frame. It
[02:26] takes tons and tons of computation to
[02:28] finish these videos. Here, the 3D
[02:31] convolution looks at blocks of frames.
[02:34] It looks at a package of frames at the
[02:36] same time, and thus it can compress it a
[02:39] great deal. Faster, cheaper. Four, now
[02:43] that's really interesting, somewhat
[02:45] unexpected. You would expect a huge
[02:48] standalone CLIP model here. These
[02:50] essentially predict what text would
[02:53] match the image well. You need that
[02:55] here, too. But, here's the trick. Not
[02:57] one standalone CLIP model. Nope, this
[03:00] one distills down three models. One for
[03:03] matching images to text, one for fine
[03:06] details, and one for object
[03:08] segmentation. Now, all three of these
[03:10] are smashed down into one small encoder
[03:14] neural network. Once again, super
[03:16] efficient. Five, efficient video
[03:19] sampling. This is a good one. At this
[03:21] point, we have thrown, let's say, a
[03:23] video with 300 images into the neural
[03:26] network. That's still a lot of data, but
[03:28] it turns out not all frames are
[03:31] completely unique. Many of them share
[03:33] the same background, for instance. And
[03:36] this one finally throws away this
[03:38] duplicate information.
[03:40] And it makes it,
[03:42] you guessed it right, even cheaper and
[03:44] more efficient. Okay, scholarly
[03:47] question. So, what is the license
[03:49] attached to it? What I would love to see
[03:52] Apache 2.0, which is highly permissive,
[03:55] and I don't see it here. It has its own
[03:58] license. That's usually not great news,
[04:00] but in this case, it's better than I
[04:02] thought. Derivative works and commercial
[04:05] use is fine. On the other hand, it needs
[04:07] a bit of attribution and is a little
[04:09] stricter on patent grants. If Apache 2.0
[04:13] were a 10 out of 10, this is a seven out
[04:16] of 10, in my opinion. And we don't shy
[04:19] away from talking about limitations
[04:21] here. So, anything else? Oh, yes.
[04:24] If you're doing pure text reasoning or
[04:27] pure coding, I would probably look
[04:29] elsewhere. It is not the number one
[04:32] smartest open model. No. But, if you
[04:35] need multimodal input, like audio or
[04:37] video, processed super fast and super
[04:40] cheap, this is the one.
[04:42] So, we now have free and open AI models
[04:45] that we can own and run them ourselves,
[04:48] which is only going to get more and more
[04:50] important in the future. And since we
[04:53] have so many models, they are starting
[04:56] to specialize. They are becoming good in
[04:58] different directions. So, better models
[05:01] and more value for us fellow scholars,
[05:04] for free.
[05:05] Sign me up for that. Hugely appreciated.
[05:08] What a time to be alive. Here you see me
[05:11] running the full DeepSeek AI model
[05:14] through Lambda GPU Cloud. 671
[05:18] billion parameters, running super fast
[05:21] and super reliably. This is insane. I
[05:25] love it and I use it on a regular basis.
[05:28] Lambda provides you with powerful Nvidia
[05:30] GPUs to run your own chatbots and
[05:33] experiments. Seriously, try it out now
[05:36] at lambda.ai/papers
[05:39] or click the link in the description.
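As a rough illustration of point five in the transcript (efficient video sampling, [03:16] to [03:44]), the Python sketch below drops image patches that did not change since the previous frame. The source does not say how the model detects duplicates, so the patch size, the mean-absolute-difference test, and the tolerance `tol` are all assumptions chosen to keep the example small.

```python
from typing import List, Tuple

Frame = List[List[float]]  # one grayscale frame, indexed as [row][column]


def split_patches(frame: Frame, size: int) -> List[List[float]]:
    """Cut a frame into non-overlapping size x size patches, flattened row by row."""
    patches = []
    for y in range(0, len(frame) - size + 1, size):
        for x in range(0, len(frame[0]) - size + 1, size):
            patches.append([frame[y + dy][x + dx]
                            for dy in range(size) for dx in range(size)])
    return patches


def prune_static_patches(frames: List[Frame], size: int,
                         tol: float) -> List[Tuple[int, int, List[float]]]:
    """Keep a patch only if it changed since the same position in the previous frame.

    Returns (frame_index, patch_index, patch) triples. All patches of the first
    frame are kept; later patches are dropped when their mean absolute
    difference from the previous frame's patch is at most tol.
    """
    kept = []
    previous = None
    for f, frame in enumerate(frames):
        patches = split_patches(frame, size)
        for p, patch in enumerate(patches):
            if previous is not None:
                diff = sum(abs(a - b) for a, b in zip(patch, previous[p])) / len(patch)
                if diff <= tol:
                    continue  # (nearly) unchanged since the previous frame: skip it
            kept.append((f, p, patch))
        previous = patches
    return kept


if __name__ == "__main__":
    background = [[0.0] * 4 for _ in range(4)]
    moved = [row[:] for row in background]
    moved[0][0] = 1.0  # one pixel changes in the top-left patch
    kept = prune_static_patches([background, moved, moved], size=2, tol=0.01)
    print(len(kept), "of", 3 * 4, "patches kept")
```

In the demo clip, only the four patches of the first frame and the one patch that changed in the second frame survive, so the shared background is stored once instead of three times.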