
NVIDIA's New AI Is An Efficiency Monster

Transcribed Jun 28, 2026
Intermediate · 3 min read · For: AI researchers, ML engineers, and developers interested in efficient multimodal model deployment.
54.9K views · 1.9K likes · 120 comments · 16 dislikes · 3.7% 📈 Moderate

AI Summary

A new free, open AI model with 30 billion parameters handles images, video, and audio, and stands out for throughput and cost efficiency. It processes video at nearly 10x real time, processes documents up to 7x faster, and is built for low-cost multimodal inference.

[0:00]
30 Billion Parameter Open AI Model

A new free AI model handling images, video, and audio, with 30 billion parameters.

[0:10]
Key Advantage: Throughput and Cost Efficiency

Compared to models like Gemma 4, this model excels in throughput and cost efficiency.

[0:28]
Video Processing Speed: ~10x Real-Time

Processes almost 10 hours of video per hour, which is nearly 10x real time and almost 3x faster than Qwen3-Omni.
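As a rough back-of-envelope illustration of what these figures mean for a batch job, the sketch below converts a throughput multiple into wall-clock time. The function name and the 100-hour archive are hypothetical; the 10x and roughly 3x figures are the rounded numbers quoted in the video.

```python
def processing_hours(video_hours: float, realtime_multiple: float) -> float:
    """Wall-clock hours needed to process `video_hours` of footage."""
    if realtime_multiple <= 0:
        raise ValueError("realtime_multiple must be positive")
    return video_hours / realtime_multiple


archive_hours = 100.0  # hypothetical archive size
this_model = processing_hours(archive_hours, 10.0)      # ~10x real time
baseline = processing_hours(archive_hours, 10.0 / 3.0)  # a model ~3x slower

print(f"this model: {this_model:.1f} h, 3x slower baseline: {baseline:.1f} h")
```

Under these assumptions the same archive takes about 10 hours instead of about 30, which is where the cost advantage for large-scale processing comes from.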

[0:44]
Document Processing: Up to 7x Faster

When processing documents, the speedup rises to as much as 7x.

[0:52]
Hardware Requirements: 25 GB VRAM

To run it locally, you need a beefy desktop GPU with about 25 GB of video memory. In the cloud, the presenter runs it on Lambda.
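To put the 25 GB figure in context, the sketch below estimates the memory needed for the weights alone at a few common precisions. The video does not say which precision the 25 GB figure assumes, so the bit widths here are illustrative, and the estimate ignores activations and any cache.

```python
def weight_memory_gb(params: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


PARAMS = 30e9  # parameter count quoted in the video

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(PARAMS, bits):.0f} GB")

# Average bits per weight if the weights alone filled 25 GB (an assumption).
implied_bits = 25e9 * 8 / PARAMS
print(f"25 GB corresponds to about {implied_bits:.1f} bits per weight")
```

Read this only as arithmetic: 30 billion weights at 16 bits would need about 60 GB, so a 25 GB footprint implies a more compact representation on average.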

[1:13]
Innovation 1: Linear-Scaling Mamba Layers

Mamba layers scale linearly with context length instead of quadratically, so the advantage grows with longer documents, videos, and audio.
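The linear versus quadratic claim can be made concrete with a toy cost model. This is a minimal sketch, not the model's actual layers: the costs are unitless operation counts, and `state_size` is a made-up constant chosen only for illustration.

```python
def quadratic_cost(tokens: int) -> int:
    """Pairwise interactions: every token is compared with every token."""
    return tokens * tokens


def linear_cost(tokens: int, state_size: int = 16) -> int:
    """One fixed-size state update per token (state_size is illustrative)."""
    return tokens * state_size


for n in (1_000, 10_000, 100_000):
    ratio = quadratic_cost(n) / linear_cost(n)
    print(f"{n:>7} tokens: quadratic/linear cost ratio = {ratio:,.1f}")
```

Doubling the context doubles the linear cost but quadruples the quadratic one, so the gap widens as inputs get longer.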

[1:39]
Innovation 2: Raw Audio to Tokens Preserving Emotion

Converts raw audio waves into tokens while keeping emotion and tone, and is cheaper than running a separate speech recognition model such as Whisper.

[2:08]
Innovation 3: 3D Convolutions on Video

Processes blocks of frames simultaneously (3D convolution) instead of frame-by-frame, compressing computation.
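To show how a 3D convolution can compress a block of frames in one step, here is a minimal pure-Python sketch. The kernel size, stride, and averaging weights are assumptions for illustration; the video does not give the model's actual settings.

```python
def conv3d_blocks(video, kernel, stride):
    """Valid 3D convolution (cross-correlation) over a [T][H][W] nested list."""
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    st, sh, sw = stride
    T, H, W = len(video), len(video[0]), len(video[0][0])
    out = []
    for t in range(0, T - kt + 1, st):
        plane = []
        for y in range(0, H - kh + 1, sh):
            row = []
            for x in range(0, W - kw + 1, sw):
                acc = 0.0
                # Sum over a block spanning several frames at once.
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += kernel[dt][dy][dx] * video[t + dt][y + dy][x + dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


# 4 frames of 4x4 pixels; each pixel holds its frame index so results are easy to check.
video = [[[float(t)] * 4 for _ in range(4)] for t in range(4)]
# Averaging kernel spanning 2 frames x 2 x 2 pixels (weights sum to 1).
kernel = [[[1 / 8] * 2 for _ in range(2)] for _ in range(2)]

out = conv3d_blocks(video, kernel, stride=(2, 2, 2))
print(len(out), len(out[0]), len(out[0][0]))  # → 2 2 2
print(out[0][0][0], out[1][0][0])             # → 0.5 2.5
```

In this toy case 64 input values become 8 outputs, because each output summarizes a block of two frames rather than a single frame.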

[2:43]
Innovation 4: Three Vision Models Distilled Into One Encoder

Three models (image-to-text, fine details, object segmentation) distilled into one small encoder network.
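One common way to distill several teachers into one student is to add up a matching loss per teacher. The sketch below is a toy version under that assumption; the video only says three models are distilled into one encoder, so the loss form, the per-teacher heads, and all names here are hypothetical.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def linear(vec, weights):
    """Apply a weight matrix (list of rows) to a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def distillation_loss(student_features, heads, teacher_outputs):
    """Sum of per-teacher MSE between projected student features and teacher outputs."""
    return sum(
        mse(linear(student_features, heads[name]), teacher_outputs[name])
        for name in teacher_outputs
    )


student_features = [1.0, 2.0]  # output of the single small encoder (toy values)
heads = {  # one hypothetical projection head per teacher
    "image_text": [[1.0, 0.0], [0.0, 1.0]],
    "fine_detail": [[0.5, 0.5]],
    "segmentation": [[1.0, -1.0]],
}
teacher_outputs = {
    "image_text": [1.0, 2.0],  # matched exactly
    "fine_detail": [1.0],      # student head gives 1.5
    "segmentation": [-1.0],    # matched exactly
}
print(distillation_loss(student_features, heads, teacher_outputs))  # → 0.25
```

Training would adjust the shared encoder and the heads to push this combined loss down, so that one small network reproduces all three teachers at once.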

[3:19]
Innovation 5: Efficient Video Sampling

Removes duplicate information between frames (e.g., same background) to reduce data and computation.
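The idea of dropping duplicate information between frames can be sketched as follows. This is a minimal illustration, not the model's method: the patch layout, the comparison against the previous frame, and the threshold are all assumptions.

```python
def select_patches(frames, threshold=0.05):
    """Return (frame_index, patch_index) pairs worth keeping.

    frames: list of frames, each a list of patches, each a list of floats.
    A patch is dropped when it barely differs from the same patch in the
    previous frame (mean absolute difference <= threshold).
    """
    kept = [(0, p) for p in range(len(frames[0]))]  # always keep the first frame
    for f in range(1, len(frames)):
        for p, patch in enumerate(frames[f]):
            prev = frames[f - 1][p]
            diff = sum(abs(a - b) for a, b in zip(patch, prev)) / len(patch)
            if diff > threshold:
                kept.append((f, p))
    return kept


background = [0.2, 0.2, 0.2, 0.2]
frames = [
    [background, background, [0.9, 0.9, 0.1, 0.1]],
    [background, background, [0.1, 0.9, 0.9, 0.1]],  # only patch 2 changed
    [background, background, [0.1, 0.9, 0.9, 0.1]],  # nothing changed
]
kept = select_patches(frames)
total = sum(len(f) for f in frames)
print(f"kept {len(kept)} of {total} patches: {kept}")
```

Here the static background patches are sent once and only the moving patch is sent again, so 4 of 9 patches reach the rest of the network.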

[3:47]
License: Custom, 7/10 vs Apache 2.0

Custom license allows commercial use and derivative works with attribution, stricter on patents. Not Apache 2.0.

[4:24]
Limitations: Not #1 for Text/Code

For pure text reasoning or coding, look elsewhere. Excels in multimodal input (audio, video) at speed and cost.

This new open AI model is built for multimodal efficiency: it processes video, audio, and documents at high speed and low cost, though it is not the strongest choice for pure text tasks.

Clickbait Check

95% Legit

"Title is accurate—the model excels in throughput and cost efficiency for multimodal data, though it's not the smartest for text/code."


Study Flashcards (8)

What is the key advantage of this new AI model compared to others like Gemma 4?

Difficulty: easy

Throughput and cost efficiency.

0:10

How many times faster is video processing compared to real-time?

Difficulty: easy

Nearly 10 times real-time (10 hours of video per hour).

0:28

How do the model's Mamba layers scale with context length?

Difficulty: medium

Mamba layers scale linearly with context length instead of quadratically.

1:13

How is raw audio processed differently?

Difficulty: medium

Raw audio waves are converted directly into tokens, preserving emotion and tone, without needing a separate speech recognition model like Whisper.

1:39

What technique does the model use for video processing that compresses computation?

Difficulty: hard

3D convolutions that look at blocks of frames simultaneously instead of frame-by-frame.

2:21

How does the architecture replace a standalone CLIP model?

Difficulty: hard

Three models (image-to-text matching, fine details, object segmentation) are distilled into one small encoder network.

2:57

What is the license rating compared to Apache 2.0 (10/10)?

Difficulty: medium

7 out of 10. Allows commercial use and derivative works with attribution, but stricter on patents.

4:05

When should you NOT use this model?

Difficulty: easy

For pure text reasoning or pure coding, where it is not the smartest open model.

4:24

💡 Key Takeaways

📊

10x Real-Time Video Processing

Shows a dramatic speed improvement (almost 10 hours per hour) for video analysis, a key differentiator.

0:28
🔧

Linear Memory Scaling

Explains the underlying breakthrough—linear vs. quadratic scaling—that enables handling large contexts cheaply.

1:13
🔧

3D Convolutions for Video Speed

Demonstrates an elegant, efficient approach to video processing that compresses computation significantly.

2:21
🔧

Efficient Video Sampling Removes Duplicates

Highlights a practical optimization: discarding redundant inter-frame data reduces cost and increases speed.

3:40
📊

Multimodal Excellence, Not Best for Text

Clarifies the model's specific niche—multimodal speed and cost—and its limitations for pure text tasks.

4:24

✂️ Creator Tools: Viral Hooks

AI-generated clip ideas for Shorts based on the transcript

10 Hours of Video Processed Per Hour!

35s

The insane speed improvement (10x real-time) creates a jaw-dropping 'wow' moment that viewers love to share.


Linear Scaling Crushes Quadratic

31s

Explaining a key technical advantage (linear vs quadratic complexity) makes viewers feel smarter and sparks curiosity.


Three Models Compressed Into One

33s

The trick of distilling three models into one is a clever efficiency hack that tech enthusiasts find mind-blowing.


License Scored 7/10 - Not Too Bad

36s

Rating the license and comparing to Apache 2.0 provides a balanced, honest take that stands out in AI hype.


Specialized Models: The Future

37s

The idea that open AI models are specializing into niches predicts an exciting future, encouraging engagement.


[00:00] Hmm, 30 billion parameters in a new open

[00:04] free AI model where images, video, and

[00:07] audio all work. Hmm, [clears throat]

[00:09] why?

[00:10] There are a bunch of other free systems

[00:12] around in this area like the amazing

[00:15] Gemma 4. So, what does this do better

[00:18] than those? Two words, throughput and

[00:21] cost efficiency. Okay, what does that

[00:23] mean in practice? Now, hold on to your

[00:26] papers, fellow scholars, because it

[00:28] processes almost 10 hours of video per

[00:31] hour. Whoo, that is nearly 10 times real

[00:35] time. That is insanely quick. Wow,

[00:38] almost three times faster than Qwen 3

[00:41] Omni. And when processing documents, it

[00:44] gets up to seven times faster. To run it

[00:47] locally, you'll want something like this

[00:49] or a beefy desktop GPU. We're talking

[00:52] about 25 gigs of video memory, not

[00:55] something you run on your phone. And to

[00:58] run it in the cloud, I use Lambda. Okay,

[01:01] so how did they do that? Where's the

[01:03] magic sauce? Well, it does five things

[01:06] really well and one thing not so well.

[01:08] Dear fellow scholars, this is Two Minute

[01:10] Papers with Dr. Károly Zsolnai-Fehér.

[01:13] Well, one, Mamba layers scale linearly

[01:16] with context length instead of

[01:18] quadratically. What does that mean?

[01:21] Well, it means you throw everything you

[01:23] got at it. The more documents you have,

[01:26] the longer video or audio you have, the

[01:28] bigger the advantage this one has. So,

[01:31] if you're running something online that

[01:33] processes those on a mass scale,

[01:36] this is going to be incredible. Two,

[01:39] when audio comes in, this side converts

[01:41] raw audio waves into tokens, but

[01:44] differently than elsewhere. Normally,

[01:47] you have a speech recognition model

[01:49] here. Those are often huge and expensive

[01:52] and strip away all emotion and tone from

[01:55] the input. But this one keeps all these

[01:58] data and still does the job well. So

[02:01] much cheaper than running a whole

[02:03] separate model like Whisper on top.

[02:05] Three, when you give it an image or

[02:08] video, many previous generation

[02:10] techniques smash it into a different

[02:12] aspect ratio. This one keeps it. Then,

[02:15] oh, look at this. Convolutions in 3D.

[02:21] Now we're talking. Many other techniques

[02:23] look at the video frame by frame. It

[02:26] takes tons and tons of computation to

[02:28] finish these videos. Here, the 3D

[02:31] convolution looks at blocks of frames.

[02:34] It looks at a package of frames at the

[02:36] same time, and thus it can compress it a

[02:39] great deal. Faster, cheaper. Four, now

[02:43] that's really interesting, somewhat

[02:45] unexpected. You would expect a huge

[02:48] standalone CLIP model here. These

[02:50] essentially predict what text would

[02:53] match the image well. You need that

[02:55] here, too. But, here's the trick. Not

[02:57] one standalone CLIP model. Nope, this

[03:00] one distills down three models. One for

[03:03] matching images to text, one for fine

[03:06] details, and one for object

[03:08] segmentation. Now, all three of these

[03:10] are smashed down into one small encoder

[03:14] neural network. Once again, super

[03:16] efficient. Five, efficient video

[03:19] sampling. This is a good one. At this

[03:21] point, we have thrown, let's say, a

[03:23] video with 300 images into the neural

[03:26] network. That's still a lot of data, but

[03:28] it turns out not all frames are

[03:31] completely unique. Many of them share

[03:33] the same background, for instance. And

[03:36] this one finally throws away this

[03:38] duplicate information.

[03:40] And it makes it,

[03:42] you guessed it right, even cheaper and

[03:44] more efficient. Okay, scholarly

[03:47] question. So, what is the license

[03:49] attached to it? What I would love to see

[03:52] Apache 2.0, which is highly permissive,

[03:55] and I don't see it here. It has its own

[03:58] license. That's usually not great news,

[04:00] but in this case, it's better than I

[04:02] thought. Derivative works and commercial

[04:05] use is fine. On the other hand, it needs

[04:07] a bit of attribution and is a little

[04:09] stricter on patent grants. If Apache 2.0

[04:13] were a 10 out of 10, this is a seven out

[04:16] of 10, in my opinion. And we don't shy

[04:19] away from talking about limitations

[04:21] here. So, anything else? Oh, yes.

[04:24] If you're doing pure text reasoning or

[04:27] pure coding, I would probably look

[04:29] elsewhere. It is not the number one

[04:32] smartest open model. No. But, if you

[04:35] need multimodal input, like audio or

[04:37] video, processed super fast and super

[04:40] cheap, this is the one.

[04:42] So, we now have free and open AI models

[04:45] that we can own and run them ourselves,

[04:48] which is only going to get more and more

[04:50] important in the future. And since we

[04:53] have so many models, they are starting

[04:56] to specialize. They are becoming good in

[04:58] different directions. So, better models

[05:01] and more value for us fellow scholars,

[05:04] for free.

[05:05] Sign me up for that. Hugely appreciated.

[05:08] What a time to be alive. Here you see me

[05:11] running the full DeepSeek AI model

[05:14] through Lambda GPU Cloud. 671

[05:18] billion parameters, running super fast

[05:21] and super reliably. This is insane. I

[05:25] love it and I use it on a regular basis.

[05:28] Lambda provides you with powerful Nvidia

[05:30] GPUs to run your own chatbots and

[05:33] experiments. Seriously, try it out now

[05:36] at lambda.ai/papers

[05:39] or click the link in the description.
