---
title: 'NVIDIA New AI Is An Efficiency Monster'
source: 'https://youtube.com/watch?v=4wC8hnQawiA'
video_id: '4wC8hnQawiA'
date: 2026-06-28
duration_sec: 0
---

# NVIDIA New AI Is An Efficiency Monster

> Source: [NVIDIA New AI Is An Efficiency Monster](https://youtube.com/watch?v=4wC8hnQawiA)

## Summary

A new open, free 30-billion-parameter AI model processes images, video, and audio with a focus on throughput and cost efficiency. It processes video at nearly 10x real time, handles documents up to 7x faster, and is optimized for low-cost multimodal inference.
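
To make the throughput figure concrete, here is a minimal sketch that converts a real-time factor into wall-clock processing time. The function name `wall_clock_hours` and the rounded factor of 10 are assumptions for illustration. The video says "almost 10 hours of video per hour", so real figures would be slightly worse.

```python
def wall_clock_hours(video_hours: float, realtime_factor: float) -> float:
    """Hours of compute needed to process `video_hours` of footage."""
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return video_hours / realtime_factor


# Assumption: "nearly 10x real time" is rounded to exactly 10 here.
ASSUMED_FACTOR = 10.0

one_day_of_footage = wall_clock_hours(24, ASSUMED_FACTOR)
large_archive = wall_clock_hours(1000, ASSUMED_FACTOR)

print(one_day_of_footage, large_archive)
```

Under that rounded assumption, a day of footage takes about 2.4 hours and a 1,000-hour archive about 100 hours on the same hardware.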

### Key Points

- **30 Billion Parameter Open AI Model** [0:00] — A new free AI model handling images, video, and audio, with 30 billion parameters.
- **Key Advantage: Throughput and Cost Efficiency** [0:10] — Compared to models like Gemma 4, this model excels in throughput and cost efficiency.
- **Video Processing Speed: ~10x Real-Time** [0:28] — Processes almost 10 hours of video per hour, nearly 10x real time and almost 3x faster than Qwen 3 Omni.
- **Document Processing: Up to 7x Faster** [0:44] — When processing documents, the model is up to 7 times faster.
- **Hardware Requirements: 25 GB VRAM** [0:52] — Running it locally needs a beefy desktop GPU with about 25 GB of video memory. To run it in the cloud, the presenter uses Lambda.
- **Innovation 1: Linear-Scaling Mamba Layers** [1:13] — Mamba layers scale linearly with context length instead of quadratically, a large advantage for long documents and videos.
- **Innovation 2: Raw Audio to Tokens Preserving Emotion** [1:39] — Converts raw audio waves into tokens without stripping emotion/tone, cheaper than separate models like Whisper.
- **Innovation 3: 3D Convolutions on Video** [2:08] — Processes blocks of frames simultaneously (3D convolution) instead of frame-by-frame, compressing computation.
- **Innovation 4: Three Models Distilled Into One Encoder** [2:43] — Instead of one large standalone CLIP model, three models (image-text matching, fine details, object segmentation) are distilled into one small encoder network.
- **Innovation 5: Efficient Video Sampling** [3:19] — Removes duplicate information between frames (e.g., same background) to reduce data and computation.
- **License: Custom, 7/10 vs Apache 2.0** [3:47] — Custom license allows commercial use and derivative works with attribution, stricter on patents. Not Apache 2.0.
- **Limitations: Not #1 for Text/Code** [4:24] — For pure text reasoning or coding, look elsewhere. Excels in multimodal input (audio, video) at speed and cost.
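
The linear-versus-quadratic point in Innovation 1 can be illustrated with a rough cost model. This is a sketch under stated assumptions, not the model's actual cost: the unit costs, the function names, and the `state_size` of 128 are all hypothetical and chosen only to show how the gap widens with context length.

```python
def quadratic_cost(n_tokens: int) -> int:
    """Every token interacts with every other token: grows as n^2."""
    return n_tokens * n_tokens


def linear_cost(n_tokens: int, state_size: int = 128) -> int:
    """One fixed-size state update per token: grows as n.

    `state_size` is a hypothetical constant, not a published figure.
    """
    return n_tokens * state_size


def cost_ratio(n_tokens: int, state_size: int = 128) -> float:
    """How many times more work the quadratic layer does."""
    return quadratic_cost(n_tokens) / linear_cost(n_tokens, state_size)


for n in (1_024, 16_384, 131_072):
    print(f"{n:>7} tokens: quadratic is {cost_ratio(n):.0f}x the linear cost")
```

In this toy model the ratio is simply `n_tokens / state_size`, so the longer the document, video, or audio, the larger the advantage of the linear layer, which matches the claim in the video.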

### Conclusion

This new open AI model stands out for multimodal efficiency: it processes video, audio, and documents at high speed and low cost, though it is not the strongest choice for pure text reasoning or coding.
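
The efficient video sampling idea from the key points (dropping information that repeats between frames) can be sketched as follows. This is a simplified illustration, not NVIDIA's actual algorithm: each frame is assumed to be a list of patch brightness values, and the function name `prune_static_patches` and the `threshold` parameter are hypothetical.

```python
def prune_static_patches(frames, threshold=10):
    """Keep only patches that changed noticeably since they were last kept.

    frames: list of frames, each a list of integer patch values (0-255).
    Returns a list of (frame_index, patch_index, value) tuples.
    """
    kept = []
    reference = None  # last kept value for each patch position
    for t, frame in enumerate(frames):
        if reference is None:
            # The first frame is kept in full.
            kept.extend((t, p, v) for p, v in enumerate(frame))
            reference = list(frame)
            continue
        for p, value in enumerate(frame):
            if abs(value - reference[p]) > threshold:
                kept.append((t, p, value))
                reference[p] = value
    return kept


# Three tiny frames: patch 0 is a static background.
frames = [
    [10, 128, 230],
    [10, 128, 50],
    [12, 150, 50],
]
kept = prune_static_patches(frames)
print(f"kept {len(kept)} of {sum(len(f) for f in frames)} patches")
```

Here 5 of 9 patches survive, so the downstream network sees less data whenever much of the scene stays the same between frames.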

## Transcript

Hmm, 30 billion parameters in a new open
free AI model where images, video, and
audio all work. Hmm,
why?
There are a bunch of other free systems
around in this area like the amazing
Gemma 4. So, what does this do better
than those? Two words, throughput and
cost efficiency. Okay, what does that
mean in practice? Now, hold on to your
papers, fellow scholars, because it
processes almost 10 hours of video per
hour. Whoo, that is nearly 10 times real
time. That is insanely quick. Wow,
almost three times faster than Qwen 3
Omni. And when processing documents, it
gets up to seven times faster. To run it
locally, you'll want something like this
or a beefy desktop GPU. We're talking
about 25 gigs of video memory, not
something you run on your phone. And to
run it in the cloud, I use Lambda. Okay,
so how did they do that? Where's the
magic sauce? Well, it does five things
really well and one thing not so well.
Dear fellow scholars, this is Two Minute
Papers with Dr. Károly Zsolnai-Fehér.
Well, one, Mamba layers scale linearly
with context length instead of
quadratically. What does that mean?
Well, it means you throw everything you
got at it. The more documents you have,
the longer video or audio you have, the
bigger the advantage this one has. So,
if you're running something online that
processes those on a mass scale,
this is going to be incredible. Two,
when audio comes in, this side converts
raw audio waves into tokens, but
differently than elsewhere. Normally,
you have a speech recognition model
here. Those are often huge and expensive
and strip away all emotion and tone from
the input. But this one keeps all these
data and still does the job well. So
much cheaper than running a whole
separate model like Whisper on top.
Three, when you give it an image or
video, many previous generation
techniques smash it into a different
aspect ratio. This one keeps it. Then,
oh, look at this. Convolutions in 3D.
Now we're talking. Many other techniques
look at the video frame by frame. It
takes tons and tons of computation to
finish these videos. Here, the 3D
convolution looks at blocks of frames.
It looks at a package of frames at the
same time, and thus it can compress it a
great deal. Faster, cheaper. Four, now
that's really interesting, somewhat
unexpected. You would expect a huge
standalone CLIP model here. These
essentially predict what text would
match the image well. You need that
here, too. But, here's the trick. Not
one standalone CLIP model. Nope, this
one distills down three models. One for
matching images to text, one for fine
details, and one for object
segmentation. Now, all three of these
are smashed down into one small encoder
neural network. Once again, super
efficient. Five, efficient video
sampling. This is a good one. At this
point, we have thrown, let's say, a
video with 300 images into the neural
network. That's still a lot of data, but
it turns out not all frames are
completely unique. Many of them share
the same background, for instance. And
this one finally throws away this
duplicate information.
And it makes it,
you guessed it right, even cheaper and
more efficient. Okay, scholarly
question. So, what is the license
attached to it? What I would love to see is
Apache 2.0, which is highly permissive,
and I don't see it here. It has its own
license. That's usually not great news,
but in this case, it's better than I
thought. Derivative works and commercial
use are fine. On the other hand, it needs
a bit of attribution and is a little
stricter on patent grants. If Apache 2.0
were a 10 out of 10, this is a seven out
of 10, in my opinion. And we don't shy
away from talking about limitations
here. So, anything else? Oh, yes.
If you're doing pure text reasoning or
pure coding, I would probably look
elsewhere. It is not the number one
smartest open model. No. But, if you
need multimodal input, like audio or
video, processed super fast and super
cheap, this is the one.
So, we now have free and open AI models
that we can own and run ourselves,
which is only going to get more and more
important in the future. And since we
have so many models, they are starting
to specialize. They are becoming good in
different directions. So, better models
and more value for us fellow scholars,
for free.
Sign me up for that. Hugely appreciated.
What a time to be alive.