What's inside an AI model file?
44s clip: Reveals the surprising truth that AI models are just static lists of numbers, sparking curiosity.
The video explains what the 'B' in AI model names like '8B' or '70B' means: it refers to the number of parameters, which are floating-point numbers that store the model's learned knowledge. Parameters act as knobs adjusted during training to capture language patterns, and their count impacts capability and hardware requirements.
An AI model file contains a static list of floating-point numbers called parameters. '8B' means 8 billion such numbers.
Most parameters are weights (resistors controlling signal flow) and biases (baseline offsets). Initially random, they are tuned during training.
Training involves predicting next words, generating error signals, and using backpropagation to adjust all parameters trillions of times.
Frozen parameters are arranged in layers: embedding (text to vectors), attention (context mapping), and feed-forward network (final prediction).
More parameters (e.g., 70B) capture more patterns but require more memory. 8B fits on one GPU; 70B needs server arrays.
Mixture of Experts (MoE) activates only a fraction of parameters per token (e.g., 37B of 600B), saving compute while maintaining capability.
Training data quality and architecture efficiency often matter more than raw parameter count for real-world performance.
AI models are essentially massive lists of calibrated numbers that execute matrix multiplications to predict text. The 'B' indicates parameter count, which influences capability and hardware needs, but quality depends on training and design.
"The title accurately promises an explanation of '8B' and '70B', and the video delivers a clear, detailed guide."
What does the 'B' in '8B' or '70B' stand for?
Billions of parameters (floating-point numbers).
What are the two main types of parameters in an AI model?
Weights (control signal strength) and biases (provide baseline offset).
00:47
How are parameters initially set before training?
To small random numbers.
01:06
What process adjusts parameters during training?
Backpropagation, which traces error backward through connections to nudge parameters.
01:30
What are the three main layers of a transformer architecture?
Embedding layer, attention mechanism, and feed-forward network.
01:58
What is the advantage of a 70B parameter model over an 8B model?
It can absorb more obscure patterns and complex linguistic structures.
02:24
What hardware limitation does a 70B parameter model face?
It requires massive server arrays because it doesn't fit on a single high-end GPU.
02:40
How does Mixture of Experts (MoE) save compute?
By activating only a fraction of total parameters per token (e.g., 37B out of 600B).
02:50
Why is parameter count a poor metric for model quality?
Training data quality and architectural efficiency often matter more.
03:12
Parameters are floating-point numbers
Reveals that AI models are simply lists of decimals, not code or encyclopedias.
Training via prediction and backpropagation
Explains the core learning mechanism in a simple, intuitive way.
01:16
Mixture of Experts as a workaround
Shows a practical solution to hardware constraints, relevant for modern AI scaling.
02:50
Parameter count isn't everything
Important nuance that counters the 'bigger is better' mindset.
03:12
[00:00] You click download on a new AI model,
[00:03] maybe it's labeled Qwen 8B or Llama 32B,
[00:08] and watch dozens of gigabytes stream
[00:10] onto your hard drive. That massive file
[00:13] contains no encyclopedias of human
[00:15] knowledge and no lines of logical
[00:17] programming code. If you open up that
[00:20] file, you will find a single static list
[00:23] of floating-point numbers. The letter B
[00:26] tells you exactly how many of those
[00:28] numbers are in the file. In machine
[00:30] learning, we call each of these
[00:32] individual values a parameter. This
[00:34] raises a difficult question. How does a
[00:37] massive unmoving list of decimals
[00:39] possess the ability to write Python code
[00:42] or reason through problems it has never
[00:44] seen before? Think of a parameter as a
[00:47] physical control knob that adjusts the
[00:49] flow of a signal. On its own, one knob
[00:52] does almost nothing. Most of these are
[00:54] weights which act like resistors to
[00:56] decide how strongly one piece of data
[00:58] influences another. The rest are biases
[01:01] which provide a baseline offset for the
[01:04] system's calculations. When an AI model
[01:06] is first created, all 8 billion
[01:09] parameters are set to small random
[01:11] numbers. At this stage, any prompt you
[01:14] give it results in complete gibberish.
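To make the knob picture concrete, here is a minimal sketch of one tiny layer, assuming a plain weighted-sum-plus-offset design. The helper names (`init_layer`, `forward`, `count_parameters`), the layer sizes, and the 0.02 scale are illustrative assumptions, not details from the video.

```python
import random

def init_layer(n_inputs, n_outputs, scale=0.02, seed=0):
    """Fill a layer with small random numbers, as an untrained model would be."""
    rng = random.Random(seed)
    # One weight per (output, input) pair: how strongly each input influences each output.
    weights = [[rng.gauss(0.0, scale) for _ in range(n_inputs)]
               for _ in range(n_outputs)]
    # One bias per output: the baseline offset added to the weighted sum.
    biases = [rng.gauss(0.0, scale) for _ in range(n_outputs)]
    return weights, biases

def forward(weights, biases, inputs):
    """Each output is a weighted sum of the inputs plus its bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def count_parameters(weights, biases):
    """The 'B' in a model name is this count, just far larger."""
    return sum(len(row) for row in weights) + len(biases)

weights, biases = init_layer(n_inputs=4, n_outputs=3)
outputs = forward(weights, biases, [1.0, 0.5, -0.5, 2.0])
print(count_parameters(weights, biases))  # 12 weights + 3 biases = 15
print(outputs)  # meaningless values until the parameters are trained
```

With random parameters the outputs carry no meaning, which is the "gibberish" stage described above.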
[01:16] Training begins by feeding the model an
[01:18] incomplete sentence and forcing it to
[01:21] use those random numbers to predict the
[01:23] next word. When it predicts "table"
[01:25] instead of "mat", the system generates an
[01:28] error signal. Backpropagation traces
[01:30] that error backward through every
[01:32] connection, calculating how to nudge
[01:34] those 8 billion knobs to make the
[01:36] prediction more accurate next time.
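The correction loop just described can be sketched on a toy scale. The example below assumes a single layer, a three-word vocabulary, and a hand-picked context vector, all of which are hypothetical stand-ins. With only one layer the gradient can be written out directly; backpropagation in a real model applies the same chain rule through many stacked layers.

```python
import math

VOCAB = ["mat", "table", "dog"]  # hypothetical toy vocabulary

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(weights, biases, x):
    logits = [sum(w * xi for w, xi in zip(row, x)) + b
              for row, b in zip(weights, biases)]
    return softmax(logits)

def top_word(weights, biases, x):
    probs = predict(weights, biases, x)
    return VOCAB[probs.index(max(probs))]

def train_step(weights, biases, x, target_index, lr=0.5):
    """One correction: measure the error, then nudge every parameter against it."""
    probs = predict(weights, biases, x)
    loss = -math.log(probs[target_index])  # cross-entropy error signal
    for i in range(len(weights)):
        # Gradient of the loss with respect to logit i: predicted minus desired.
        error = probs[i] - (1.0 if i == target_index else 0.0)
        for j in range(len(x)):
            weights[i][j] -= lr * error * x[j]
        biases[i] -= lr * error
    return loss

# Small starting values that happen to favor the wrong word.
weights = [[0.01, -0.02], [0.03, 0.01], [-0.01, 0.02]]
biases = [0.0, 0.0, 0.0]
context = [1.0, 0.5]  # stand-in for an encoded incomplete sentence
target = VOCAB.index("mat")

before = top_word(weights, biases, context)
losses = [train_step(weights, biases, context, target) for _ in range(50)]
after = top_word(weights, biases, context)
print(before, "->", after)
print(losses[0], losses[-1])  # the error shrinks as the parameters are nudged
```

A real training run repeats this step over enormous numbers of examples rather than one sentence fifty times.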
[01:39] After trillions of corrections over
[01:41] massive data sets, these numbers settle
[01:43] into a highly tuned pattern that
[01:45] captures the structures of language.
[01:47] Once training is finished, the
[01:49] parameters are frozen. The intelligence
[01:51] of the model is a record of those
[01:53] trillions of past corrections. These
[01:55] frozen parameters occupy the transformer
[01:58] architecture. At the bottom, the
[02:00] embedding layer translates human text
[02:02] into mathematical vectors. Next, the
[02:04] attention mechanism maps out context,
[02:07] identifying relationships between words.
[02:10] Finally, the massive feed-forward
[02:12] network transforms these representations
[02:14] into answers. It is a structured
[02:16] assembly line designed to convert
[02:18] language into math, weigh the context,
[02:21] and compute a statistical prediction. A
[02:24] 70 billion parameter model can absorb
[02:26] more obscure patterns and complex
[02:28] linguistic structures than an 8 billion
[02:30] parameter model. However, every
[02:32] parameter occupies physical space on
[02:35] hardware memory. Each number must be
[02:37] stored and processed on chips like the
[02:39] GH100.
[02:40] While an 8 billion parameter model fits
[02:42] on a single high-end graphics card, a 70
[02:45] billion parameter model requires these
[02:47] massive server arrays just to function.
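The memory claim comes down to simple arithmetic: parameter count times bytes per parameter. The sketch below assumes common storage formats (4 bytes for 32-bit floats, 2 bytes for 16-bit floats, fewer for quantized formats). The video does not state which precision it has in mind, so the format table and the helper name `weight_memory_gb` are assumptions.

```python
# Bytes needed to store one parameter in some common formats (assumed, not from the video).
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(n_params, dtype="fp16"):
    """Memory for the weights alone, in gigabytes (10**9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

for name, n_params in [("8B", 8e9), ("70B", 70e9)]:
    for dtype in ("fp16", "int4"):
        print(f"{name} at {dtype}: {weight_memory_gb(n_params, dtype):.0f} GB")
```

At 16 bits per parameter this gives roughly 16 GB for an 8B model and 140 GB for a 70B model, more than the 80 GB available on a single H100-class card. The figures cover weights only; activations and the attention cache add to the total, while lower-precision formats shrink it.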
[02:50] The mixture of experts architecture
[02:52] serves as a modern workaround for these
[02:54] hardware constraints. By activating only
[02:56] a fraction of its total parameters, for
[02:58] example, 37 billion out of 600 billion
[03:01] per word generated, the model saves
[03:04] significant compute time. Scaling AI
[03:06] capability has become a direct
[03:08] confrontation with the physical
[03:10] limitations of hardware memory.
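The mixture-of-experts idea can be illustrated with a minimal routing sketch. It assumes a top-k router that scores every expert and runs only the best few, which is one common design and not necessarily the one used by any particular model. The experts here are trivial scaling functions, and all names and numbers are hypothetical.

```python
import math

def make_expert(scale):
    """A stand-in expert: a real one would be a full feed-forward network."""
    def expert(x):
        return [scale * xi for xi in x]
    return expert

def top_k_indices(scores, k):
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def moe_forward(x, router_weights, experts, k=2):
    """Score every expert, but run only the k highest-scoring ones."""
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    chosen = top_k_indices(scores, k)
    # Softmax over the chosen scores gives each selected expert a mixing weight.
    m = max(scores[i] for i in chosen)
    exps = {i: math.exp(scores[i] - m) for i in chosen}
    total = sum(exps.values())
    output = [0.0] * len(x)
    for i in chosen:
        gate = exps[i] / total
        expert_out = experts[i](x)  # unchosen experts are never evaluated
        output = [o + gate * y for o, y in zip(output, expert_out)]
    return output, chosen

experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[0.1, 0.0], [0.9, 0.0], [0.5, 0.0], [0.2, 0.0]]
output, chosen = moe_forward([1.0, 0.0], router_weights, experts, k=2)
print(chosen)  # [1, 2]: only two of the four experts ran

# The ratio quoted in the video: 37 billion active out of 600 billion total.
active_fraction = 37e9 / 600e9
print(f"{active_fraction:.1%} of parameters active per token")
```

Note that the saving is in compute per token. All experts still have to be stored, so the full parameter count continues to determine the memory footprint.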
[03:12] Parameter count alone is a poor metric
[03:14] for quality. Training data and
[03:16] architectural efficiency often determine
[03:18] which model actually performs best in
[03:21] the real world. At its fundamental core,
[03:23] the text generation process consists of
[03:26] billions of carefully calibrated values
[03:28] executing rapid matrix multiplications.
[03:31] Running a language model means executing
[03:34] a machine's learned skill. Billions of
[03:36] numerical adjustments frozen into a
[03:38] file, performing the math necessary to
[03:41] predict the next logical step in a
[03:43] sequence.
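As a rough illustration of the assembly line described in the transcript, the sketch below strings together an embedding lookup, a simplified attention step, a feed-forward network, and a final softmax over the vocabulary. It is a toy under stated assumptions: the matrices are hand-picked rather than trained, attention uses no learned projections, and residual connections and normalization are omitted. All function and variable names are hypothetical.

```python
import math

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    """Matrix-vector product: the core operation a model runs billions of times."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def attend_last(vectors):
    """Scaled dot-product attention for the last token only (simplified)."""
    d = len(vectors[0])
    query = vectors[-1]
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in vectors]
    weights = softmax(scores)  # how much each earlier token matters
    return [sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(d)]

def feed_forward(x, w_in, w_out):
    hidden = [max(0.0, h) for h in matvec(w_in, x)]  # ReLU nonlinearity
    return matvec(w_out, hidden)

def next_token_probs(token_ids, embedding, w_in, w_out, unembed):
    vectors = [embedding[t] for t in token_ids]   # text -> vectors
    context = attend_last(vectors)                # weigh the context
    features = feed_forward(context, w_in, w_out) # transform the representation
    return softmax(matvec(unembed, features))     # probability of each next token

# Hand-picked toy parameters: vocabulary of 3 tokens, vectors of size 2.
embedding = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
w_in = [[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]]
w_out = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
unembed = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

probs = next_token_probs([0, 2, 1], embedding, w_in, w_out, unembed)
print(probs)  # one probability per vocabulary entry
```

Every number in the four matrices is a parameter; in a downloaded model they come from training rather than being typed in by hand.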