How LLMs Predict the Next Word
This video explains how large language models (LLMs) process a question and generate an answer, using the example of asking about Charles Darwin's most famous book. It describes the steps from tokenization and embeddings to parallel computation, attention mechanisms, and the role of specialized components like attention heads and multi-layer perceptrons.
A question is chopped into tokens (words), each mapped to a series of numbers called embeddings.
LLMs compute predictions for each word in parallel; a visualization shows which next word is expected at each processing step.
For the word 'what', the model quickly predicts 'is'; for 'Charles', it eventually settles on 'Dickens'.
After the prompt, the model predicts typical words following 'wrote', then narrows to 'the' based on context.
The model understands grammatical constructions like 'what is X's most famous Y' and knows that Darwin wrote 'On the Origin of Species'.
Attention heads and multi-layer perceptrons are the basic components; some specialize in grammar, others in facts.
Components communicate via queries, keys, and values—rows of numbers that allow information exchange.
Standard machine learning algorithms find optimal numbers to predict the next word by training on vast text.
LLMs generate answers one word at a time, combining language knowledge and facts through attention mechanisms across many layers. Scaling up these models has led to surprising capabilities.
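The per-layer predictions mentioned in the key points above can be pictured with a toy readout. This is a minimal sketch under loud assumptions: the video does not say how its graph was produced, so the approach here (scoring the intermediate state against one output vector per word after every layer) is only one plausible way to obtain such a picture, and `OUTPUT_VECTORS`, `LAYER_UPDATE`, and all numbers are invented for illustration.

```python
# Hypothetical output vectors: one row of numbers per candidate next word.
OUTPUT_VECTORS = {"in": [1.0, 0.0], "the": [0.0, 1.0]}

# Hypothetical effect of a single layer: nudge the state toward "the".
# A real layer computes its update from attention heads and perceptrons.
LAYER_UPDATE = [0.0, 0.4]


def read_prediction(state):
    """Return the word whose output vector has the largest dot product with the state."""
    def score(word):
        return sum(s * o for s, o in zip(state, OUTPUT_VECTORS[word]))
    return max(OUTPUT_VECTORS, key=score)


def predictions_per_layer(initial_state, num_layers):
    """Apply the toy layer repeatedly and record the predicted word after each layer."""
    state = list(initial_state)
    history = []
    for _ in range(num_layers):
        state = [s + u for s, u in zip(state, LAYER_UPDATE)]
        history.append(read_prediction(state))
    return history


history = predictions_per_layer([1.0, 0.0], num_layers=4)
print(history)
```

In this toy setup the early layers favor a generic continuation and the later layers switch to "the", which mirrors the column described for the word "wrote".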
"The title accurately describes the content; the video thoroughly explains how LLMs work."
What are the basic components of a large language model?
Attention heads and multi-layer perceptrons.
3:10
How does multi-head attention allow components to communicate?
Components send queries and keys; if they match, a value (information) is passed.
4:07
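The query, key, and value exchange in the answer above can be sketched for a single attention head using scaled dot-product attention, the standard formulation. The vectors below are made up for illustration, and the matching is soft rather than all-or-nothing: a query that lines up well with a key gives that key's value most of the weight.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Return the attention weights and the weighted mix of values for one query."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    mixed = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, mixed


# Hypothetical rows of numbers: position 0 offers "an author name",
# position 1 offers something unrelated.
keys = [[4.0, 0.0], [0.0, 4.0]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

# Hypothetical query that asks for an author name, so it matches key 0.
query = [4.0, 0.0]

weights, mixed = attend(query, keys, values)
print(weights)  # nearly all weight lands on position 0
print(mixed)    # so the result is close to values[0]
```

Multi-head attention runs several such heads side by side, each with its own learned queries, keys, and values.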
What is the task that large language models are trained to perform?
Predict the next word.
5:38
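To make "predict the next word" concrete as a training task, here is a minimal sketch of the usual cross-entropy loss for one prediction, plus a few gradient-descent steps. It is heavily simplified: the scores are adjusted directly, whereas a real model adjusts the numbers inside its components through backpropagation. The three-word vocabulary and the learning rate are invented for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability for each word."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_word_loss(logits, target_id):
    """Cross-entropy: small when the correct next word gets high probability."""
    return -math.log(softmax(logits)[target_id])


def gradient_step(logits, target_id, lr=0.5):
    """One descent step; the loss gradient w.r.t. the logits is probs minus one-hot."""
    probs = softmax(logits)
    return [
        x - lr * (p - (1.0 if i == target_id else 0.0))
        for i, (x, p) in enumerate(zip(logits, probs))
    ]


VOCAB = ["is", "the", "dickens"]   # hypothetical tiny vocabulary
logits = [0.0, 0.0, 0.0]           # untrained: every word equally likely
target = VOCAB.index("is")         # the text says "what" is followed by "is"

loss_before = next_word_loss(logits, target)
for _ in range(20):
    logits = gradient_step(logits, target)
loss_after = next_word_loss(logits, target)

print(loss_before, loss_after)
```

The loss starts at log(3), the value for a uniform guess over three words, and falls as the score for the correct word rises.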
What are embeddings?
Series of numbers that each word is mapped to.
0:22
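As an illustration of the mapping from words to series of numbers, here is a minimal sketch. It assumes whitespace splitting and a hand-made table; real tokenizers usually split text into subword pieces, and real embedding values are learned during training. `VOCAB`, `EMBED_DIM`, and the numbers in `EMBEDDINGS` are placeholders.

```python
def tokenize(text):
    """Split on whitespace and lowercase; a stand-in for a real tokenizer."""
    return text.lower().split()


# Hypothetical vocabulary covering only the example question.
VOCAB = ["what", "is", "charles", "darwin's", "most", "famous", "book"]
TOKEN_TO_ID = {token: i for i, token in enumerate(VOCAB)}

EMBED_DIM = 4
# Made-up embedding table: one row of EMBED_DIM numbers per vocabulary entry.
EMBEDDINGS = [
    [round(0.1 * (i + 1) * (j + 1), 2) for j in range(EMBED_DIM)]
    for i in range(len(VOCAB))
]


def embed(text):
    """Look up one embedding row for each token in the text."""
    return [EMBEDDINGS[TOKEN_TO_ID[token]] for token in tokenize(text)]


vectors = embed("what is Charles Darwin's most famous book")
for token, vector in zip(VOCAB, vectors):
    print(token, vector)
```

Each token ends up as a fixed-length row of numbers, which is what the later computation operates on.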
How do large language models generate answers?
One word at a time, using predictions from many layers and attention mechanisms.
5:54
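Generating one word at a time can be sketched as a loop that repeatedly asks for the most likely next word and appends it. The lookup table `TOY_SCORES` below is a hypothetical stand-in for a trained model, and it looks only at the last word, whereas a real model combines information from the whole prompt and the answer so far.

```python
END = "<end>"

# Hypothetical next-word scores standing in for a trained model.
TOY_SCORES = {
    "wrote": {"the": 0.6, "in": 0.2, "a": 0.2},
    "the": {"origin": 0.9, "book": 0.1},
    "origin": {"of": 1.0},
    "of": {"species": 0.8, "the": 0.2},
    "species": {END: 1.0},
}


def predict_next(context):
    """Pick the highest-scoring next word given the context (only the last word is used here)."""
    scores = TOY_SCORES[context[-1]]
    return max(scores, key=scores.get)


def generate(prompt_words, max_new_words=10):
    """Append one predicted word at a time until the end marker or the length limit."""
    words = list(prompt_words)
    for _ in range(max_new_words):
        next_word = predict_next(words)
        if next_word == END:
            break
        words.append(next_word)
    return words


answer = generate(["darwin", "wrote"])
print(" ".join(answer))
```

Always taking the top-scoring word is one simple choice; chatbots often sample from the predicted probabilities instead.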
Understanding Grammar and Facts
Demonstrates that LLMs combine grammatical knowledge and factual knowledge to generate appropriate answers.
2:31
Multi-Head Attention Mechanism
Explains the key communication method between components using queries, keys, and values.
4:07
Training on Text
Highlights that LLMs learn optimal parameters by processing vast amounts of text.
5:23
Surprising Capabilities from Scaling
Notes that scaling models up to enormous sizes has given them capabilities that have surprised the world.
6:31
[00:06] what happens when you enter a question
[00:09] into a chat bot for example suppose you
[00:12] ask what is Charles Darwin's most famous
[00:16] book the sentence you enter is chopped
[00:19] up in parts in words or tokens and each
[00:22] word is mapped to a series of numbers
[00:24] called embeddings and with these
[00:27] embeddings large language models start
[00:29] to calculate. The computations for each
[00:32] word are happening in parallel to each
[00:34] other the following visualization of
[00:36] this process shows which next word is
[00:39] expected at each processing step for
[00:41] each word in
[00:42] parallel let's look at one column in
[00:45] this
[00:46] graph here we see that the large
[00:48] language model guesses almost
[00:50] immediately after seeing the word what
[00:53] that the next word will be is and as we
[00:56] move from the bottom start of the
[00:58] computation to the top the end of the
[01:01] computation this prediction is not
[01:03] changing
[01:05] much let's look at another column the
[01:08] one that processes the word Charles here
[01:11] we see that the large language model
[01:13] does not have much of a clue about what
[01:15] will come next but it eventually settles
[01:18] on
[01:19] Dickens that makes sense Charles Dickens
[01:22] might be the world's most famous
[01:25] Charles all the other columns are worth
[01:27] a look too they are crucial for the proper
[01:30] functioning of the large language model
[01:32] as we will see later but their
[01:35] predictions are not really used because
[01:38] I as a user have already typed in the
[01:41] entire question in my prompt as well as
[01:43] the start of the answer Darwin wrote so
[01:47] let's have a look at what happens when
[01:49] the large language model is asked to
[01:51] take over and answer my
[01:54] question after receiving the word wrote
[01:58] the large language model in the first
[02:00] couple of computation steps predicts the
[02:02] typical words that may follow wrote in
[02:05] English: in, it, that, or a. But in the later
[02:11] steps in the computation the large
[02:12] language model has figured out that
[02:14] given the question that was asked the
[02:17] appropriate answer must start with
[02:20] the so how does this large language
[02:24] model at its 21st layer know it needs to
[02:28] predict the
[02:31] at that stage in processing it should
[02:33] have understood the query well enough to
[02:35] know that the appropriate answer should
[02:37] be the name of the book that is Charles
[02:39] Darwin's most famous one that means it
[02:42] must in some sense understand the
[02:44] grammatical construction what is X's
[02:47] most famous Y and it should know that
[02:50] Darwin wrote On the Origin of Species
[02:53] and that that book is more famous than
[02:55] many other books that Darwin
[02:57] wrote so how do they actually do it well
[03:00] large language models are so large that
[03:03] it is very difficult in fact to tell
[03:05] exactly how they represent the knowledge
[03:07] that they have
[03:08] acquired but we do know how the basic
[03:10] components work that do all the work
[03:13] these basic components are called
[03:15] attention heads and multi-layer
[03:18] perceptrons some components are
[03:21] specialized in aspects of English
[03:22] grammar such as the 's that marks that
[03:25] Darwin is the owner or author of the
[03:28] most famous book other components are
[03:31] specialized in higher order linguistic
[03:33] constructions such as what is X's most
[03:36] famous
[03:37] Y yet other components store factual
[03:41] information one might be specialized in
[03:43] book titles another one might be
[03:46] specialized in finding author names in
[03:47] the
[03:48] input in current large language models
[03:51] there are for each word that is received
[03:54] or generated tens of thousands of these
[03:57] components working in parallel to
[03:59] properly process
[04:01] it these 10,000 components communicate
[04:05] with each other in various ways but one
[04:07] very crucial one is called multi-head
[04:09] attention and it allows components to
[04:12] ask other components for
[04:14] information such requests for
[04:16] information are called
[04:18] queries it also allows components to
[04:21] offer information to other components
[04:23] such messages are called
[04:25] keys and if keys and queries match the
[04:29] requested information then called value
[04:32] is passed
[04:34] on for instance we can imagine the
[04:37] component specialized in author names to
[04:40] send out a key that essentially means I
[04:42] have an author name on offer the
[04:45] component specialized in book titles
[04:47] might send out a query asking for author
[04:50] names and because key and query match
[04:53] the relevant information Charles Darwin
[04:56] is sent from the first to the second
[04:58] component where this in turn is sent to
[05:01] yet another component that has stored
[05:03] book titles associated with specific
[05:07] names what is important to know is that
[05:09] all these messages, keys, queries, values,
[05:13] are just rows of
[05:15] numbers they are difficult to interpret
[05:17] by humans but the computer can quickly
[05:19] pass them from one component to the
[05:21] other and apply the required
[05:23] mathematical operations on
[05:25] them and the fact that they are just
[05:27] rows of numbers makes it possible
[05:29] to run standard machine learning
[05:31] algorithms on them so that the computer
[05:33] can find the optimal set of numbers to
[05:35] perform a given task and in our case
[05:38] that given task is to predict the next
[05:41] word it finds that optimal set by going
[05:44] through an enormous amount of text but
[05:47] once it is trained large language
[05:49] models no longer need access to the
[05:51] internet or the training set so large
[05:54] language models the models underlying
[05:56] chatbots such as ChatGPT work by
[05:59] generating their answers one word at a
[06:01] time there are many layers in these
[06:03] models and in each layer predictions
[06:06] about the next word are computed
[06:08] information from all the words in the
[06:10] prompt as well as in the answer up to
[06:12] the current moment is combined using a
[06:14] mechanism called
[06:16] attention across the layers for all
[06:19] these words in parallel very accurate
[06:21] predictions are often formed combining
[06:24] knowledge about how language works and
[06:26] facts about the
[06:28] world there's no magic here but by
[06:31] scaling up these models to enormous
[06:33] sizes they have acquired capabilities
[06:37] that have surprised the
[06:42] world