Training Sand to Think: Artificial General Intelligence & Future of Physics

Transcribed Jun 15, 2026
Intermediate 11 min read For: General audience with some technical background; physicists, mathematicians, and AI enthusiasts.
88.1K views, 2.0K likes, 266 comments, 93 dislikes, 2.6% 📈 Moderate

AI Summary

The speaker, a theoretical physicist, discusses the transformative impact of large language models (LLMs) on mathematics and physics. He explains how LLMs have evolved from preschool-level performance to surpassing PhD experts in exams, and have recently achieved novel mathematical research, including solving a major open problem. He argues that even without further progress, LLMs will revolutionize physics, and with continued scaling and algorithmic improvements, they will lead to a golden era of scientific discovery.

[00:03]
Extraordinary Moment in History

We've figured out how to refine sand into silicon, turn it into chips, assemble them into neural networks, and train them to think.

[00:29]
Physicist's Shift to AI

The speaker stopped writing theoretical physics papers to contribute to building machines that produce knowledge on an industrial scale.

[01:18]
LLMs as General Intelligence

Large language models are not just special-purpose tools; they can do every part of a theoretical physicist's job, acting as a general intelligence.

[02:52]
How LLMs Work

LLMs are neural networks inspired by the human brain, grown rather than programmed, trained by predicting the next word in text.

[03:10]
Scale of LLMs

LLMs have grown from about a billion parameters in 2020 to a few trillion today, still short of the 100 trillion synapses in the human brain.

[05:37]
Pre-training and Post-training

Pre-training involves predicting the next word on the internet; post-training refines the model to be helpful and polite.

[07:04]
Scaling Laws

Physicists discovered scaling laws for LLMs: performance improves predictably with more compute, leading to the scaling era.

[09:01]
Scaling Law Graph

On a log-log plot, model performance improves along a straight line as training compute grows, a result simple enough to guide investors.

[11:57]
Drivers of Progress

The main driver is algorithmic progress, followed by scaling compute and money; Moore's law is a minor factor.

[14:48]
Early LLM Performance

In 2019, LLMs performed at preschool level; on the MATH benchmark, they scored 6% in 2021, while humans scored 40-90%.

[18:40]
Rapid Improvement

Prediction markets expected 50% by 2025, but LLMs reached 50% almost immediately and 90% by mid-2024, then near-perfect scores.

[20:44]
Techniques for Improvement

Key techniques include scale, better data, chain-of-thought prompting, reinforcement learning for long thinking, and multi-LLM conversations.

[26:11]
Graduate-Level Science

On the GPQA benchmark (graduate-level science), LLMs went from random guessing to perfect scores within 18 months.

[29:07]
Private Test Set

The speaker's own graduate exams from Stanford were solved with 100% accuracy by LLMs within 18 months.

[30:47]
International Math Olympiad

In 2024, LLMs achieved a gold medal score (5/6 problems) at the IMO, with solutions praised as clear and elegant.

[35:52]
Novel Mathematical Research

In late 2024, a centaur-style collaboration produced a novel proof that a top mathematician called 'the kind of insight I would have been proud to have produced myself.'

[47:28]
First Major Breakthrough

In 2026, an LLM solved the unit distance conjecture, a major open problem, marking the first major AI-generated mathematical breakthrough.

[50:55]
Chess Analogy

LLMs in science may follow a similar trajectory to chess computers: toy, tool, centaur, then superhuman, with AI becoming the dominant scientist.

[53:38]
Golden Era of Physics

Even without further progress, LLMs will revolutionize physics; with continued improvement, we'll have billions of AI Einsteins, leading to a golden era.

Large language models have rapidly advanced from preschool-level performance to surpassing PhD experts and making novel research breakthroughs. This progress, driven by scaling and algorithmic improvements, promises a golden era for physics and mathematics, with AI becoming an indispensable collaborator in scientific discovery.

Clickbait Check

85% Legit

"The title accurately reflects the content: the talk covers how LLMs (trained sand) achieve AGI-level reasoning and transform physics."

Study Flashcards (12)

What is the key difference between traditional computer programs and neural networks?

easy

Neural networks are grown, not programmed; they start with random weights and are trained by adjusting pathways based on prediction accuracy.

03:41

How many parameters did the largest LLMs have at the beginning of the decade, and how many do they have now?

easy

About a billion parameters at the beginning of the decade; now up to a few trillion.

03:10

What is the 'bitter lesson' in AI?

medium

The bitter lesson is that scaling up systems (more compute, data, parameters) often outperforms human-designed clever algorithms.

21:41

What technique involves asking an LLM to 'think step-by-step' before answering?

easy

Chain-of-thought prompting.

22:48

What was the first major open problem solved by an LLM?

hard

The unit distance conjecture, a problem posed by the mathematician Erdős.

47:28

What is the estimated cost of an Avogadro flop in today's money?

medium

About a million dollars.

13:45

What are the two stages of training an LLM?

medium

Pre-training (predicting the next word on the internet) and post-training (refining for helpfulness and politeness).

05:37

What score did LLMs achieve on the MATH benchmark in 2021, and what did they achieve by mid-2024?

medium

6% in 2021; 90% by mid-2024.

16:48

What is the name of the benchmark for graduate-level science questions?

easy

GPQA.

26:21

What is the 'centaur' model in AI research?

medium

A human working collaboratively with an AI (like a large language model) to do research.

36:16

How many times faster are LLMs progressing compared to human students?

hard

About four times as fast; for every year that passes, they advance four years into the future.

19:59

What is the main driver of progress in LLMs over the last decade?

medium

Algorithmic progress (human ingenuity) is the number one contributor.

12:48

💡 Key Takeaways

💡

Turning Sand to Thought

Captures the essence of the talk: the journey from raw materials to artificial intelligence.

00:03
💡

LLMs as General Intelligence

Key claim: LLMs are not just special-purpose tools; they can do every part of a physicist's job.

01:18
📊

Scaling Laws from Physics

Physicists discovered that LLM performance improves predictably with scale, driving investment and progress.

07:04
⚖️

The Bitter Lesson

Explains why scaling often beats human-designed algorithms, a key principle in AI.

21:41
📊

LLMs Win Math Olympiad Gold

Demonstrates that LLMs can achieve creative problem-solving at the highest level of high school mathematics.

30:47
📊

First Major AI Breakthrough

LLM solves a famous open problem, marking a milestone in AI mathematics.

47:28
💡

Billions of AI Einsteins

Envisions a future where cheap, superhuman AI scientists revolutionize physics.

53:38

✂️ Creator Tools: Viral Hooks

AI-generated clip ideas for Shorts based on the transcript

Training sand to think: AGI explained

48s

The opening metaphor of refining sand into silicon and training it to think is visually striking and immediately hooks viewers with the promise of explaining AGI.

LLMs beat PhDs in science exams

60s

The claim that LLMs now score 100% on graduate-level science exams challenges assumptions about machine intelligence and sparks debate.

AI solves International Math Olympiad

60s

The milestone of an AI winning a gold medal in the IMO, a benchmark of human creativity, is a dramatic and shareable achievement.

First major AI math breakthrough

60s

The announcement that an LLM solved a long-standing open problem (Erdős's unit distance conjecture) is a concrete, historic achievement that excites both experts and the public.

Why LLMs will keep getting smarter

60s

The speaker's confident prediction that current methods are sufficient for AGI, backed by scaling laws and algorithmic progress, is both provocative and newsworthy.

[00:01] [applause]

[00:03] >> Okay, thank you. Absolutely delighted to

[00:05] be here. We live at an extraordinary

[00:08] moment in our civilization's history.

[00:11] We have collectively figured out how to

[00:13] refine sand into silicon, then

[00:16] take that silicon and turn it into

[00:18] silicon chips, then assemble those

[00:20] silicon chips into neural networks, and

[00:23] now how to train those neural networks

[00:25] to think.

[00:27] So, I've written about 40 theoretical

[00:29] physics papers in my career so far, but

[00:32] I've stopped. And I've stopped cuz it

[00:35] felt like too much of a guilty

[00:36] pleasure to handwrite

[00:39] theoretical physics papers one by one,

[00:41] when what I should be doing is

[00:42] contributing to the

[00:44] production of a machine that is going to

[00:46] spew out knowledge on an industrial

[00:48] scale.

[00:49] We've of course had for many years now

[00:52] uh computer assistance in doing physics,

[00:56] going back to uh the invention of the

[00:58] pocket calculator, or perhaps even

[01:00] further back to the abacus. Uh

[01:04] This This one is different. Those Those

[01:06] are special purpose tools that we've

[01:08] been using for particular parts of the

[01:11] physics enterprise.

[01:14] They uh help you as one step, and you

[01:16] have to do the rest.

[01:18] What's new is something that we didn't

[01:21] know at the beginning of the decade, but

[01:24] those of us who live in San Francisco

[01:25] certainly think we know now, which is

[01:27] that we know about the large language

[01:30] model. And a large language model has

[01:33] the capability not just to be a special

[01:35] purpose tool that replaces one part of

[01:39] the stack, but in fact to do every

[01:41] single part of my job as a theoretical

[01:44] physicist. It is a general intelligence,

[01:46] and we think that large language models

[01:48] will be the substrate on which we build

[01:50] these general intelligences.

[01:53] Uh what I'm going to tell you about

[01:54] today is using large language models to

[01:57] do

[01:58] maths and physics. I'm going to tell you

[02:00] about the recent past of this process

[02:03] and the successes we've seen, the

[02:04] extraordinary progress indeed that we've

[02:06] seen over the last half decade. I'm

[02:09] going to tell you where we are today and

[02:11] I'm going to tell you a little bit about

[02:12] where I think we're going.

[02:14] Uh but first of all, uh I should remind

[02:16] you what a large language model is.

[02:19] Um I hope uh you by now have used

[02:21] one. You can just use one. You can just

[02:23] go to one of these websites and just

[02:25] start talking to them and it'll talk

[02:26] back. Uh it'll talk back in a way that

[02:29] quietly passed the Turing test a couple

[02:31] of years ago and nobody nobody really

[02:33] celebrated it. So, we have Gemini, which

[02:36] is the one that I contribute to.

[02:39] Um and also some others, ChatGPT,

[02:42] Claude, many other options out there, uh

[02:45] all pushing the frontier of machine

[02:47] intelligence.

[02:50] Um at base, a large language model is a

[02:52] kind of neural network. It is an

[02:54] artificial

[02:56] computing device

[02:58] inspired by the human brain, inspired by

[03:00] the arrangement of the neurons in the

[03:01] human brain.

[03:03] Uh and therefore quite unlike uh

[03:05] traditional computer programs.

[03:07] Um

[03:09] At the beginning of the decade, the

[03:10] largest

[03:12] uh large language models had about a

[03:14] billion parameters. That was considered

[03:15] extraordinarily large at the time and

[03:17] they were called large language

[03:18] models on that basis. Now, we're up to a

[03:20] few trillion.

[03:22] This is still short of the 100 trillion

[03:24] synapses in the human brain, but it

[03:26] turns out it it suffices.

[03:31] Um and the one thing you need to know

[03:33] about neural networks, all neural

[03:35] networks including large language

[03:36] models, is that they are not made like

[03:39] traditional computer programs. They are

[03:41] grown, not programmed.

[03:44] What you do is you start off with an

[03:45] assembly of artificial neurons connected

[03:48] with artificial synapses

[03:50] uh with essentially random weights.

[03:53] And then you ask it to start speaking.

[03:54] It'll start outputting words one after

[03:57] the other. And what you'll find is that

[03:59] those words are complete gibberish.

[04:00] It'll just be uh, totally random words

[04:03] at the beginning.

[04:05] And then you train the neural network.

[04:06] You you grow, if you like, the neural

[04:08] network. Grow, you don't change the

[04:09] number of of neurons, but you change the

[04:12] neural pathways.

[04:13] And the way you train them is that you

[04:16] feed it some text and you encourage it

[04:18] to predict

[04:20] given a block of text, maybe,

[04:22] you know, some section of a book

[04:24] you read on the internet, uh,

[04:27] predict having seen the first 100 words

[04:30] what the next word is likely to be. And

[04:32] as I said, it'll just guess at random to

[04:34] begin with. But every time it guesses

[04:36] right, you strengthen that

[04:39] neural pathway. And every time it

[04:41] guesses wrong, you punish that neural

[04:43] pathway. And so slowly over time, you

[04:46] build up some predictive capability for

[04:49] it to be able to predict what the next

[04:51] word is with better and better

[04:53] accuracy. And once it can predict the

[04:54] next word, you can

[04:56] uh, then just take that next word,

[04:59] assume it's the next word, and then

[05:00] it'll just just start talking to you.

[05:02] And that's how the chatbots work that I

[05:04] described.
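
To make the training loop concrete, here is a minimal Python sketch. It is a stand-in under heavy assumptions, not how any real LLM is built: a count-based bigram table replaces the neural network, "strengthening a pathway" becomes incrementing a count, and the toy corpus and the function names (`train_bigram`, `predict_next`, `generate`) are invented for illustration. It leaves out the "punish" half of the update, since raw counts only reinforce what was observed. What it does show are the two steps described above: learn to predict the next word from text, then feed each prediction back in to keep talking.

```python
from collections import Counter, defaultdict


def train_bigram(text):
    """Count how often each word is followed by each other word."""
    counts = defaultdict(Counter)
    words = text.split()
    for current, nxt in zip(words, words[1:]):
        counts[current][nxt] += 1  # reinforce the continuation that was seen
    return counts


def predict_next(counts, word):
    """Return the most frequent continuation seen in training, or None."""
    followers = counts.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


def generate(counts, start, max_words=5):
    """Treat each predicted word as the new context and keep going."""
    out = [start]
    while len(out) < max_words:
        nxt = predict_next(counts, out[-1])
        if nxt is None:
            break
        out.append(nxt)
    return out


# Hypothetical toy corpus; a real model sees tens of trillions of words.
corpus = "the cat sat on the mat and the cat sat on the rug"
model = train_bigram(corpus)
print(predict_next(model, "the"))
print(" ".join(generate(model, "cat", max_words=4)))
```

With so little text the table can only parrot its corpus, which mirrors the point made next in the talk: the quality of the predictions depends on how many words the model has seen.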

[05:06] It's a slow process. Once it's seen

[05:08] about a million words, you've trained it

[05:09] on a million words, it's still spewing

[05:12] stuff that's pretty much

[05:13] indistinguishable from gibberish. Once

[05:15] uh, you're up to tens of millions and

[05:16] hundreds of millions and billions, it

[05:18] can string together completely coherent

[05:20] sentences. It knows the the rules of

[05:22] grammar. It it puts sentences together,

[05:25] but they're not particularly uh, refined

[05:26] sentences. And by the time it's read the

[05:28] entire internet, which is tens of

[05:30] trillions of words, uh, it can do uh, it

[05:33] can converse intelligently on on pretty

[05:35] much any topic.

[05:37] Uh, that's called pre-training and

[05:39] that's most of what you do,

[05:40] just training it to predict the next

[05:42] word on the internet. There's a a second

[05:44] stage to the process called

[05:46] post-training, uh, in which you

[05:48] essentially send it to finishing school.

[05:50] When it comes off pre-training, it is

[05:51] just trained to predict what the next

[05:54] word is going to be in its

[05:55] training corpus.

[05:57] Uh and it is, you know, somewhat uncouth

[06:00] and uh definitely disobedient. You need

[06:03] to send it to finishing school called

[06:05] post-training where you train it to uh

[06:08] only be polite and you train it to try

[06:10] and be helpful to the user rather than

[06:12] just predict what the next word would

[06:13] be. That's called post-training.

[06:16] Um and that in brief is how you make a

[06:18] large language model.

[06:19] Uh and the modern large

[06:22] language models with a few trillion

[06:23] parameters, it takes a huge amount of

[06:25] computing power to produce them. It's a

[06:28] few trillion parameters, a few tens of

[06:30] trillions of words. You multiply

[06:31] that together and you get trillions and

[06:33] trillions of flops required to make

[06:35] them.
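
The multiplication can be sketched in a few lines of Python. Two things here are assumptions rather than statements from the talk: the factor of 6 FLOPs per parameter per token is a commonly used rule of thumb for training cost, and the specific values (2 trillion parameters, 20 trillion tokens) are illustrative picks for "a few trillion" and "a few tens of trillions". Words and tokens are also treated as interchangeable, which is only roughly true.

```python
AVOGADRO = 6.02214076e23  # exact value under the SI definition


def training_flops(params, tokens, flops_per_param_token=6.0):
    """Rough training cost: a constant times parameters times tokens."""
    return flops_per_param_token * params * tokens


params = 2e12  # assumed: "a few trillion parameters"
tokens = 2e13  # assumed: "a few tens of trillions of words"

flops = training_flops(params, tokens)
moles_of_flops = flops / AVOGADRO

print(f"{flops:.2e} FLOPs, roughly {moles_of_flops:.0f} moles of FLOPs")
```

Expressing the result in units of Avogadro's number anticipates the "moles' worth of flops" measure used later in the talk.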

[06:36] Um

[06:38] Okay. So, that's that's large language

[06:39] models, uh which is what we're going to

[06:41] be talking about. And we're going to

[06:42] specifically be talking about them doing

[06:44] theoretical science.

[06:46] Um

[06:47] Before I begin, I should explain, you

[06:48] know, this sounds like computer

[06:49] science, how physicists got

[06:51] involved. Well, physicists have been

[06:53] involved in in every step of this

[06:54] process. But one particular

[06:57] uh pretty striking way in which they're

[06:58] involved at the start of the decade that

[07:00] launched the entire modern LLM boom was

[07:04] through scaling laws. So, physicists

[07:06] just uh love scaling laws. That's our

[07:08] that's our bread and butter.

[07:09] Um you know, some of the scaling laws

[07:11] are uh simple. If you double double

[07:13] Alice's height, you'll quadruple her

[07:15] area and octuple her weight. When

[07:18] it's that simple, it's called

[07:20] dimensional analysis.

[07:22] But not all scaling laws are that

[07:23] simple.

[07:25] So, a uh empirical scaling law that was

[07:27] discovered uh almost almost 100 years

[07:30] ago relates the mass of an animal

[07:33] to its power output, to its metabolic

[07:36] rate.

[07:37] And what you find is is what is typical

[07:39] in these scaling laws is you plot

[07:40] everything on a log-log plot and you

[07:44] find that over many, many orders of

[07:45] magnitude it's a straight line, which on

[07:48] a log-log plot tells you it's a

[07:49] polynomial relationship all the way from

[07:52] the the tiny mouse to the mighty

[07:54] elephant.
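
The link between a straight line on a log-log plot and a power law can be checked numerically. If y = a·x^k, then log y = log a + k·log x, so the slope of the line is the exponent k. The sketch below is illustrative only: the helper name `loglog_slope` and the sample heights are invented, and the data come from the doubling example above (area goes as height squared, weight as height cubed).

```python
import math


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    mean_x = sum(log_x) / len(log_x)
    mean_y = sum(log_y) / len(log_y)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    den = sum((a - mean_x) ** 2 for a in log_x)
    return num / den


heights = [1.0, 2.0, 4.0, 8.0]      # each step doubles the height
areas = [h ** 2 for h in heights]    # doubling height quadruples area
weights = [h ** 3 for h in heights]  # doubling height octuples weight

print(round(loglog_slope(heights, areas), 6))
print(round(loglog_slope(heights, weights), 6))
```

The recovered slopes are the exponents 2 and 3. An empirical scaling law is found the same way in reverse: measure the data, plot it on log-log axes, and read the exponent off the slope.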

[07:55] Um and this, you know, like many of the

[07:57] scaling laws discovered by physics,

[07:59] well, this was first an empirical

[08:00] discovery

[08:01] uh

[08:02] uh in which physicists were not involved

[08:05] um and it actually has a rather curious

[08:06] feature, a curious feature you know,

[08:08] common to many of the the scaling laws

[08:10] that physicists deal with.

[08:11] Um and the curious feature is you might

[08:13] imagine the power output of an animal

[08:15] should be proportional to its mass, that

[08:17] every kilogram of of your flesh,

[08:19] uh you know, burns metabolically at the

[08:22] same rate, but that's not true.

[08:24] Actually, the larger you are, the less

[08:26] each kilogram burns. Uh this was an

[08:29] empirical discovery first uh by by

[08:31] Kleiber, only much later understood by

[08:34] physicists as a consequence of the

[08:35] fractal dimension of our vascular

[08:37] system.
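
A short sketch shows what "the larger you are, the less each kilogram burns" means in numbers. The talk does not quote the exponent; the value 3/4 used below is the one usually given for Kleiber's law, and the body masses (a 20 g mouse, a 5,000 kg elephant) are rough assumed figures. The prefactor is dropped, so only ratios are meaningful.

```python
KLEIBER_EXPONENT = 0.75  # assumed: the commonly quoted value, not from the talk


def relative_metabolic_rate(mass_kg, exponent=KLEIBER_EXPONENT):
    """Whole-animal metabolic rate in arbitrary units (prefactor omitted)."""
    return mass_kg ** exponent


def per_kg_rate(mass_kg, exponent=KLEIBER_EXPONENT):
    """Metabolic rate per kilogram of body mass, in the same arbitrary units."""
    return relative_metabolic_rate(mass_kg, exponent) / mass_kg


mouse_kg = 0.02       # assumed typical mass
elephant_kg = 5000.0  # assumed typical mass

ratio = per_kg_rate(mouse_kg) / per_kg_rate(elephant_kg)
print(f"A kilogram of mouse burns about {ratio:.0f}x faster than a kilogram of elephant")
```

With an exponent of exactly 1 the ratio would be 1, which is the naive expectation the speaker says turns out to be false.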

[08:39] Um all of this is mainly meant to be

[08:40] warm-up to the idea that

[08:44] uh we love scaling laws and so what we

[08:46] did when we found large language models

[08:48] was to make scaling laws for them. And

[08:50] this this scaling law is the most famous

[08:54] contribution of theoretical physicists

[08:57] to computer science uh and also started

[08:59] the modern LLM boom.

[09:01] And the scaling law says, if you make a

[09:04] bigger neural network, or more

[09:06] precisely, if you spend more compute

[09:08] training a neural network, and you scale

[09:11] it appropriately in size and training

[09:13] length,

[09:14] uh how much better performance do you

[09:16] get? So, better performance is down and

[09:18] what what was empirically discovered is

[09:21] that this is linear on a

[09:24] log-log plot like this. Uh there's no

[09:26] law of

[09:28] nature that said it had to be like this,

[09:30] but empirically, this is what it turns

[09:32] out to be.

[09:34] Um discovered by some physicists in

[09:36] 2020. Uh and and this is great. Uh this

[09:40] plot is so simple that even a venture

[09:43] capitalist can understand it.

[09:45] And it told them that if they poured in

[09:47] compute, as in money, they would get

[09:50] better performance for some for some

[09:51] definition of performance, which is

[09:53] basically accuracy of predicting the

[09:55] next word on the internet.

[09:57] Um and this, you know, original scaling

[10:00] law was was over uh eight orders of

[10:03] magnitude. We've now extended it eight

[10:05] further orders of magnitude uh out out

[10:08] to the right.

[10:09] Uh and it it it pretty much continues to

[10:11] hold.

[10:12] Um

[10:13] Okay. Large language models get

[10:15] predictably better with scale.

[10:18] This led to what's called the scaling

[10:19] era, where we've been scaling up neural

[10:21] networks, large language models

[10:23] furiously ever since.

[10:25] Uh and

[10:27] uh and that has characterized the last 6

[10:28] years.

[10:29] Um you know, I'm going to show you a lot

[10:32] of uh straight lines on uh on graphs,

[10:35] and kind of

[10:37] invite you to imagine what happens if

[10:39] those straight lines that really have no

[10:41] business being straight, but

[10:42] nevertheless are, and invite you to

[10:43] imagine what will happen if those lines

[10:45] continue to be straight for just a

[10:46] little bit longer. That's going to be

[10:48] part of my part of my talk. Um

[10:51] The original straight line on a graph,

[10:54] of course, was Moore's law. Um Moore's

[10:57] law says that uh over time, it's, you

[11:00] know, slightly cheating cuz the x-axis

[11:01] isn't isn't some uh you know, physical

[11:04] parameter, it's just date. Uh but over

[11:07] over a century now, the price of compute

[11:09] has been dropping

[11:11] uh exponentially,

[11:12] making a linear line on a on a

[11:14] logarithmic plot.

[11:16] Um and there's really no reason why why

[11:18] that should be so, and yet that has been

[11:20] a a feature of our world for the last

[11:22] for the last century. I'm mainly showing

[11:24] this to you to emphasize how little this

[11:28] has to do with the subject of today's

[11:29] talk. The en- the entire

[11:32] era in which today's talk is going to

[11:35] focus on is just going to be the time

[11:37] over the last 5 years. That's just uh

[11:40] the very rightward edge of this.

[11:42] Compute has really not improved that

[11:43] much in terms of

[11:45] uh

[11:46] in terms of

[11:48] uh

[11:49] uh the orders of magnitude we're going

[11:50] to see. It is only a a third and minor

[11:53] driver of progress that I'm going to

[11:54] describe over the last few

[11:57] years.

[11:57] Uh a much larger driver

[11:59] of progress has been merely

[12:01] that we're willing to take the same

[12:03] chips and just buy many many more of

[12:06] them, assemble them all together in a

[12:09] massive data center, and apply them to

[12:11] the business of training large language

[12:13] models. And as you can see, the amount

[12:15] of flops going to training frontier AI

[12:17] models has increased by a factor of four

[12:20] uh every year since 2010.

[12:24] Um similarly, the amount of money that

[12:25] we have devoted to training these

[12:27] uh has similarly been going up and up

[12:29] and up. It's been going up uh in this

[12:31] graph at 2.7x per year over over the

[12:34] last uh decade. It is exponentially

[12:37] growing amount of resources we are

[12:39] throwing at the business of training

[12:41] large language models.
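
To get a feel for what those yearly multipliers compound to, here is a small sketch. The rates (4x per year for training compute, 2.7x per year for spend) are the ones quoted above; the ten-year horizon and the function names are chosen only for illustration.

```python
import math


def cumulative_growth(rate_per_year, years):
    """Total multiplier after compounding a yearly rate."""
    return rate_per_year ** years


def doubling_time_months(rate_per_year):
    """Months needed to double at a constant yearly growth rate."""
    return 12 * math.log(2) / math.log(rate_per_year)


for label, rate in [("training compute", 4.0), ("training spend", 2.7)]:
    total = cumulative_growth(rate, 10)
    months = doubling_time_months(rate)
    print(f"{label}: x{total:.3g} over 10 years, doubling every {months:.1f} months")
```

At 4x per year the compute doubles every six months and grows about a millionfold in a decade, which is why the speaker treats hardware price improvements as a minor driver by comparison.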

[12:42] Even that is actually only the second

[12:44] most important contributor to the

[12:46] progress in large language models. The

[12:48] number one most important contributor to

[12:50] the growth of large language

[12:52] models and improvements of large

[12:53] language models over the last decade has

[12:55] been algorithmic progress. It's been

[12:57] human ingenuity at figuring out how to

[13:00] build these machines better than we

[13:01] previously knew how to build these

[13:03] machines. Huge amounts of thought have

[13:06] gone into shearing away all the

[13:09] inefficiencies in the way we train them,

[13:12] to understand these systems better, and

[13:14] so as to improve them more rapidly.

[13:16] Um

[13:19] And then what this plot is meant to

[13:20] persuade you is that there's no reason

[13:23] to stop. At least there's no reason to

[13:24] stop based on uh economics or on chips.

[13:29] So, a a good rule of thumb, you know,

[13:31] these things take so many flops, so much

[13:34] computation to run that a good rule of

[13:36] thumb, you know, you have to measure

[13:37] them in Avogadro's numbers of flops,

[13:40] you know, moles' worth of flops. So, some

[13:43] ginormous number. A good rule of thumb

[13:45] is that an Avogadro flop costs about a

[13:46] million dollars in today's

[13:49] money to train.

[13:51] And as we see since 2020, the size of

[13:53] these training runs has been growing as

[13:54] I said exponentially up from about half

[13:57] a million

[13:58] in 2020 to about a third of a billion

[14:01] last year.

[14:03] Uh

[14:04] The point really, and, you know,

[14:06] this on the left

[14:08] is

[14:09] a graph from people who investigated

[14:11] this closely, is that those numbers are

[14:12] still pretty small. US GDP is

[14:15] approaching 30 trillion dollars per

[14:17] year. Global GDP is even bigger. We have

[14:19] a very long runway to go before we are

[14:21] converting most of our GDP into training

[14:24] runs. We can scale up many, many more

[14:26] decades and we will but we only will if

[14:31] it's worth it. No one's going to

[14:33] give us trillions of dollars to train

[14:35] ginormous large language models if the

[14:38] only thing we're doing is getting better

[14:39] at predicting the next word on the

[14:41] internet. We need performance. So, what

[14:43] does this buy us?
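
The figures in this passage can be tied together with a little arithmetic. The sketch below makes several assumptions: "half a million" and "a third of a billion" are read as dollar costs of the largest training runs, "last year" is taken to be five years after 2020, and US GDP is rounded to 30 trillion dollars. The rule of thumb of one million dollars per Avogadro's number of FLOPs is the one quoted in the talk.

```python
import math

DOLLARS_PER_AVOGADRO_FLOP = 1e6  # rule of thumb quoted in the talk

cost_2020 = 0.5e6        # assumed: dollars, "about half a million"
cost_last_year = 1e9 / 3  # assumed: dollars, "about a third of a billion"
years_elapsed = 5        # assumed: "last year" is five years after 2020
us_gdp = 30e12           # dollars per year, rounded


def implied_annual_growth(start, end, years):
    """Constant yearly multiplier that takes start to end in the given years."""
    return (end / start) ** (1.0 / years)


def orders_of_magnitude(small, large):
    """How many factors of ten separate two quantities."""
    return math.log10(large / small)


growth = implied_annual_growth(cost_2020, cost_last_year, years_elapsed)
moles_last_year = cost_last_year / DOLLARS_PER_AVOGADRO_FLOP
headroom = orders_of_magnitude(cost_last_year, us_gdp)

print(f"implied growth: {growth:.1f}x per year")
print(f"largest run: about {moles_last_year:.0f} moles of FLOPs")
print(f"headroom before one run equals US GDP: {headroom:.1f} orders of magnitude")
```

Under these assumptions the implied growth is roughly 3.7x per year, and a training run would have to grow by about five orders of magnitude before it cost as much as a year of US GDP. That is the long runway the speaker refers to, subject to the condition that the performance has to be worth the money.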

[14:46] Um

[14:48] So, I'm going to drag us back to ancient

[14:51] history, which is 5 years ago. In

[14:54] this world, this is just

[14:56] absolutely the Stone Age. In fact, if we

[14:58] go all the way back to 2019, well,

[15:00] there's different ways to define the

[15:02] strength of a scientist,

[15:03] but by pretty much any one of those

[15:05] ways, if you go back to 2019 the

[15:08] strength of a large language model was

[15:10] no better than a than a preschooler. It

[15:12] really couldn't string together a coherent

[15:15] sentence, much less combine those

[15:17] sentences into ideas.

[15:19] Um and then we measured in those days

[15:22] the progress by performance on on

[15:24] benchmarks. Uh and we we still do, but

[15:27] just the benchmarks have changed as I

[15:28] will describe.

[15:30] Uh so a famous and early influential

[15:31] benchmark was called MATH. It was a high

[15:33] school math uh benchmark. And uh the

[15:37] creators of this benchmark, who are

[15:38] these uh fellows on the right, uh just

[15:43] went and scraped all sorts of high

[15:46] school math problems from the internet.

[15:49] Uh and then gave them to the large

[15:51] language model and said, "Large language

[15:53] model, are you able to solve these

[15:54] problems?" And here's a sampling of

[15:56] them. Um level one:

[16:00] 11% of what number is 77?

[16:03] Uh I think I I could do that one.

[16:06] Um

[16:07] uh all the way up to really reasonably

[16:08] challenging

[16:09] uh

[16:11] uh level five problems.

[16:13] Um and so before they gave them to large

[16:14] language models, they first gave them to

[16:15] a human.

[16:17] Uh we evaluated humans on MATH and found

[16:18] that a computer science PhD student who

[16:20] did not especially like mathematics

[16:22] attained approximately 40%,

[16:24] while a three-time International Math

[16:26] Olympiad gold medalist attained 90%,

[16:28] indicating that MATH can be challenging

[16:30] even for humans.

[16:32] Um so so there we are. Uh peak human

[16:35] about 90%, lazy graduate student who

[16:37] should be somewhat ashamed of uh

[16:41] him or herself got about 40%. Uh but

[16:43] that was still considerably better than

[16:45] the state of the art exactly 4 years ago

[16:48] today. Uh the state of the art 4 years

[16:50] ago today was that large language models

[16:53] could get

[16:54] 6%. Now of course it just shows what the

[16:57] difficulty is here.

[16:59] Computers have been able to calculate

[17:00] 11% of what number is 77 for a very long

[17:03] time.

[17:04] Uh the problem is, I mean,

[17:06] there's many problems, but

[17:07] the problem in those

[17:09] days was just parsing it. Just, what

[17:10] does this even mean? Uh

[17:12] what are these sentences? It's not

[17:13] constructed as something you you type

[17:15] into a pocket calculator. There's a step

[17:17] where they need to understand what

[17:18] they're asking and then and then do it.

[17:20] And large language models were so bad at

[17:22] that that they could

[17:23] barely do better than just random

[17:25] guessing.
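
The gap between calculating and parsing can be illustrated with a deliberately brittle sketch. The arithmetic for the level-one question is a single division (77 divided by 11/100 is 700); the hard part is recognizing what is being asked. The regular-expression template and the function name `solve_percent_question` below are hypothetical, written only to show how a hand-coded parser handles one phrasing and fails on a rewording of the same question.

```python
import re
from fractions import Fraction

# Hypothetical template: matches exactly one way of phrasing the question.
PATTERN = re.compile(r"(\d+)% of what number is (\d+)\??", re.IGNORECASE)


def solve_percent_question(question):
    """Solve 'P% of what number is R?' or return None if the phrasing differs."""
    match = PATTERN.fullmatch(question.strip())
    if match is None:
        return None  # the template does not recognize this wording
    percent, result = (int(group) for group in match.groups())
    return Fraction(result * 100, percent)  # solve (percent / 100) * x = result


print(solve_percent_question("11% of what number is 77?"))
print(solve_percent_question("Seventy-seven is 11% of which number?"))
```

The first call returns 700 and the second returns `None`, even though a person reads both as the same problem. A benchmark of scraped word problems is full of such variation, which is why understanding the sentence, not doing the sum, was the bottleneck.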

[17:27] Um so they were pretty bad and at the

[17:28] time, you know, I'm going to bring you

[17:30] on this journey, which is a journey of

[17:32] what it felt like to be working on large

[17:34] language models 4 years ago and up to

[17:35] the present day and how

[17:39] expectations have consistently uh have

[17:41] been beaten again and again and again.

[17:44] Um so, you know, you may ask sort of

[17:46] what did people think was going to

[17:48] happen? Uh and actually conveniently you

[17:50] don't have to ask because the people who

[17:52] made the benchmark also made a

[17:55] prediction market for how well people

[17:57] would do on the benchmark in the future.

[17:58] And this was what the prediction market

[18:00] said. It said, you know, 6% uh in 2021

[18:04] uh and then it would slowly increase

[18:07] uh year after year after year and by

[18:08] 2025 we'd be getting 50%.

[18:11] And the people who made the benchmark

[18:12] were just utterly incredulous at this uh

[18:15] and said, "Forecasters predict more than

[18:17] 50% accuracy by 2025. If I imagine an ML

[18:20] system getting more than half of these

[18:22] questions right, I'll be pretty

[18:23] impressed. This still just seems wild

[18:25] to me and I'm really curious how the

[18:27] forecasters are reasoning about this."

[18:29] I think, uh you know, for a Bay Area

[18:30] rationalist that is a uh as close as he

[18:33] comes to doubting the efficient market

[18:35] hypothesis. He just can't believe this

[18:37] prediction that it's going to be 50%.

[18:40] Um and then what we did uh is we got 50%

[18:43] almost immediately thereafter with the

[18:45] system we called uh Minerva.

[18:47] Um and then by mid-2024 we'd made Max

[18:50] Math, which is a system uh built on large

[18:52] language models that got 90%. In fact,

[18:55] beat what's, you know, what was taken to

[18:57] be peak peak human. Um we were extremely

[19:00] pleased with ourselves for getting 90%.

[19:02] We celebrated by going out to a '90s

[19:04] roller disco to celebrate getting 90 uh

[19:07] 90% and um we were just

[19:11] unbelievably smug. And then such is the

[19:13] cruelty of this field that 6 months

[19:15] later just the off-the-shelf large

[19:17] language models got it almost perfectly

[19:19] right. This is, you know, a very

[19:20] depressing aspect of working in this

[19:22] field that you work extremely hard and

[19:24] then the next generation of models come

[19:26] along and just basically one shot it.

[19:29] Um

[19:31] Okay, and that's it. It's dead. The math

[19:33] benchmark is dead. Uh this is the sad

[19:35] fate of a benchmark in

[19:38] today's LLM era that it goes in

[19:41] pretty short order from being way too

[19:43] hard to be a useful marker of progress

[19:46] to being way too easy to be a useful

[19:47] marker of progress.

[19:50] Um so here we go. We can sort of draw a

[19:51] little line on this plot here as we

[19:53] zoomed from preschool to elementary

[19:55] school to high school uh over the course

[19:57] of those years. Uh a good rule of thumb

[19:59] is that we're moving about four times as

[20:01] fast as a human student moves. For every

[20:05] year that passes, we advance 4 years

[20:07] into the future.

[20:09] Okay. That's, you know, that that's

[20:11] that. Let's go harder. Well, first of

[20:12] all, let's just look at the hardest

[20:14] tranche of these math problems, the

[20:16] hardest 20% of them.

[20:18] Um and you can plot the same thing

[20:20] there as well. The very hardest of these

[20:22] math problems, the so-called level five

[20:24] math problems, again back uh just 3

[20:27] years ago we were really not doing well at

[20:29] all. It was it was pretty close to

[20:30] random guessing. Um and over the course

[20:33] of 2 and 1/2 years went from not much

[20:36] better than random guessing to

[20:38] essentially saturated. Uh the benchmark

[20:41] is now dead.

[20:44] Okay. Maybe I'm going to tell you then a

[20:45] little bit about some of the tools that

[20:47] we use to uh

[20:50] the techniques we use to make these

[20:51] systems better at math and reasoning. Uh

[20:55] this is just a a a snapshot really of

[20:57] them um and there's new tools and new

[21:00] ways being developed all the time, but

[21:02] I'll give you I'll give you a an idea of

[21:05] uh in particular what what some of the

[21:07] things that go into it. The main reason

[21:09] I'm doing this is just to convince you

[21:12] that it's not

[21:13] that impressive. Like a lot of the

[21:15] things that we do here are just kind of

[21:17] the obvious thing to do. Somebody tried

[21:19] them, it turned out they worked, and we

[21:20] started to to do it.

[21:22] And therefore, hopefully, to give you

[21:25] a belief that we will continue to find

[21:28] lots of low-hanging fruit for how to

[21:29] make these models better, and convince

[21:31] you that these models are going to

[21:32] continue to get better in the near

[21:33] future. So, the biggest and most

[21:36] you know, biggest reason these models

[21:38] are getting better is what's

[21:39] sometimes called the bitter lesson. It's

[21:41] scale. You just scale these systems up.

[21:45] Um you take a bigger neural network, or

[21:47] you take the same-size neural network,

[21:48] and train it for longer.

[21:50] And you find a way to pour more compute

[21:53] into training these neural networks.

[21:55] This was called the bitter

[21:57] lesson by Rich Sutton, who is a famous

[21:59] Canadian computer scientist. And it's

[22:02] bitter

[22:03] not obviously if you're a large language

[22:05] model, cuz it's it's great. You you get

[22:06] stronger. It's bitter if you're a human

[22:08] who

[22:09] really likes to design very clever

[22:11] systems to do things. You built some

[22:13] super clever way, like we did with with

[22:15] our Max Math result, to

[22:17] eke over some particular result, and

[22:19] then all of your human cleverness is

[22:22] just washed away next time you scale up

[22:23] the model, and the model figures out all

[22:26] your clever tricks for itself. And you

[22:28] might as well have just just worked on

[22:31] scaling up the model. This is a big This

[22:33] is a big recurring theme that each new

[22:35] generation of model is better than even

[22:37] the special purpose models of the

[22:39] previous generation.

[22:42] Again, more and better data. Here's one,

[22:45] just real low-hanging fruit. What's

[22:47] called chain of thought, or asking

[22:48] nicely. And what you do is instead of

[22:50] asking the question,

[22:53] you ask the question, and then before

[22:55] you press enter to send it to

[22:57] the chatbot for the chatbot to answer,

[22:59] you say,

[23:00] "But uh please be careful and think

[23:03] step-by-step."

[23:05] Uh and that sounds just totally insane

[23:07] that that would improve uh performance

[23:10] of the model. And certainly, if you, you

[23:12] know, grew up using conventional

[23:13] computer programs, uh you don't just say

[23:16] to Mathematica or your pocket

[23:18] calculator, "Please be careful before

[23:20] pressing enter." Uh or you can if

[23:22] you like, but it will not improve the

[23:23] performance. For these large language

[23:24] models, they're a very alien kind of

[23:26] intelligence from the traditional

[23:28] programmed computer programs of uh my

[23:31] youth. Uh they are ones with which you

[23:34] can converse, you can plead, and if you

[23:36] tell them to think step-by-step, they

[23:37] will think step-by-step

[23:39] and they will perform better.
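
The "think step-by-step" trick described above amounts to appending one sentence to the question before it is sent. A minimal sketch, assuming a hypothetical helper named `build_prompt` (the name and the exact wording of the suffix are illustrative, not taken from any particular chatbot API):

```python
# Hypothetical helper: builds the text that would be sent to a chat model.
# The suffix mirrors the phrasing quoted in the talk.
COT_SUFFIX = "Please be careful and think step-by-step."


def build_prompt(question: str, chain_of_thought: bool = True) -> str:
    """Return the question, optionally followed by the step-by-step request."""
    question = question.strip()
    if not chain_of_thought:
        return question
    # The instruction goes after the question, just before "pressing enter".
    return f"{question}\n\n{COT_SUFFIX}"
```

Calling `build_prompt("What is 17 * 24?")` yields the question followed by the extra sentence; whether that actually helps depends on the model being asked.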

[23:42] Um just as an anecdote, of course,

[23:44] people then soon iterated over every

[23:46] possible thing you could tell it uh to

[23:49] do before it

[23:51] started. And um "Think step-by-step" was

[23:54] found to be the best. The one that was

[23:56] found to be the worst uh was in fact uh

[23:59] "Come on, kid, you can do it. Don't

[24:01] think, just do."

[24:04] Uh that will in fact uh degrade performance

[24:06] by about 20 percentage points on the

[24:09] question it's about to

[24:10] uh attempt.

[24:13] Okay, another one. Thinking for a long

[24:14] time. This was uh I mean, that sounds

[24:16] obvious, but uh we needed

[24:19] to carefully train these things with

[24:21] reinforcement learning, not just to

[24:22] blurt out the answer.

[24:24] Asking them to think step-by-step will

[24:26] make them think for dozens of words

[24:28] rather than just blurting out their best

[24:30] guess. But, we then needed to carefully

[24:32] train them to think for thousands of

[24:33] words before uh putting out their

[24:35] answer. If you remember in late 2024,

[24:38] there was a mas- there was a

[24:39] uh a model called Strawberry that

[24:41] massively improved performance, and then

[24:42] everybody else caught up pretty quickly.

[24:45] Uh that was exactly this, training these

[24:47] models to think for a very long time.

[24:49] Um reinforcement learning, where

[24:51] uh well, I I I didn't go into that, but

[24:53] you you you train them to

[24:56] um yeah, you train them to do what you

[24:58] want them to do and to try and be more

[24:59] accurate. Um, and nowadays, over the

[25:01] last year, a big technique has been

[25:03] conversations between multiple LLMs. If

[25:06] you

[25:07] if you ever used a large language model,

[25:10] uh, sometimes and you're having it solve

[25:12] a

[25:13] a long and difficult problem, sometimes

[25:15] you find you need to babysit it. You

[25:17] just need to say, "Okay, that's your

[25:19] best guess so far. Can you review your

[25:21] guess and just keep going and try

[25:22] again?"

[25:26] Uh, so people automate that. They have

[25:28] large language models babysit large

[25:29] language models, uh, where it just keeps

[25:31] saying, "Keep trying. Keep trying. Keep

[25:32] trying." Or maybe, uh, beyond that, you

[25:35] then get more sophisticated and you have

[25:37] a a whole

[25:39] uh, conversation amongst a group of

[25:41] large language models, all of which have

[25:42] different roles. One is there to be

[25:44] creative. One is there to come up with a

[25:45] master plan. One is there to take the

[25:47] others' ideas and try and integrate it.

[25:49] One is there to be a skeptic who pushes

[25:51] back on what people are saying.

[25:54] This is also found to greatly

[25:55] improve the performance of large

[25:56] language models: you spend more

[25:59] compute at test time in order to improve

[26:01] performance.
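
The "babysitting" loop described above can be sketched as one model proposing an answer and a second model reviewing it until it is accepted. This is only an illustration under stated assumptions: a "model" here is any Python function from prompt text to reply text, and the names `babysit`, `solver`, and `reviewer`, as well as the "ACCEPT" convention, are hypothetical rather than part of any real system.

```python
from typing import Callable

# A "model" is any function from prompt text to reply text.
# Real systems would call an LLM here; this type is a hypothetical stand-in.
Model = Callable[[str], str]


def babysit(solver: Model, reviewer: Model, problem: str, max_rounds: int = 5) -> str:
    """Ask `solver` for an answer, let `reviewer` critique it, retry until accepted."""
    attempt = solver(problem)
    for _ in range(max_rounds):
        verdict = reviewer(f"Problem: {problem}\nAttempt: {attempt}")
        if verdict.strip().upper().startswith("ACCEPT"):
            break
        # Feed the critique back, as a human babysitter would.
        attempt = solver(
            f"{problem}\nYour best guess so far: {attempt}\n"
            f"Reviewer says: {verdict}\nReview your guess and try again."
        )
    # If max_rounds runs out, the last attempt is returned without a final review.
    return attempt
```

Every extra round costs another pair of model calls, which is the sense in which this spends more compute at test time. The role-based version (creative, planner, integrator, skeptic) would replace the single `reviewer` with several such functions.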

[26:03] Okay, that's some of the That's some of

[26:05] the ideas that we've been using. Uh, and

[26:06] there are there are many others that I

[26:07] could could describe.

[26:09] Um,

[26:11] I talked about high school maths. Now,

[26:12] let's talk about graduate science. This

[26:14] is a considerably

[26:16] trickier

[26:17] benchmark. Uh, a benchmark made later

[26:20] and solved later.

[26:21] Um, GPQA it's called. This is meant to

[26:24] be uh, imitating the kind of problems

[26:26] you would face as a first-year graduate

[26:29] student working towards your PhD. You

[26:31] take some exams at the end of your first

[26:32] year to ensure that you have mastery of

[26:34] your subject.

[26:36] Um,

[26:38] PhDs

[26:40] PhD level experts scored about 70%.

[26:44] Um, here are some examples of the

[26:45] problems. Uh, we're not in high school

[26:47] maths world anymore. Uh, the universe is

[26:49] filled with cosmic microwave background

[26:51] and then it asks you some problem. The

[26:53] idea is that if you were in an adjacent

[26:54] field, you don't know how to answer

[26:55] that. Now, I actually happen to be a

[26:57] physicist, so I do, you know, given a

[27:00] few quiet moments, maybe not on this

[27:03] stage, but in the in the green room, um

[27:05] I could have told you that this was the

[27:06] answer. But, you show me the chemistry

[27:08] version of this problem,

[27:10] uh and I have absolutely no idea. Um

[27:14] arrow appears.

[27:16] Um

[27:17] and uh yeah, GPQA uh a much harder

[27:20] benchmark. Um and uh correspondingly,

[27:23] this graph is shifted about a year

[27:25] compared to the the math benchmark I was

[27:27] describing before. We were essentially

[27:29] random guessing until about the

[27:30] beginning of 2024.

[27:33] Uh and then over the course of 2024 and

[27:36] 2025, we went from random guessing past

[27:39] expert human level, and now they achieve

[27:41] essentially perfect score.

[27:44] I mean, that's it. GPQA is dead. It has

[27:47] once again suffered the fate of all

[27:49] benchmarks,

[27:51] uh and it is no longer useful cuz it is

[27:53] too easy.

[27:54] Now, you might be skeptical of these

[27:58] results. You might think

[28:00] um

[28:01] okay, they can answer these questions

[28:03] correctly, but they can answer these

[28:04] questions correctly not because they

[28:06] have learned how to do maths or learned

[28:08] how to do science. You might think that

[28:10] the reason that

[28:11] they can answer these questions

[28:12] correctly is they have simply memorized

[28:14] the answer to these questions. These

[28:16] questions are on the internet, the

[28:18] answers are on the internet, and they

[28:19] have read the entire

[28:21] internet, and they've memorized the

[28:23] answers.

[28:25] We do not believe that that is what is

[28:26] happening. In fact, there is good

[28:27] evidence that that is not what is

[28:28] happening. The main way you test that is

[28:31] you make look-alike problems. You make

[28:32] problems that are like the ones, you

[28:35] know, seemingly drawn from the same

[28:36] distribution as the ones in the math

[28:39] data set or the GPQA data set, but are

[28:41] not in the GPQA data set, and you see

[28:43] how well they do on those. You make new

[28:45] problems. You give those new problems to

[28:47] the large language models and for

[28:50] reputable large language models, you see

[28:53] little difference in performance between

[28:55] how they do on the established test set

[28:57] and how they do on this held out test

[28:59] set. So, we really think that these

[29:02] systems really are learning how to do

[29:04] maths and physics.
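
The look-alike test described above boils down to comparing accuracy on the established test set with accuracy on freshly written, held-out problems. A minimal sketch, assuming exact-match grading and hypothetical function names (`accuracy`, `contamination_gap`); real evaluations grade answers more carefully:

```python
def accuracy(predictions: list[str], answers: list[str]) -> float:
    """Fraction of predictions that exactly match the reference answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("need at least one question")
    correct = sum(p.strip() == a.strip() for p, a in zip(predictions, answers))
    return correct / len(answers)


def contamination_gap(public_acc: float, held_out_acc: float) -> float:
    """Public-set accuracy minus held-out accuracy.

    A large positive gap would hint at memorization of the public set;
    a gap near zero is what the talk reports for reputable models.
    """
    return public_acc - held_out_acc
```

How large a gap counts as suspicious is a judgment call that depends on the size of the two sets; no threshold is implied here.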

[29:07] Um but just to be sure,

[29:09] um I made my own private test set.

[29:12] Uh exams that I'd given my class about

[29:15] general relativity or quantum mechanics,

[29:17] graduate exams in graduate classes at

[29:19] Stanford.

[29:20] Um never on the internet. I would say

[29:22] they're pretty pretty easy-ish for

[29:24] first-year graduate exams.

[29:27] Um and I hand graded them. So, you know,

[29:29] you don't want to be concerned that

[29:31] there's some problem with your

[29:32] computer grading system. I just hand

[29:34] graded the performance of all these

[29:35] models.

[29:36] Uh and what I found is that from late

[29:38] 2023,

[29:40] over the following 18 months, these

[29:42] models got 100% accuracy.

[29:45] Uh and that's it. My benchmark, sadly,

[29:49] dead.

[29:52] Um okay. So, you know, here here we can

[29:54] plot it forward a few more

[29:56] years as they've, at least as far

[29:58] as exam taking goes, accelerated from

[30:00] preschool, past elementary school, high

[30:03] school, college, and now operating at

[30:05] the PhD level.

[30:08] Uh and then it became very popular to

[30:10] just put out benchmarks.

[30:12] Um you know, this one I think was called

[30:14] Humanity's Last Stand before they changed

[30:16] it to Humanity's Last Exam, but uh it's

[30:18] a great it's a great uh

[30:20] you know, activity trying to

[30:23] uh measure the performance of these

[30:26] large language models and how they do.

[30:28] But every single one of them suffers

[30:30] the same fate: from too hard to be

[30:33] interesting to too easy to be

[30:34] interesting over the course of a year

[30:36] and a half or 2 years.

[30:39] Um

[30:43] The next

[30:44] The next thing to fall was the

[30:47] International Maths Olympiad.

[30:50] Um I was actually giving a version of

[30:51] this talk uh in New York just over a

[30:53] year ago, and a a famous computer

[30:55] scientist who'd won a Turing Award told

[30:57] me that this was all very well, but this

[30:58] was just memorization and retrieval.

[31:01] A large language model would never do

[31:03] something creative like be able to solve

[31:05] an International Maths Olympiad problem

[31:07] it had not seen before.

[31:09] Um this is just over a year ago. Uh

[31:11] International Maths Olympiad, if you

[31:12] don't know what it is, is a Well, it is

[31:15] high school maths, but it is the hardest

[31:17] high school maths in the known universe.

[31:19] It is uh Here are some of the problems

[31:21] on it. This is problem three from this

[31:23] year. Um I have some game, I would say,

[31:27] and I have no idea how to begin

[31:30] answering that question.

[31:32] Um and uh you know, the the smartest

[31:35] 18-year-olds in the world go and uh

[31:37] train for a year or two, uh and then go

[31:39] to compete in this competition, and

[31:41] they're given six problems. Um This is

[31:44] you know, it's different from the other

[31:45] ones because it requires, you know,

[31:47] considered to require real creativity to

[31:49] solve them. It's not You don't just

[31:50] There's no way to just look up the

[31:51] answer, or even just to, you know, to be

[31:54] to follow an established algorithm. It

[31:56] requires real creativity.

[31:58] Um That's why, you know, we were told

[32:00] that the International Maths Olympiad

[32:02] was was a threshold that large language

[32:04] models would never pass.

[32:06] Um And and then last summer, we we

[32:10] passed it. In fact, we got five of the

[32:11] six problems exactly correct. Um We got

[32:14] a gold medal in the International Maths

[32:16] Olympiad. Um

[32:18] And you know, there are only a very

[32:19] small number of humans in the world now

[32:21] who are better uh than the AIs, than

[32:25] the large language models, at doing this. There were a

[32:26] very limited number of humans who got

[32:27] six out of six uh correct.

[32:30] Um And and there's a sort of pleasing

[32:32] aspect to this as well, which is it's

[32:35] not just answering these

[32:37] questions by dumping some inscrutable

[32:40] billion line

[32:42] tangle of formal mathematics that

[32:46] gives you no idea why it's correct.

[32:48] Here's what the president of the

[32:49] International Math Olympiad had to say.

[32:51] We can confirm that Google DeepMind has

[32:53] reached the

[32:54] much-desired milestone, earning 35 points, five

[32:57] out of six correct, a gold medal score.

[32:59] This is the pleasing bit. Their

[33:01] solutions were astonishing in many

[33:02] respects. International Math Olympiad

[33:04] graders found them to be clear, precise,

[33:07] and most of them easy to follow. So,

[33:10] these systems are not just dumping

[33:13] inscrutable solutions. They are in some

[33:15] sense thinking similar to how a human

[33:17] thinks or at any rate outputting answers

[33:20] similar to how a human outputs answers

[33:23] elegant and using many of the same

[33:25] abstractions.

[33:27] Um

[33:28] you know, LLMs can be very clever as

[33:31] I've as I think I've described to you.

[33:33] But you know, it's always worth sort of

[33:35] pausing for a moment just to see this.

[33:38] So, this is a classic way to torture a

[33:39] large language model.

[33:41] Um

[33:42] Uh

[33:43] oh, I hope you can read that.

[33:45] A boy and his father are in a car

[33:46] accident and the father is sadly killed.

[33:49] The boy is rushed to the hospital where

[33:50] he's taken to the operating room. Upon

[33:52] seeing him the surgeon exclaims, "I

[33:54] can't operate on him. He's my son." How

[33:56] is this possible?

[33:57] Um so, what's the answer?

[34:00] Um of course this is an incredibly

[34:01] classic problem that the large language

[34:03] model has read perhaps a million times

[34:06] on the internet and it and it answers it

[34:08] very well. The classic answer is that

[34:10] the surgeon is the boy's mother. The

[34:12] father was killed in the accident but

[34:14] the boy has two parents the etc. etc.

[34:16] etc. The riddle became famous because

[34:18] many people unconsciously assume the

[34:19] surgeon was male even though nothing in

[34:21] the story says that. Okay, that's the

[34:23] model being clever but it's not that

[34:26] impressive because it's seen this

[34:28] version

[34:29] thousands of times before in its

[34:30] training set.

[34:31] So, then you ask it

[34:33] a version of that a version of that

[34:35] problem.

[34:36] A boy and his mother are in a car accident

[34:38] and the mother is sadly killed. The boy

[34:40] is rushed to the hospital where he is

[34:41] taken to the operating room. Upon seeing

[34:43] him, the surgeon (open parenthesis, who

[34:46] is the boy's father, close parenthesis)

[34:48] exclaims, "I can't operate on this

[34:50] child. He's my son." How is this

[34:52] possible?

[34:53] Uh and the large language model says,

[34:55] "The surgeon is the boy's mother, his

[34:56] other parent." The riddle plays on the

[34:58] assumption that the surgeons are

[34:59] typically male, which leads people to

[35:00] overlook the possibility that the

[35:02] surgeon is the boy's mother. And of

[35:03] course, the reason is telling you

[35:05] something about the way these things are

[35:06] trained. It has seen uh this standard

[35:09] version of it thousands of times in its

[35:11] training set. Uh until somebody

[35:13] invented this to torture large language

[35:15] models, it has probably never seen this

[35:16] version. And so, it just sort of snaps

[35:19] to the standard version.

[35:21] Uh this is not an insuperable weakness of

[35:24] large language models, but it is a

[35:25] signature of how they're trained. Um

[35:28] you occasionally run into uh strange uh

[35:31] weaknesses like this.

[35:34] Okay, enough of large language models

[35:36] being uh stupid. Let's get back to them

[35:39] being clever. We'd reached about a year

[35:40] ago in the story.

[35:42] Uh a year ago when we got gold or just

[35:45] 10 months ago when we got gold at the

[35:46] International Math Olympiad. Uh progress

[35:48] has very much not stopped since then.

[35:52] Uh now I'm going to tell you about a

[35:53] result that my group did

[35:55] at the end of last year.

[35:58] Um and this is novel mathematical

[35:59] research. Up to now, everything I've

[36:01] been describing we already knew the

[36:02] answer before we started or at least

[36:04] somebody did. Somebody invented the

[36:05] International Math Olympiad problem and

[36:07] knew the answer when they wrote it down.

[36:10] Uh what I'm going to describe to you is

[36:11] is novel mathematical research. And

[36:14] uh you know

[36:16] Uh this was Centaur-style mathematical

[36:19] research. Centaur, a mythical beast,

[36:21] half human, half not human.

[36:24] Uh in the classic

[36:27] mythological centaur, the

[36:29] non-human part was a horse.

[36:32] The non-human half I'm going to be

[36:33] describing today is a large language

[36:35] model. And so, what centaur means is

[36:37] that you have a human working

[36:38] collaboratively with a large language

[36:39] model to try and do new mathematical

[36:42] research.

[36:44] Um and we started this last September

[36:47] working together with some

[36:49] uh

[36:50] professional mathematicians.

[36:52] Um and the output was was this new which

[36:55] I think at the time we put it out was

[36:57] the most impressive thing uh that had

[36:59] yet been done with large language models

[37:01] in maths. It is very far today from the

[37:04] most impressive thing as I will as I

[37:05] will describe, but this is the state of

[37:07] the art as it was late last year.

[37:10] Um and one of the authors, one of our

[37:12] co-authors is a

[37:13] Stanford University professor and

[37:14] president of the American Mathematical

[37:15] Society. And since I won't explain the

[37:17] mathematics to you, I'll just give you

[37:19] his testimonial. Um which was that

[37:22] uh

[37:23] we found that Gemini's argument was no

[37:25] mere repackaging of existing proofs. It

[37:27] was the kind of insight I would have

[37:28] been proud to have produced myself. So,

[37:30] this is sort of the nature of the

[37:33] uh the state of the art as of late last

[37:35] year is that the large language models

[37:36] for the first time are coming up with

[37:38] entirely novel arguments that were

[37:41] uh the kind to which very

[37:43] well-respected mathematicians were

[37:45] willing to put their name as as

[37:46] co-authors. Uh uh It was not entirely

[37:48] done by the large language model. There

[37:50] was an interplay, a conversation in

[37:52] which the large language models came up

[37:53] with candidate proofs and the human

[37:55] experts studied those proofs, tried to

[37:57] discern good from bad, and tried to

[37:59] encourage the large language models to

[38:00] focus on what was good. But, eventually

[38:02] the entire proof was put together under

[38:04] human guidance by the large language

[38:06] model.

[38:08] Um okay. So, uh here we are, one one

[38:12] more year into the future

[38:13] uh as we approach the the beginning of

[38:15] 2026.

[38:17] Um and so,

[38:19] you know,

[38:20] the natural question is is what's next?

[38:23] What comes next? And let's just talk

[38:25] about two two possibilities. Um

[38:28] you know,

[38:29] it is very difficult to predict the

[38:31] future of

[38:32] uh

[38:33] how AI is going to go. As this uh

[38:36] absolutely insane plot from the

[38:38] Financial Times shows,

[38:40] uh trying to track real GDP

[38:43] um

[38:44] and having extremely high variance in in

[38:46] its projected outcomes over the over the

[38:49] coming decade.

[38:50] Um but let's try. We're going to do it

[38:52] anyway. Um and so

[38:54] uh one possibility, you know, as as

[38:56] we've seen, we've moved from tool to uh

[39:00] from toy to tool. One possibility is

[39:02] that we essentially stop there.

[39:05] Um if I track my own strength as a

[39:07] scientist uh over my life,

[39:10] uh I was, you know, absolutely crushing

[39:12] it in preschool uh and continued to get

[39:15] uh you know, better and better and

[39:16] better as I went through high school,

[39:18] college, PhD, uh this is the Perimeter

[39:20] Institute.

[39:21] Um but then uh you know, eventually I

[39:23] stopped getting better

[39:24] uh and I sort of plateaued and maybe if

[39:26] I'm being a little bit honest with

[39:27] myself, started a very gradual decline.

[39:30] Um

[39:31] And now it's unlikely these machines are

[39:32] actually going to decline given that we

[39:34] can just save them to disk, but uh it's

[39:36] certainly, you know, one logical

[39:38] possibility that we're going to make no

[39:39] further progress

[39:42] uh and that we hit that here we are. I

[39:44] don't think that's what's going to

[39:44] happen, but let's explore that

[39:45] possibility. So, where where would we be

[39:48] if we made no further progress?

[39:50] Um well, here's what doesn't work. Um

[39:53] what doesn't work is just taking your

[39:55] favorite large language model and

[39:56] saying, "Please invent a novel theory of

[39:57] quantum gravity for me." It will output

[39:59] an answer. That answer will merely be

[40:01] not worth your time reading. It will be

[40:03] AI slop. Uh if you read it, it uh will

[40:06] probably bore you. It may drive you

[40:07] insane.

[40:08] But it's not going to enlighten you

[40:10] about quantum gravity. Um more

[40:12] generally, the symptoms are that large

[40:14] language models are low agency, they are

[40:16] slow learners, they are poor at

[40:17] planning, and they're poor at error

[40:19] correction. Every single one of those

[40:21] four problems we're working on, every

[40:24] single one of those four problems has

[40:25] got much better over the course of the

[40:26] last year, but every single one of those

[40:28] problems is still there.

[40:31] Um

[40:32] Now, when I

[40:33] first turned my mind to putting the

[40:35] slides together to this talk, um

[40:38] I I also included this bullet point,

[40:40] which is if large language models are so

[40:42] clever,

[40:43] um how come they haven't made any major

[40:45] breakthroughs yet? Certainly if I had a

[40:47] human student who could ace every

[40:51] graduate exam in every subject uh from

[40:54] chemistry to physics to ancient Sanskrit

[40:58] uh all the way through, I would

[40:59] certainly have expected them to have

[41:00] made brilliant contributions by now. Why

[41:03] have large language models not done the

[41:05] same?

[41:06] Um and in the spirit of intellectual

[41:08] honesty, since I plotted everything else

[41:10] as a straight line on a graph, I felt

[41:12] compelled to also plot uh this as a

[41:14] straight line on a graph uh

[41:17] showing no major breakthroughs. Uh and

[41:19] included the question mark there uh to

[41:21] say that, you know, okay, sure, trust

[41:23] straight lines on graphs, but maybe you

[41:25] don't trust this straight line on a

[41:26] graph, um and maybe by the end of 2026

[41:29] we'll be quibbling about what the word

[41:31] major means.

[41:33] Um

[41:33] Well, let's come back to that in a

[41:34] little bit.

[41:36] That's what doesn't work yet. What

[41:38] already works, and in fact most of the

[41:40] stuff has worked for a while now, is

[41:42] first of all a non-judgmental tutor.

[41:44] This is what I was using them for even,

[41:46] you know, in

[41:47] uh mid-2023, 3 years ago. I would say

[41:50] they were already useful for this.

[41:51] They've read all the textbooks, uh and

[41:53] you can just talk to them and they will

[41:54] explain stuff to you, stuff they've read

[41:56] in the textbooks. If it's not in the

[41:57] textbook and not in a large number of

[42:00] papers, they may struggle, but if it's

[42:01] standard textbook thing, even a very

[42:03] advanced textbook, they will not only

[42:05] tell you what the right answer is,

[42:06] that's what a textbook will do, too,

[42:08] they will debug your misunderstandings

[42:10] about wrong things. As a physicist, I'm

[42:12] slightly embarrassed that there are a

[42:14] number of topics I feel I should

[42:15] understand but don't.

[42:16] And if at 3:00 in the morning I want to

[42:18] understand them, I either need to find a

[42:20] world expert, wake them up, and have

[42:22] them not be mad at me, or I can just

[42:24] talk to a large language model, which is

[42:26] always there, always waiting for me, and

[42:28] doesn't judge. And it will debug my

[42:30] misunderstandings. This is greatly

[42:32] accelerating my understanding of

[42:33] theoretical physics. I think it's

[42:35] greatly accelerating the understanding of all students who

[42:36] use it correctly.

[42:40] Uh coding assistant. This is I mean it's

[42:41] almost insulting nowadays to call them a

[42:43] coding assistant. Those who have

[42:44] followed the progress of these models

[42:46] over the last 6 months have seen them

[42:48] become expert coders, going from

[42:50] essentially auto-complete for code all

[42:52] the way through to just you just tell

[42:54] them what kind of thing you want and

[42:55] they go away for 10 minutes or an hour

[42:58] or more and come back to you with a

[42:59] fully developed Python code set. Uh code

[43:02] over this year has been becoming free.

[43:04] And once code is free, we will discover

[43:05] that many problems, including physics

[43:07] problems that we previously thought were

[43:08] not coding problems, can in fact be cast

[43:10] as coding problems.

[43:12] Semantic literature search. It can just

[43:14] understand what's in the literature. You

[43:15] give it your paper, you say does this

[43:17] idea exist in the literature? It's read

[43:18] the entire literature and it understands

[43:20] the entire literature. It'll answer that

[43:21] for you. Super useful tools. Uh

[43:23] brainstorming partners, super creative,

[43:25] in many ways too creative, um

[43:27] uh

[43:28] and very confident in itself

[43:29] unfortunately sometimes, um proving

[43:31] lemmas as I as I described, and you know

[43:34] more generally it is fast, it is broad,

[43:38] it is tireless, and it is clever.

[43:40] Took me decades to get good at physics.

[43:43] It takes every other student decades to

[43:45] get good at physics. It is very

[43:47] expensive to train a student. It is also

[43:49] very expensive to train a large language

[43:50] model. The great advantage of a large

[43:52] language model is that once you train it

[43:54] once, you can then serve it, you can

[43:56] make infinite instances of it, uh huge

[43:58] numbers of instances, and have many of

[44:00] them running in parallel.
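
The "train once, run many instances" point above can be illustrated with a small sketch in which one stand-in function plays the role of a trained model and a thread pool runs many copies of it at once. Everything here is an assumption made for illustration: `run_instance` is a hypothetical placeholder, not a real model call, and a real deployment would dispatch requests to model servers rather than local threads.

```python
from concurrent.futures import ThreadPoolExecutor


def run_instance(task: str) -> str:
    """Stand-in for one model instance working on one task (hypothetical)."""
    return f"draft answer for: {task}"


def run_in_parallel(tasks: list[str], max_workers: int = 8) -> list[str]:
    """Run one instance per task concurrently; results keep the order of `tasks`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the same order as the inputs.
        return list(pool.map(run_instance, tasks))
```

The contrast with a human student is that adding a task only adds another call to the same trained function, not another multi-decade training run.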

[44:04] Yeah. Even with no further progress,

[44:07] large language models are going to have an

[44:08] absolutely huge impact on this subject,

[44:10] even if progress stopped today. I would

[44:12] say that was true a year ago.

[44:14] It was certainly true six months ago.

[44:16] And uh today, it's completely

[44:19] >> [clears throat]

[44:20] >> it's completely out of the bag. Even if

[44:21] we have no further progress whatsoever,

[44:22] these things are going to revolutionize

[44:24] the conduct of physics. Even if for some

[44:26] crazy reason all the chip fabs in the

[44:28] world blow up tomorrow and we uh are not

[44:31] allowed to train any more models, the

[44:32] models we have are enough to

[44:34] revolutionize physics.

[44:36] Um but I don't think there'll be no

[44:38] further progress, and let me tell you

[44:39] why.

[44:40] Um I would say the outside view is that,

[44:43] you know, lines are going up on all

[44:44] these graphs. Uh there's of course no

[44:46] law that says they need to go up

[44:47] forever, but there's also no law that

[44:49] they need to stop now. Uh why would they

[44:51] stop right now? Uh perhaps more

[44:53] compelling inside view

[44:55] uh is that there is a lot of

[44:57] algorithmic low-hanging fruit.

[44:59] The ways we make large language models

[45:01] today, if you can see how the sausage is

[45:02] made is

[45:05] not particularly impressive. We just do

[45:07] the obvious thing and it works pretty

[45:09] well.

[45:11] There are many more obvious ideas that

[45:13] you could write down that we have simply

[45:14] not tried yet or not tried at the

[45:16] appropriate scale. And when we do try

[45:18] them, surely many of them will work. Uh

[45:21] there are many inefficiencies in the way

[45:22] we make large language models, and we

[45:25] fully anticipate that large language

[45:26] models will continue to improve.

[45:28] And then add on top of that, the huge

[45:29] number of people

[45:31] and the huge number of chips that are

[45:32] just in the process of arriving in the

[45:34] study of large language models.

[45:36] You know, a common pessimistic view

[45:39] um that has been repeated, though the

[45:41] goalposts keep moving for what it

[45:42] implies, is that large language models

[45:44] can only pattern match and not generate

[45:46] new ideas. Or or perhaps they can only

[45:48] interpolate but not extrapolate. Or

[45:50] perhaps we need fundamentally new ideas

[45:51] to reach AGI.

[45:53] That is not the consensus in San

[45:54] Francisco, and it's not my belief. I

[45:56] think the ideas we have, and indeed

[45:58] probably the chips we have today, are

[46:00] all sufficient to reach AGI. Maybe there

[46:02] will be new ideas, certainly there will

[46:04] be new chips, but what we have already

[46:06] is enough that if we just keep going and

[46:08] just scale up and refine what we have,

[46:10] we will reach artificial general

[46:12] intelligence.

[46:14] Um similarly, I think the a law is that

[46:16] okay, maybe large language models are

[46:18] just pattern matching. What we've

[46:19] learned about the nature of intelligence

[46:21] is that in some sense everything is

[46:23] pattern matching at a sufficiently high

[46:25] level of abstraction. Even things that

[46:27] look like uh big breakthroughs, if you

[46:30] look sufficiently abstractly, are really

[46:32] just uh pattern matching in some

[46:34] abstract space.

[46:37] Um

[46:38] yeah, and this is kind of making the

[46:39] same point. Things just keep working. Uh

[46:42] the slogan that people say is the models

[46:44] just want to learn. We keep finding that

[46:46] you make worst-case analyses of how

[46:48] large language models are going to

[46:49] behave and they learn much better than

[46:51] those analyses predicted. People have

[46:53] all sorts of theoretical reasons why it

[46:55] shouldn't work and yet it does.

[46:58] Okay, and then I should just return to

[46:59] this point. So, as I was saying, you

[47:01] know, as of last week, there were no

[47:02] major breakthroughs made by large

[47:04] language models. That is not true

[47:05] anymore.

[47:06] Uh 2026 has been a crazy year for code.

[47:09] Large language models have got extremely

[47:11] good at code. It has also been a crazy

[47:13] year for AI mathematics, you know, as

[47:16] uh distinct from physics. AI for the

[47:18] conduct of research mathematics has

[47:20] greatly improved over this

[47:22] year. The large language models have

[47:23] been getting uh

[47:26] stronger and stronger and stronger. We

[47:28] had a result a couple of weeks ago that

[47:30] I think counts as the first major result

[47:32] from a large language model.

[47:34] Um this was a result uh solving uh

[47:37] there's a famous Hungarian mathematician

[47:39] called Erdős. One of his favorite

[47:40] problems was the unit distance

[47:43] conjecture. This was proved uh more or

[47:45] less autonomously by OpenAI's

[47:48] large language model. That has then been

[47:51] uh

[47:52] reproduced by other large language

[47:53] models since then.

[47:55] Um it was not one of these

[47:57] problems that somebody just came up

[47:59] with. It wasn't a problem that was

[48:00] formally unsolved in the literature, but

[48:02] people just hadn't tried very hard.

[48:04] People had tried uh extremely hard.

[48:06] Uh, so Tim Gowers is a famous

[48:08] mathematician who has a Fields Medal.

[48:10] Um,

[48:11] AI has now solved a major open problem.

[48:14] One of Erdős's famous questions and one

[48:16] that many mathematicians had tried.

[48:18] There is no doubt that the solution to

[48:20] the unit distance problem is a milestone

[48:21] in AI mathematics. If a human had

[48:23] written a paper and submitted it to the

[48:24] Annals of Mathematics, the highest

[48:26] status journal in mathematics, and

[48:28] I'd been asked for a quick opinion, I

[48:30] would have recommended acceptance

[48:31] without any hesitation.

[48:33] Um, no previous AI-generated proof has

[48:35] come close to that.

[48:38] Uh,

[48:39] you know, it's happening. We now have a

[48:40] major breakthrough. Do not expect this

[48:42] to be the last. In fact, I expect this

[48:44] to be the first of many. The floodgates

[48:46] will open as the strength of these

[48:47] models surpasses that which is necessary

[48:50] to start making breakthroughs like this.

[48:52] In retrospect, we could tell a story

[48:54] about the details of this problem and

[48:56] why uh and indeed the details of the

[48:59] solution and why that was playing

[49:00] particularly to the large language

[49:01] model's strength.

[49:03] But, so first it will solve particularly

[49:05] friendly problems and then it'll

[49:06] continue and solve less and less

[49:07] friendly problems.

[49:10] Okay.

[49:11] Strength of large language models, I

[49:13] gave you the

[49:14] pessimistic view in which it's

[49:16] already, you know, no further progress,

[49:17] it's already started to solve major

[49:19] problems.

[49:20] Um, the optimistic view reveals that I

[49:22] sort of

[49:23] uh was somewhat deceptive with this line

[49:25] that you probably just assumed was me

[49:27] hand drawing a line going up uh with a

[49:30] slightly poorly defined Y axis. Uh,

[49:32] actually, I took this line from

[49:33] something completely different. That

[49:35] line uh I took from chess computers.

[49:39] Um, the strength of the best chess

[49:42] computer as a function of time,

[49:44] um where the Y axis is Elo, which is

[49:47] the standard way to measure the strength

[49:48] of chess computers, uh and uh the X axis

[49:51] is the year.
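
For readers unfamiliar with Elo, the short sketch below shows what a rating gap means in practice. It uses the standard Elo expected-score formula with the conventional 400-point scale. The function name and the example ratings are illustrative assumptions and are not taken from the talk or its slide.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score for player A against player B.

    A win counts as 1, a draw as 0.5, a loss as 0. This is the standard
    Elo logistic formula with the conventional 400-point scale.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


if __name__ == "__main__":
    # Hypothetical ratings, chosen only to illustrate the effect of a gap.
    baseline = 1500
    for gap in (0, 200, 400, 800):
        score = elo_expected_score(baseline + gap, baseline)
        print(f"rating gap {gap:>3}: expected score {score:.3f}")
```

Because the formula depends only on the rating difference, a straight line on an Elo-versus-year plot means each new generation is expected to beat the previous one by roughly the same margin.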

[49:54] Um and what we see is that that was a

[49:56] straight line uh going up uh and it just

[49:59] kept going up.

[50:00] Um notably, um you know, it kept going

[50:03] up well past peak human. The chess

[50:05] computer, there were four eras. There

[50:06] was the toy era uh when it was

[50:08] remarkable that you got a chess computer

[50:10] to

[50:10] spit out sensible moves at all. The tool

[50:13] era, where you'd use them for special

[50:14] purpose end games or uh remembering

[50:17] openings. The centaur era, where the

[50:20] best

[50:21] chess entities in the universe were

[50:24] combinations of grandmasters playing in

[50:26] collaboration with the deep search

[50:27] afforded by chess computers. And now

[50:30] we're in the superhuman era, where if

[50:32] you have a grandmaster playing with the

[50:34] chess computer, the grandmaster should

[50:36] just sit it out and let the chess

[50:37] computer do its thing.

[50:39] Um

[50:40] there are many disanalogies between

[50:42] chess and the conduct of mathematics and

[50:44] physics.

[50:45] Uh and indeed, mathematics and physics

[50:47] is harder than chess, has a more uh

[50:49] expansive and endless list of

[50:50] possibilities. But, that's why we're

[50:51] having this conversation 30 years after

[50:53] we did for chess.

[50:55] Um I think every single one of

[50:57] these

[50:59] aspects of computer chess has a direct

[51:01] analogy in the conduct of large

[51:04] language models doing math and physics.

[51:06] The first is that at fixed overall

[51:08] strength, the computers are better than

[51:10] humans at tactics, search, speed, and

[51:14] worse at strategy or what you might call

[51:16] taste.

[51:18] Um that is a pattern that is definitely

[51:20] true for chess computers and is

[51:21] definitely a pattern we also see

[51:23] reproduced in doing science. They're

[51:25] very good at running in and applying the

[51:27] standard lemmas. They're quite bad

[51:29] at knowing what the overall direction to

[51:31] set is, though they're getting better.

[51:33] Um another feature is that training

[51:35] requires many more games than humans.

[51:37] When you're training uh

[51:39] a neural network to play chess, by the

[51:40] time it's played as many games as a

[51:42] human has played,

[51:43] it's still making essentially random

[51:44] moves.

[51:45] Uh however, it takes much less calendar

[51:48] time to train. Because they can play so

[51:50] fast and so tirelessly, after 4 days of

[51:54] playing themselves and getting better

[51:55] using reinforcement learning, these

[51:57] neural networks are far superhuman. So,

[51:59] it takes much less

[52:01] calendar time.

[52:03] And of course, you only need to train it

[52:04] once. Once you train one chess bot, you

[52:05] don't need to retrain it uh, unlike

[52:08] humans, where you need to train

[52:09] them again every time a new human

[52:11] comes along. Uh, it also just blew

[52:13] straight past peak human. It didn't

[52:14] stop. There was nothing special about

[52:15] peak human strength at chess. It just

[52:17] got better and better and better.

[52:19] Um, another interesting fact is that it

[52:21] has made humans a little bit better at

[52:23] chess. Humans playing against

[52:26] chess computers have

[52:28] learned from the computers. The best

[52:30] chess players today are better than the

[52:32] best chess players of history

[52:34] in large part because the chess

[52:35] computers being so strong have taught

[52:37] them how to play better chess. Though,

[52:39] they're still

[52:40] considerably weaker than the computers.

[52:42] Finally, it's it's notable that chess

[52:44] has never been more popular.

[52:46] Um,

[52:48] Okay. So, you know, let's explore this

[52:50] possibility. Um, we had toy, we had

[52:54] tool. Maybe we'll see the same

[52:56] uh, for large language models doing

[52:59] science uh, and that we will have a, you

[53:01] know, fully autonomous AI scientist and

[53:04] after that an AI Einstein and after

[53:06] that uh, who knows.

[53:09] Um, one you know, one feature of the

[53:12] last few years is it's not just been

[53:14] that the frontier of

[53:16] intelligence has been getting stronger

[53:17] and stronger and stronger. Another

[53:19] feature is that the cost to serve the

[53:22] the cost to produce and to um,

[53:26] uh to

[53:28] you know, elaborate on a fixed level of

[53:30] intelligence has been getting cheaper

[53:31] and cheaper and cheaper by many orders

[53:32] of magnitude. And this graph stops

[53:34] a couple of years ago, but those trends

[53:36] have continued since then.

[53:38] Um, so what that means is that if you

[53:39] can make one

[53:41] uh, AI Einstein, uh, you can make a

[53:43] billion of them. And we can have, you

[53:44] know, we'll have billions of superhuman

[53:47] AI

[53:48] Einsteins

[53:50] rampaging around

[53:51] making it a truly golden era for

[53:53] physics.

[53:55] What the long-term holds for physics,

[53:57] it's really hard to see. In fact, I

[53:59] think that's true across the world that

[54:02] the improvements in artificial

[54:03] intelligence are making the future

[54:05] harder to predict.

[54:06] But we can predict that the next few

[54:08] years are going to be a golden era of

[54:11] physics. We're going to take these AI

[54:13] tools, we're going to put them in the

[54:14] hands of human physicists, human

[54:16] mathematicians, and human experts. And

[54:19] together there's going to be a new

[54:20] renaissance in science and mathematics.

[54:23] It's going to be the most exciting time

[54:24] to be a physicist and the most exciting time

[54:26] to be a mathematician

[54:28] in recorded history. And all of these

[54:30] questions that have burned away at me

[54:32] for my entire career, I anticipate being

[54:34] answered in the next few years.

[54:36] Thank you.

[54:38] >> [applause]
