---
title: 'Training Sand to Think: Artificial General Intelligence & Future of Physics'
source: 'https://youtube.com/watch?v=Mw60FH5iflI'
video_id: 'Mw60FH5iflI'
date: 2026-06-15
duration_sec: 0
---

# Training Sand to Think: Artificial General Intelligence & Future of Physics

> Source: [Training Sand to Think: Artificial General Intelligence & Future of Physics](https://youtube.com/watch?v=Mw60FH5iflI)

## Summary

The speaker, a theoretical physicist, discusses the transformative impact of large language models (LLMs) on mathematics and physics. He explains how LLMs have evolved from preschool-level performance to surpassing PhD experts in exams, and have recently achieved novel mathematical research, including solving a major open problem. He argues that even without further progress, LLMs will revolutionize physics, and with continued scaling and algorithmic improvements, they will lead to a golden era of scientific discovery.

### Key Points

- **Extraordinary Moment in History** [00:03] — We've figured out how to refine sand into silicon, turn it into chips, assemble them into neural networks, and train them to think.
- **Physicist's Shift to AI** [00:29] — The speaker stopped writing theoretical physics papers to contribute to building machines that produce knowledge on an industrial scale.
- **LLMs as General Intelligence** [01:18] — Large language models are not just special-purpose tools; they can do every part of a theoretical physicist's job, acting as a general intelligence.
- **How LLMs Work** [02:52] — LLMs are neural networks inspired by the human brain, grown rather than programmed, trained by predicting the next word in text.
- **Scale of LLMs** [03:10] — LLMs have grown from about a billion parameters in 2020 to a few trillion today, still short of the 100 trillion synapses in the human brain.
- **Pre-training and Post-training** [05:37] — Pre-training involves predicting the next word on the internet; post-training refines the model to be helpful and polite.
- **Scaling Laws** [07:04] — Physicists discovered scaling laws for LLMs: performance improves predictably with more compute, leading to the scaling era.
- **Scaling Law Graph** [09:01] — A log-log plot shows performance improving along a straight line (a power law) as more compute is spent on training, a key insight for investors.
- **Drivers of Progress** [11:57] — The main driver is algorithmic progress, followed by scaling compute and money; Moore's law is a minor factor.
- **Early LLM Performance** [14:48] — In 2019, LLMs performed at preschool level; on the MATH benchmark, they scored 6% in 2021, while humans scored 40-90%.
- **Rapid Improvement** [18:40] — Prediction markets expected 50% by 2025, but LLMs reached 50% almost immediately and 90% by mid-2024, then near-perfect scores.
- **Techniques for Improvement** [20:44] — Key techniques include scale, better data, chain-of-thought prompting, reinforcement learning for long thinking, and multi-LLM conversations.
- **Graduate-Level Science** [26:11] — On the GPQA benchmark (graduate-level science), LLMs went from random guessing at the start of 2024 to essentially perfect scores over the course of 2024 and 2025.
- **Private Test Set** [29:07] — The speaker's own graduate exams from Stanford were solved with 100% accuracy by LLMs within 18 months.
- **International Math Olympiad** [30:47] — In 2024, LLMs achieved a gold medal score (5/6 problems) at the IMO, with solutions praised as clear and elegant.
- **Novel Mathematical Research** [35:52] — In late 2024, a centaur-style collaboration produced a novel proof that a top mathematician called 'the kind of insight I would have been proud to have produced myself.'
- **First Major Breakthrough** [47:28] — In 2026, an LLM solved the unit distance conjecture, a major open problem, marking the first major AI-generated mathematical breakthrough.
- **Chess Analogy** [50:55] — LLMs in science may follow a similar trajectory to chess computers: toy, tool, centaur, then superhuman, with AI becoming the dominant scientist.
- **Golden Era of Physics** [53:38] — Even without further progress, LLMs will revolutionize physics; with continued improvement, we'll have billions of AI Einsteins, leading to a golden era.

### Conclusion

Large language models have rapidly advanced from preschool-level performance to surpassing PhD experts and making novel research breakthroughs. This progress, driven by scaling and algorithmic improvements, promises a golden era for physics and mathematics, with AI becoming an indispensable collaborator in scientific discovery.

## Transcript

[applause]
>> Okay, thank you. Absolutely delighted to
be here. We live at an extraordinary
moment in our civilization's history.
We have collectively figured out how to
refine sand into silicon, then
take that silicon and turn it into
silicon chips, then assemble those
silicon chips into neural networks, and
now how to train those neural networks
to think.
So, I've written about 40 theoretical
physics papers in my career so far, but
I've stopped. And I've stopped cuz it
felt like too much of a guilty
pleasure to handwrite
theoretical physics papers one by one,
when what I should be doing is
contributing to the
production of a machine that is going to
spew out knowledge on an industrial
scale.
We've of course had for many years now a
uh computer assistance in doing physics,
going back to uh the invention of the
pocket calculator, or perhaps even
further back to the abacus. Uh
This This one is different. Those Those
are special purpose tools that we've
been using for particular parts of the
physics enterprise.
They uh help you as one step, and you
have to do the rest.
What's new is something that we didn't
know at the beginning of the decade, but
those of us who live in San Francisco
certainly think we know now, which is
that we know about the large language
model. And a large language model has
the capability not just to be a special
purpose tool that replaces one part of
of the stack, but in fact do every
single part of my job as a theoretical
physicist. It is a general intelligence,
and we think that large language models
will be the substrate on which we build
these general intelligences.
Uh what I'm going to tell you about
today is using large language models to
do
maths and physics. I'm going to tell you
about the recent past of this process
and the successes we've seen, the
extraordinary progress indeed that we've
seen over the last half decade. I'm
going to tell you where we are today and
I'm going to tell you a little bit about
where I think we're going.
Uh but first of all, uh I should remind
you what a large language model is.
Um I hope uh you by now have have used
one. You can just use one. You can just
go to one of these websites and just
start talking to them and it'll talk
back. Uh it'll talk back in a way that
quietly passed the Turing test a couple
of years ago and nobody nobody really
celebrated it. So, we have Gemini, which
is the one that I contribute to.
Um and also some others, ChatGPT,
Claude, many other options out there, uh
all pushing the frontier of machine
intelligence.
Um at base, a large language model is a
kind of neural network. It is an
artificial
computing device
inspired by the human brain, inspired by
the arrangement of the neurons in the
human brain.
Uh and therefore quite unlike uh
traditional computer programs.
Um
At the beginning of the decade, the
largest
uh large language models had about a
billion parameters. That was considered
extraordinarily large at the time and
they were called large language
models on that basis. Now, we're up to a
few trillion.
This is still short of the 100 trillion
synapses in the human brain, but it
turns out it it suffices.
Um and the one thing you need to know
about neural networks, all neural
networks including large language
models, is that they are not made like
traditional computer programs. They are
grown, not programmed.
What you do is you start off with an
assembly of artificial neurons connected
with artificial synapses
uh with essentially random weights.
And then you ask it to start speaking.
It'll start outputting words one after
the other. And what you'll find is that
those words are complete gibberish.
It'll just be uh, totally random words
at the beginning.
And then you train the neural network.
You you grow, if you like, the neural
network. Grow, you don't change the
number of of neurons, but you change the
neural pathways.
And the way you train them is that you
feed it some text and you encourage it
to predict
given a block of text, maybe
you know, some section of a book
you read on the internet, uh,
predict having seen the first 100 words
what the next word is likely to be. And
as I said, it'll just guess at random to
begin with. But every time it guesses
right, you strengthen that
neural pathway. And every time it
guesses wrong, you punish that neural
pathway. And so slowly over time, you
build up some predictive capability for
it to be able to predict what the next
word is with better and better
accuracy. And once it can predict the
next word, you can
uh, then just take that next word,
assume it's the next word, and then
it'll just just start talking to you.
And that's how the chatbots work that I
described.
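
As a rough illustration of the loop described above (predict the next word, then feed that word back in), here is a minimal sketch. It swaps the neural network for a simple table of word-pair counts, which is an assumption made purely to keep the example tiny; the corpus and the function names are hypothetical.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count which word follows which; the counts stand in for pathway strengths."""
    counts = defaultdict(Counter)
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    """Return the most frequently seen next word, or None if the word is unseen."""
    followers = counts.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

def generate(counts, start, max_words=10):
    """Feed each predicted word back in as the new context, one word at a time."""
    out = [start]
    for _ in range(max_words - 1):
        nxt = predict_next(counts, out[-1])
        if nxt is None:
            break
        out.append(nxt)
    return " ".join(out)

# Toy training text, made up for illustration.
corpus = "the cat sat on the mat . the cat sat on the sofa"
model = train_bigram(corpus)
print(generate(model, "cat", max_words=4))
```

A real language model conditions on far more than the previous word and adjusts weights gradually rather than counting, but the generate loop has the same shape.
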
It's a slow process. Once it's seen
about a million words, you've trained it
on a million words, it's still spewing
stuff that's pretty much
indistinguishable from gibberish. Once
uh, you're up to tens of millions and
hundreds of millions and billions, it
can string together completely coherent
sentences. It knows the the rules of
grammar. It it puts sentences together,
but they're not particularly uh, refined
sentences. And by the time it's read the
entire internet, which is tens of
trillions of words, uh, it can do uh, it
can converse intelligently on on pretty
much any topic.
Uh, that's called pre-training and
that's that's what most of what you do
is just training it to predict the next
word on the internet. There's a a second
stage to the process called
post-training, uh, in which you
essentially send it to finishing school.
When it comes off pre-training, it is
just trained to predict what the next
word is going to be in its
training corpus.
Uh and it is, you know, somewhat uncouth
and uh definitely disobedient. You need
to send it to finishing school called
post-training where you train it to uh
only be polite and you train it to try
and be helpful to the user rather than
just predict what the next word would
be. That's called post-training.
Um and that in brief is how you make a
large language model.
Uh and the train uh the modern large
language models with a few trillion
parameters, it takes a huge amount of
computing power to produce them. It's a
few trillion parameters, a few tens of
trillions of words. You multiply
that together and you get trillions and
trillions of flops required to make
them.
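
The multiplication mentioned above can be sketched as follows. The specific values for "a few trillion" parameters and "a few tens of trillions" of words are illustrative round numbers, each word is treated as one token, and the factor of 6 is a commonly used rule of thumb for training cost; none of these are figures stated in the talk.

```python
def training_flops(n_params, n_tokens, flops_per_param_token=6.0):
    """Rough training cost: parameters x tokens x a constant factor.

    The default factor of 6 is a common rule of thumb and is an assumption
    here, not a number from the talk.
    """
    return flops_per_param_token * n_params * n_tokens

# Illustrative round numbers (assumptions).
n_params = 2e12   # "a few trillion parameters"
n_tokens = 2e13   # "a few tens of trillions of words"

flops = training_flops(n_params, n_tokens)
print(f"{flops:.1e} FLOPs")
```

A trillion times a trillion is already 10^24, so any choice in this range lands in the "trillions and trillions" regime the speaker describes.
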
Um
Okay. So, that's that's large language
models, uh which is what we're going to
be talking about. And we're going to
be specifically talking about them doing
theoretical science.
Um
Before I begin, I should explain, you
know, this sounds like computer
science, how how physicists got
involved. Well, physicists have been
involved in in every step of this
process. But one particular
uh pretty striking way in which they're
involved at the start of the decade that
launched the entire modern LLM boom was
through scaling laws. So, physicists
just uh love scaling laws. That's our
that's our bread and butter.
Um you know, some of the scaling laws
are uh simple. If you double
Alice's height, you'll quadruple her
area and octuple her weight. That when
it's that simple, it's called
dimensional analysis.
But not all scaling laws are that
simple.
So, an uh empirical scaling law that was
discovered uh almost 100 years
ago relates the mass of an animal
to its power output, to its metabolic
rate.
And what you find is what is typical
in these scaling laws: you plot
everything on a log-log plot and you
find that over many, many orders of
magnitude it's a straight line on
a log-log plot, which tells you it's a
polynomial relationship all the way from
the tiny mouse to the mighty
elephant.
Um and this, you know, like many of the
scaling laws discovered by physics,
well, this was first an empirical
discovery
uh
uh in which physicists were not involved
um and it actually has a rather curious
feature, a curious feature you know,
common to many of the the scaling laws
that physicists deal with.
Um and the curious feature is you might
imagine the power output of an animal
should be proportional to its mass, that
every kilogram of of your flesh,
uh you know, burns metabolically at the
same rate, but that's not true.
Actually, the larger you are, the less
each kilogram burns. Uh this was an
empirical discovery first uh by by
Kleiber, only much later understood by
physicists as a consequence of the
fractal dimension of our vascular
system.
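
To make the claim above concrete, here is a short worked estimate. The exponent 3/4 is the value usually quoted for Kleiber's law and is not stated in the talk; the two body masses are round illustrative numbers.

```latex
% Kleiber's law: metabolic power P as a function of body mass M
P \propto M^{3/4}
\quad\Longrightarrow\quad
\frac{P}{M} \propto M^{-1/4}

% Per-kilogram burn rate, elephant (assumed 5000 kg) vs. mouse (assumed 0.02 kg)
\frac{(P/M)_{\text{elephant}}}{(P/M)_{\text{mouse}}}
  = \left(\frac{5000\ \text{kg}}{0.02\ \text{kg}}\right)^{-1/4}
  = \left(2.5 \times 10^{5}\right)^{-1/4}
  \approx \frac{1}{22}
```

Under these assumptions each kilogram of elephant burns roughly twenty times more slowly than each kilogram of mouse, which is the "larger you are, the less each kilogram burns" effect.
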
Um all of this is mainly meant to be
warm-up to the idea that
uh we love scaling laws and so what we
did when we found large language models
was to make scaling laws for them. And
this this scaling law is the most famous
contribution of theoretical physicists
to computer science uh and also started
the modern LLM boom.
And the scaling law says, if you make a
bigger neural network, or more
precisely, if you spend more compute
training a neural network, and you scale
it appropriately in size and training
length,
uh how much better performance do you
get? So, better performance is down and
what was empirically discovered is
that this is linear on a
log-log plot like this. Uh there's no
law of
nature that said it had to be like this,
but empirically, this is what it turns
out to be.
Um discovered by some physicists in
2020. Uh and and this is great. Uh this
plot is so simple that even a venture
capitalist can understand it.
And it told them that if they poured in
compute, as in money, they would get
better performance for some for some
definition of performance, which is
basically accuracy of predicting the
next word on the internet.
Um and this, you know, original scaling
law was was over uh eight orders of
magnitude. We've now extended it eight
further orders of magnitude uh out out
to the right.
Uh and it it it pretty much continues to
hold.
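
A straight line on a log-log plot corresponds to a power law, and its slope is the exponent. The sketch below recovers that exponent with an ordinary least-squares fit in log space. The data are synthetic and the exponent of -0.05 and prefactor of 10 are made up for illustration; they are not the values from the scaling-law plot shown in the talk.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares line through (log x, log y); returns (exponent, prefactor)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, math.exp(intercept)

# Synthetic data (an assumption): loss = 10 * compute^(-0.05),
# sampled across eight orders of magnitude of compute.
compute = [10.0 ** k for k in range(0, 9)]
loss = [10.0 * c ** -0.05 for c in compute]

exponent, prefactor = fit_power_law(compute, loss)
print(f"loss ~ {prefactor:.2f} * compute^({exponent:.3f})")
```

With real measurements the points scatter around the line, and the practical question is how many further orders of magnitude the fitted line keeps predicting correctly.
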
Um
Okay. Large language models get
predictably better with scale.
This led to what's called the scaling
era, where we've been scaling up neural
networks, large language models
furiously ever since.
Uh and
uh and that has characterized the last 6
years.
Um you know, I'm going to show you a lot
of uh straight lines on uh on graphs,
and kind of
invite you to imagine what happens if
those straight lines that really have no
business being straight, but
nevertheless are, and invite you to
imagine what will happen if those lines
continue to be straight for just a
little bit longer. That's going to be
part of my part of my talk. Um
The original straight line on a graph,
of course, was Moore's law. Um Moore's
law says that uh over time, it's, you
know, slightly cheating cuz the x-axis
isn't isn't some uh you know, physical
parameter, it's just date. Uh but over
over a century now, the price of compute
has been dropping
uh exponentially,
making a straight line on a
logarithmic plot.
Um and there's really no reason why why
that should be so, and yet that has been
a a feature of our world for the last
for the last century. I'm mainly showing
this to you to emphasize how little this
has to do with the subject of today's
talk. The en- the entire
era in which today's talk is going to
focus on is just going to be the time
over the last 5 years. That's just uh
the very rightward edge of this.
Compute has really not improved that
much in terms of
uh
in terms of
uh
uh the orders of magnitude we're going
to see. It is only a a third and minor
driver of progress that I'm going to
describe over the last few
years.
Uh a much larger driver
of progress has been merely
that we're willing to take the same
chips and just buy many many more of
them, assemble them all together in a
massive data center, and apply them to
the business of training large language
models. And as you can see, the amount
of flops going to training frontier AI
models has increased by a factor of four
uh every year since 2010.
Um similarly, the amount of money that
we have devoted to training these
uh has similarly been going up and up
and up. It's been going up uh in this
graph at 2.7x per year over over the
last uh decade. It is exponentially
growing amount of resources we are
throwing at the business of training
large language models.
Even that is actually only the second
most important contributor to the
progress in large language models. The
number one most important contributor to
the growth of large language
models and improvements of large
language models over the last decade has
been algorithmic progress. It's been
human ingenuity at figuring out how to
build these machines better than we
previously knew how to build these
machines. Huge amounts of thought has
gone in to sh- shearing away all the
inefficiencies in the way we train them,
to understand these systems better, and
so as to improve them more rapidly.
Um
And then what this plot is meant to
persuade you is that there's no reason
to stop. At least there's no reason to
stop based on uh economics or on chips.
So, a a good rule of thumb, you know,
these things take so many flops, so much
computation to run that a good rule of
thumb, you know, you have to measure
them in Avogadro's numbers, flop
you know, moles worth of flops. So, some
ginormous number. A good rule of thumb
is that an Avogadro flop costs about a
million dollars in today's in today's
money to train.
And as we see since 2020, the size of
these training runs has been growing as
I said exponentially up from about half
a million
in 2020 to about a third of a billion
last year.
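
The rule of thumb above can be turned into a quick conversion between dollars and FLOPs. This is a back-of-the-envelope sketch: the one-million-dollar price per Avogadro's number of FLOPs and the two budgets come from the talk, while the function and variable names are hypothetical.

```python
AVOGADRO = 6.022e23          # FLOPs in one "mole" of flops
DOLLARS_PER_AVOGADRO = 1e6   # the talk's rule of thumb, in today's money

def flops_for_budget(dollars):
    """Training FLOPs that a given budget buys under the rule of thumb."""
    return dollars / DOLLARS_PER_AVOGADRO * AVOGADRO

budget_2020 = 0.5e6          # "about half a million in 2020"
budget_recent = 1e9 / 3      # "about a third of a billion last year"

flops_2020 = flops_for_budget(budget_2020)
flops_recent = flops_for_budget(budget_recent)
growth = budget_recent / budget_2020

print(f"2020:   {flops_2020:.1e} FLOPs")
print(f"recent: {flops_recent:.1e} FLOPs")
print(f"growth in spend: about {growth:.0f}x")
```

Because the price per FLOP is held fixed at today's value, the growth factor here reflects only the increase in spend, not any hardware improvement.
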
Uh
The point really, and you know, what
this on the left
is, is
a graph from people who investigated
this closely, the point is that those numbers are
still pretty small. US GDP is
approaching 30 trillion dollars per
year. Global GDP is even bigger. We have
a very long runway to go before we are
converting most of our GDP into training
runs. We can scale up many, many more
decades and we will but we only will if
it's worth it. No one's going to
give us trillions of dollars to train
ginormous large language models if the
only thing we're doing is getting better
at predicting the next word on the
internet. We need performance. So, what
does this buy us?
Um
So, I'm going to drag us back to ancient
history which is 5 years ago in this in
this world this is just a
absolutely the Stone Age. In fact, if we
go all the way back to 2019, well,
there's different ways to define the
strength of a scientist,
but by pretty much any one of those
ways, if you go back to 2019 the
strength of a large language model was
no better than a than a preschooler. It
really couldn't string together a coherent
sentence, much less combine those
sentences into ideas.
Um and then we measured in those days
the progress by performance on on
benchmarks. Uh and we we still do, but
just the benchmarks have changed as I
will describe.
Uh so a famous and early influential
benchmark was called MATH. It was a high
school math uh benchmark. And uh the
creators of this benchmark, who are
these uh fellows on the right, uh just
went and scraped all sorts of high
school math problems from the internet.
Uh and then gave them to the large
language model and said, "Large language
model, are you able to solve these
problems?" And here's a sampling of
them. Um level one: 11%
of what number is 77?
Uh I think I I could do that one.
Um
uh all the way up to really reasonably
challenging
uh
uh level five problems.
Um and so before they gave them to large
language models, they first gave them to
a human.
Uh we evaluated humans on MATH and found
that a computer science PhD student who
does not especially like mathematics
attained approximately 40%.
While a three-time International Math
Olympiad gold medalist attained 90%,
indicating that MATH can be challenging
even for humans.
Um so so there we are. Uh peak human
about 90%, lazy graduate student who
should be somewhat ashamed of uh
him or herself got about 40%. Uh but
that was still considerably better than
the state of the art exactly 4 years ago
today. Uh the state of the art 4 years
ago today was that large language models
could get
6%. Now of course it just shows what the
difficulty is here.
Computers have been able to calculate
11% of what number is 77 for a very long
time.
Uh the problem is, I mean
there's many problems, but the
problem in those
days was just parsing it. Just, what
does this even mean? Uh
what are these sentences? It's not
constructed as something you type
into a pocket calculator. There's a step
where a human needs to understand what
they're asking and then do it.
And large language models were so bad at
that that they could
barely do better than just random
guessing.
Um so they were pretty bad and at the
time, you know, I'm going to bring you
on this journey, which is a journey of
what it felt like to be working on large
language models 4 years ago and up to
the present day and how
expectations have consistently
been beaten again and again and again.
Um so, you know, you may ask sort of
what did people think was going to
happen? Uh and actually conveniently you
don't have to ask because the people who
made the benchmark also made a
prediction market for how well people
would do on the benchmark in the future.
And this was what the prediction market
said. It said, you know, 6% uh in 2021
uh and then it would slowly increase
uh year after year after year and by
2025 we'd be getting 50%.
And the people who made the benchmark
were just utterly incredulous at this uh
and said, "Forecasters predict more than
50% accuracy by 2025. If I imagine an ML
system getting more than half of these
questions right, I'll be pretty
impressed. This still just seems wild
to me and I'm really curious how the
forecasters are reasoning about this."
I think, uh you know, for a Bay Area
rationalist that is a uh as close as he
comes to doubting the efficient market
hypothesis. He just can't believe this
prediction that it's going to be 50%.
Um and then what we did uh is we got 50%
almost immediately thereafter with the
system we called uh Minerva.
Um and then by mid-2024 we'd made Max
Math, which is a system uh built on large
language models that got 90%. In fact,
beat what's, you know, what was taken to
be peak peak human. Um we were extremely
pleased with ourselves for getting 90%.
We celebrated by going out to a '90s
roller disco to celebrate getting 90 uh
90% and um we were just
unbelievably smug. And then such is the
cruelty of this field that 6 months
later just the off-the-shelf large
language models got it almost perfectly
right. This is, you know, a very
depressing aspect of working in this
field that you work extremely hard and
then the next generation of models come
along and just basically one shot it.
Um
Okay, and that's it. It's dead. The MATH
benchmark is dead. Uh this is the sad
fate of a benchmark in
today's LLM era that it goes in
pretty short order from being way too
hard to be a useful marker of progress
to being way too easy to be a useful
marker of progress.
Um so here we go. We can sort of draw a
little line on this plot here as we
zoomed from preschool to elementary
school to high school uh over the course
of those years. Uh a good rule of thumb
is that we're moving about four times as
fast as a human student moves. For every
year that passes, we advance 4 years
into the future.
Okay. That's, you know, that that's
that. Let's go harder. Well, first of
all, let's just look at the hardest
tranche of these math problems, the
hardest 20% of them.
Um and you can plot the same thing
there as well. The very hardest of these
math problems, the so-called level five
math problems, again back uh just 3
years ago we were really not doing well at
all. It was it was pretty close to
random guessing. Um and over the course
of 2 and 1/2 years went from not much
better than random guessing to
essentially saturated. Uh the benchmark
is now dead.
Okay. Maybe I'm going to tell you then a
little bit about some of the tools that
we use to uh
the techniques we use to make these
systems better at math and reasoning. Uh
this is just a a a snapshot really of
them um and there's new tools and new
ways being developed all the time, but
I'll give you I'll give you a an idea of
uh in particular what what some of the
things that go into it. The main reason
I'm doing this is just to convince you
that it's not
that impressive. Like a lot of the
things that we do here are just kind of
the obvious thing to do. Somebody tried
them, it turned out they worked, and we
started to to do it.
And therefore, hopefully, to give you
a belief that we will continue to find
lots of low-hanging fruit for how to
make these models better, and convince
you that these models are going to
continue to get better in the near
future. So, the biggest and most
you know, biggest reason these models
are getting better is is what's
sometimes called the bitter lesson. It's
scale. You just scale these systems up.
Um you take a bigger neural network, or
you take the same-size neural network,
and train it for longer.
And you find a way to pour more compute
into training these neural networks.
This was called the bitter
lesson by Rich Sutton, who is a famous
Canadian computer scientist. And it's
bitter
not obviously if you're a large language
model, cuz it's it's great. You you get
stronger. It's bitter if you're a human
who
really likes to design very clever
systems to do things. You built some
super clever way, like we did with with
our Max Math result, to
eke over some particular result, and
then all of your human cleverness is
just washed away next time you scale up
the model, and the model figures out all
your clever tricks for itself. And you
might as well have just just worked on
scaling up the model. This is a big This
is a big recurring theme that each new
generation of model is better than even
the special purpose models of the
previous generation.
Again, more and better data. Here's one,
just real low-hanging fruit. What's
called chain of thought, or asking
nicely. And what you do is instead of
asking the question,
you ask the question, and then before
you press enter to to have to send it to
the chatbot for the chatbot to answer,
you say,
"But uh please be careful and think
step-by-step."
Uh and that sounds just totally insane
that that would improve uh performance
of the model. And certainly, if you, you
know, grew up using conventional
computer programs, uh you don't just say
to Mathematica or your pocket
calculator, "Please be careful before
pressing enter." Uh or you can if
you like, but it will not improve the
performance. For these large language
models, they're a very alien kind of
intelligence from the traditional
programmed computer programs of uh my
youth. Uh they are ones with which you
can converse, you can plead, and if you
tell them to think step-by-step, they
will think step-by-step
and they will perform better.
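
The "asking nicely" trick described above amounts to appending one sentence to the prompt before sending it. A minimal sketch follows; the helper name is hypothetical, and how the resulting prompt is sent to a model is deliberately left out because it depends on the chatbot being used.

```python
def with_step_by_step(question, suffix="Please be careful and think step-by-step."):
    """Append a chain-of-thought instruction to the user's question."""
    return f"{question.strip()}\n\n{suffix}"

# Example question taken from the benchmark sample quoted earlier in the talk.
prompt = with_step_by_step("11% of what number is 77?")
print(prompt)
```

Keeping the suffix as a parameter makes it easy to compare different phrasings, which is the kind of experiment the speaker goes on to describe.
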
Um just as an anecdote, of course,
people then soon iterated over every
possible thing you could tell it uh to
do before it
started. And um "Think step-by-step" was
found to be the best. The one that was
found to be the worst uh was in fact uh
"Come on, kid, you can do it. Don't
think, just do."
Uh it will in fact uh degrade performance
by about 20 percentage points on the
question it's about to
uh attempt.
Okay, another one. Thinking for a long
time. This was uh I mean, that
sounds obvious, but uh we needed
to carefully train these things with
reinforcement learning, not just to
blurt out the answer.
Asking them to think step-by-step will
make them think for dozens of words
rather than just blurting out their best
guess. But, we then needed to carefully
train them to think for thousands of
words before uh putting out their
answer. If you remember in late 2024,
there was
uh a model called Strawberry that
massively improved performance, and then
everybody else caught up pretty quickly.
Uh that was exactly this, training these
models to think for a very long time.
Um reinforcement learning, where
uh well, I I I didn't go into that, but
you you you train them to
um yeah, you train them to do what you
want them to do and to try and be more
accurate. Um, and nowadays, over the
last year, a big technique has been
conversations between multiple LLMs. If
you
if you ever used a large language model,
uh, and you're having it solve
a long and difficult problem, sometimes
you find you need to babysit it. You
just need to say, "Okay, that's your
best guess so far. Can you review your
guess and just keep going and try
again?"
Uh, so people automate that. They have
large language models babysit large
language models, uh, where it just keeps
saying, "Keep trying. Keep trying. Keep
trying." Or maybe, uh, beyond that, you
then get more sophisticated and you have
a a whole
uh, conversation amongst a group of
large language models, all of which have
different roles. One is there to be
creative. One is there to come up with a
master plan. One is there to take the
others' ideas and try and integrate it.
One is there to be a skeptic who pushes
back on what people are saying.
This is also found to greatly
improve the performance of large
language models: to spend more
compute at test time in order to improve
performance.
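
The "babysitting" loop described above can be sketched with two roles, a solver and a skeptic. Everything here is hypothetical: the `Model` interface (prompt in, text out), the prompt wording, and the toy stand-in functions that let the sketch run without any real language model.

```python
from typing import Callable

Model = Callable[[str], str]  # hypothetical interface: prompt in, text out

def babysit(solver: Model, reviewer: Model, problem: str, max_rounds: int = 3) -> str:
    """Let one model attempt the problem while another pushes back until satisfied."""
    attempt = solver(problem)
    for _ in range(max_rounds):
        feedback = reviewer(
            f"Problem: {problem}\nAttempt: {attempt}\nReply OK or give a critique."
        )
        if feedback.strip() == "OK":
            break
        attempt = solver(
            f"Problem: {problem}\nPrevious attempt: {attempt}\n"
            f"Critique: {feedback}\nTry again."
        )
    return attempt

# Toy stand-ins (assumptions) so the loop can be exercised end to end.
def toy_solver(prompt: str) -> str:
    return "700" if "Critique" in prompt else "7"

def toy_reviewer(prompt: str) -> str:
    if "Attempt: 700" in prompt:
        return "OK"
    return "Check: 11% of your answer should be 77."

answer = babysit(toy_solver, toy_reviewer, "11% of what number is 77?")
print(answer)
```

Each extra round costs another pair of model calls, which is the sense in which this approach spends more compute at test time.
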
Okay, that's some of the That's some of
the ideas that we've been using. Uh, and
there are there are many others that I
could could describe.
Um,
I talked about high school maths. Now,
let's talk about graduate science. This
is a considerably
trickier
benchmark. Uh, a benchmark made later
and solved later.
Um, GPQA it's called. This is meant to
be uh, imitating the kind of problems
you would face as a first-year graduate
student working towards your PhD. You
take some exams at the end of your first
year to ensure that you have mastery of
your subject.
Um,
PhD-level experts scored about 70%.
Um, here's here's some example of the
problems. Uh, we're not in high school
maths world anymore. Uh, the universe is
filled with cosmic microwave background
and then it asks you some problem. The
idea is that if you were in an adjacent
field, you don't know how to answer
that. Now, I actually happen to be a
physicist, so I do, you know, given a
few quiet moments, maybe not on this
stage, but in the in the green room, um
I could have told you that this was the
answer. But, you show me the chemistry
version of this problem,
uh and I have absolutely no idea. Um
arrow appears.
Um
and uh yeah, GPQA uh a much harder
benchmark. Um and uh correspondingly,
this graph is shifted about a year
compared to the the math benchmark I was
describing before. We were essentially
random guessing until about the
beginning of 2024.
Uh and then over the course of 2024 and
2025, we went from random guessing past
expert human level, and now they achieve
essentially perfect score.
I mean, that's it. GPQA is dead. It has
once again suffered the fate of all
benchmarks,
uh and it is no longer useful cuz it is
too easy.
Now, you might be skeptical of these
results. You might think
um
okay, they can answer these questions
correctly, but they can answer these
questions correctly not because they
have learned how to do maths or learned
how to do science. You might think that
the reason that they have learned that
they can answer these questions
correctly is they have simply memorized
the answer to these questions. These
questions are on the internet, the
answers are on the internet, and
they've read the entire
internet, and they've memorized all the
answers.
We do not believe that that is what is
happening. In fact, there is good
evidence that that is not what is
happening. The main way you test that is
you make look-alike problems. You make
problems that are like the ones, you
know, seemingly drawn from the same
distribution as the ones in the math
data set or the GPQA data set, but are
not in the GPQA data set, and you see
how well they do on those. You make new
problems. You give those new problems to
the large language models and for
reputable large language models, you see
little difference in performance between
how they do on the established test set
and how they do on this held out test
set. So, we really think that these
systems really are learning how to do
maths and physics.
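
The look-alike test described above can be sketched in a few lines of Python. This is only an illustration of the comparison, not the procedure any lab actually uses: the function names are hypothetical and the graded outcomes are made up. It compares accuracy on the established test set with accuracy on the held-out set and reports the gap alongside a rough standard error, so that "little difference" has something to be measured against.

```python
from math import sqrt


def accuracy(results):
    """Fraction of correct answers in a list of booleans."""
    return sum(results) / len(results)


def contamination_gap(public_results, heldout_results):
    """Return (gap, standard_error) for public minus held-out accuracy.

    A gap that is large compared with its standard error would hint at
    memorization of the public set; a small gap is what you would expect
    if the model has learned the skill itself.
    """
    p_pub = accuracy(public_results)
    p_held = accuracy(heldout_results)
    se = sqrt(
        p_pub * (1 - p_pub) / len(public_results)
        + p_held * (1 - p_held) / len(heldout_results)
    )
    return p_pub - p_held, se


# Made-up grading outcomes, for illustration only (not real benchmark data).
public = [True] * 90 + [False] * 10
heldout = [True] * 88 + [False] * 12

gap, se = contamination_gap(public, heldout)
print(f"gap = {gap:.3f}, standard error = {se:.3f}")
```

With these invented numbers the gap is well inside two standard errors, which is the "little difference" pattern the talk describes.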
Um but just to be sure,
um I made my own private test set.
Uh exams that I'd given my class about
general relativity or quantum mechanics,
graduate exams in graduate classes at
Stanford.
Um never on the internet. I would say
they're pretty pretty easy-ish for
first-year graduate exams.
Um and I hand graded them. So, you know,
you don't want to be concerned that
there's some problem with your
computer grading system. I just hand
graded the performance of all these
models.
Uh and what I found is that from late
2023,
over the following 18 months, these
models got 100% accuracy.
Uh and that's it. My benchmark, sadly,
dead.
Um okay. So, you know, here we can
plot it forward a few more
years, as they've, at least as far
as exam taking goes, accelerated from
preschool, past elementary school, high
school, college, and are now operating at
the PhD level.
Uh and then it became very popular to
just put out benchmarks.
Um you know, this one I think was called
Humanity's Last Stand before they changed
it to Humanity's Last Exam, but uh it's
a great it's a great uh
you know, activity trying to
uh measure the performance of these
large language models and how they do.
But every single one of them suffers
the same fate. It goes from too hard to be
interesting to too easy to be
interesting over the course of a year
and a half or 2 years.
Um
The next
The next thing to fall was the
International Maths Olympiad.
Um I was actually giving a version of
this talk uh in New York just over a
year ago, and a a famous computer
scientist who'd won a Turing Award told
me that this was all very well, but this
was just memorization and retrieval.
A large language model would never do
something creative like be able to solve
an International Maths Olympiad problem
it had not seen before.
Um this is just over a year ago. Uh
International Maths Olympiad, if you
don't know what it is, is a Well, it is
high school maths, but it is the hardest
high school maths in the known universe.
It is uh Here are some of the problems
on it. This is problem three from this
year. Um I have some game, I would say,
and I have no idea how to begin
answering that question.
Um and uh you know, the the smartest
18-year-olds in the world go and uh
train for a year or two, uh and then go
to compete in this competition, and
they're given six problems. Um This is
you know, it's different from the other
ones because it requires, you know,
considered to require real creativity to
solve them. It's not You don't just
There's no way to just look up the
answer, or even just to, you know, to be
to follow an established algorithm. It
requires real creativity.
Um That's why, you know, we were told
that the International Maths Olympiad
was was a threshold that large language
models would never pass.
Um And and then last summer, we we
passed it. In fact, we got five of the
six problems exactly correct. Um We got
a gold medal in the International Maths
Olympiad. Um
And you know, there are only a very
small number of humans in the world now
who are better uh than the
large language models at doing this. There were a
very limited number of humans who got
six out of six uh correct.
Um And and there's a sort of pleasing
aspect to this as well, which is it's
not just answering these
questions by dumping some inscrutable
billion line
tangle of formal mathematics that
gives you no idea why it's correct.
Here's what the president of the
International Math Olympiad had to say.
We can confirm that Google DeepMind has
reached the much-desired milestone,
earning 35 points, five
out of six correct, a gold medal score.
This is the pleasing bit. Their
solutions were astonishing in many
respects. International Math Olympiad
graders found them to be clear precise
and most of them easy to follow. So,
these systems are not just dumping
inscrutable solutions. They are in some
sense thinking similar to how a human
thinks or at any rate outputting answers
similar to how a human outputs answers
elegant and using many of the same
abstractions.
Um
you know, LLMs can be very clever as
I've as I think I've described to you.
But you know, it's always worth sort of
pausing for a moment just to see this.
So, this is a classic way to torture a
large language model.
Um
Uh
oh, I hope you can read that.
A boy and his father are in a car
accident and the father is sadly killed.
The boy is rushed to the hospital where
he's taken to the operating room. Upon
seeing him the surgeon exclaims, "I
can't operate on him. He's my son." How
is this possible?
Um so, what's the answer?
Um of course this is an incredibly
classic problem that the large language
model has read perhaps a million times
on the internet and it and it answers it
very well. The classic answer is that
the surgeon is the boy's mother. The
father was killed in the accident but
the boy has two parents the etc. etc.
etc. The riddle became famous because
many people unconsciously assume the
surgeon was male even though nothing in
the story says that. Okay, that's the
model being clever but it's not that
impressive because it has seen this
version
thousands of times before in its
training set.
So, then you ask it
a version of that a version of that
problem.
A boy and his mother are in a car accident
and the mother is sadly killed. The boy
is rushed to the hospital where he is
taken to the operating room. Upon seeing
him, the surgeon (open parenthesis, who
is the boy's father, close parenthesis)
exclaims, "I can't operate on this
child. He's my son." How is this
possible?
Uh and the large language model says,
"The surgeon is the boy's mother, his
other parent." The riddle plays on the
assumption that the surgeons are
typically male, which leads people to
overlook the possibility that the
surgeon is the boy's mother. And of
course, the reason is telling you
something about the way these things are
trained. It has seen uh this standard
version of it thousands of times in its
training set. Uh until somebody
invented this to torture large language
models, it has probably never seen this
version. And so, it just sort of snaps
to the standard version.
Uh this is not an insuperable weakness of
large language models, but it is a
signature of how they're trained. Um
you occasionally run into uh strange uh
weaknesses like this.
Okay, enough of large language models
being uh stupid. Let's get back to them
being clever. We'd reached about a year
ago in the story.
Uh a year ago when we got gold or just
10 months ago when we got gold at the
International Math Olympiad. Uh progress
has very much not stopped since then.
Uh now I'm going to tell you about a
result that my group did
at the end of last year.
Um and this is novel mathematical
research. Up to now, everything I've
been describing we already knew the
answer before we started or at least
somebody did. Somebody invented the
International Math Olympiad problem and
knew the answer when they wrote it down.
Uh what I'm going to describe to you is
is novel mathematical research. And
uh you know
Uh this was Centaur-style mathematical
research. Centaur, a mythical beast,
half human, half not human.
Uh in the classic
mythological centaur, the
non-human part was a horse.
The non-human half I'm going to be
describing today is a large language
model. And so, what centaur means is
that you have a human working
collaboratively with a large language
model to try and do new mathematical
research.
Um and we started this last September
working together with some
uh
professional mathematicians.
Um and the output was this, which
I think at the time we put it out was
the most impressive thing uh that had
yet been done with large language models
in maths. It is very far today from the
most impressive thing as I will as I
will describe, but this is the state of
the art as it was late last year.
Um and one of the authors, one of our
co-authors is a
Stanford University professor and
president of the American Mathematical
Society. And since I won't explain the
mathematics to you, I'll just give you
his testimonial. Um which was that
uh
we found that Gemini's argument was no
mere repackaging of existing proofs. It
was the kind of insight I would have
been proud to have produced myself. So,
this is sort of the nature of
uh the state of the art as of late last
year: the large language models
for the first time are coming up with
entirely novel arguments that were
uh the kind to which
very well-respected mathematicians were
willing to put their name as
co-authors. Uh it was not entirely
done by the large language model. There
was an interplay, a conversation in
which the large language models came up
with candidate proofs and the human
experts studied those proofs, tried to
discern good from bad, and tried to
encourage the large language models to
focus on what was good. But, eventually
the entire proof was put together under
human guidance by the large language
model.
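
The interplay described here, where the model proposes candidate proofs and the human experts sort good from bad and steer it, has the shape of a simple propose-and-review loop. The sketch below is only a schematic of that shape, written under loud assumptions: `centaur_loop`, `toy_propose`, and `toy_review` are hypothetical names, and the toy stand-ins are not a real model or a real referee.

```python
def centaur_loop(propose, review, max_rounds=5):
    """Alternate model proposals and human review until a proof is accepted.

    propose(feedback) -> candidate proof text (stands in for the model).
    review(candidate) -> (accepted, feedback) (stands in for the human expert).
    Returns (accepted_candidate_or_None, rounds_used).
    """
    feedback = ""
    for round_number in range(1, max_rounds + 1):
        candidate = propose(feedback)
        accepted, feedback = review(candidate)
        if accepted:
            return candidate, round_number
    return None, max_rounds


# Toy stand-ins so the sketch runs; both are purely hypothetical.
def toy_propose(feedback):
    # The "model" changes approach only when the feedback points it there.
    if "lemma B" in feedback:
        return "proof using lemma B"
    return "proof using lemma A"


def toy_review(candidate):
    # The "human" rejects the first approach and suggests a direction.
    if "lemma B" in candidate:
        return True, ""
    return False, "lemma A fails here; try lemma B"


result, rounds = centaur_loop(toy_propose, toy_review)
print(result, rounds)
```

The design point is that the human never writes the proof in this loop; the human's only output is accept-or-redirect feedback, which matches the division of labor described in the talk.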
Um okay. So, uh here we are, one one
more year into the future
uh as we approach the the beginning of
2026.
Um and so,
you know,
the natural question is is what's next?
What comes next? And let's just talk
about two two possibilities. Um
you know,
it is very difficult to predict the
future of
uh
how AI is going to go. As this uh
absolutely insane plot from the
Financial Times shows,
uh trying to track real GDP
um
and having extremely high variance in in
its projected outcomes over the over the
coming decade.
Um but let's try. We're going to do it
anyway. Um and so
uh one possibility, you know, as as
we've seen, we've moved from tool to uh
from toy to tool. One possibility is
that we essentially stop there.
Um if I track my own strength as a
scientist uh over my life,
uh I was, you know, absolutely crushing
it in preschool uh and continued to get
uh you know, better and better and
better as I went through high school,
college, PhD, uh this is the Perimeter
Institute.
Um but then uh you know, eventually I
stopped getting better
uh and I sort of plateaued and maybe if
I'm being a little bit honest with
myself, started a very gradual decline.
Um
And now it's unlikely these machines are
actually going to decline given that we
can just save them to disk, but uh it's
certainly, you know, one logical
possibility that we're going to make no
further progress
uh and that we hit that here we are. I
don't think that's what's going to
happen, but let's explore that
possibility. So, where where would we be
if we made no further progress?
Um well, here's what doesn't work. Um
what doesn't work is just taking your
favorite large language model and
saying, "Please invent a novel theory of
quantum gravity for me." It will output
an answer. That answer will merely be
not worth your time reading. It will be
AI slop. Uh if you read it, it uh will
probably bore you. It may drive you
insane.
But it's not going to enlighten you
about quantum gravity. Um more
generally, the symptoms are that large
language models are low agency, they are
slow learners, they are poor at
planning, and they're poor at error
correction. Every single one of those
four problems we're working on, every
single one of those four problems has
got much better over the course of the
last year, but every single one of those
problems is still there.
Um
Now, when I
first turned my mind to putting the
slides together for this talk, um
I I also included this bullet point,
which is if large language models are so
clever,
um how come they haven't made any major
breakthroughs yet? Certainly if I had a
human student who could ace every
graduate exam in every subject uh from
chemistry to physics to ancient Sanskrit
uh all the way through, I would
certainly have expected them to have
made brilliant contributions by now. Why
have large language models not done the
same?
Um and in the spirit of intellectual
honesty, since I plotted everything else
as a straight line on a graph, I felt
compelled to also plot uh this as a
straight line on a graph uh
showing no major breakthroughs. Uh and
included the question mark there uh to
say that, you know, okay, sure, trust
straight lines on graphs, but maybe you
don't trust this straight line on a
graph, um and maybe by the end of 2026
we'll be quibbling about what the word
major means.
Um
Well, let's come back to that in a
little bit.
That's what doesn't work yet. What
already works, and in fact most of the
stuff has worked for a while now, is
first of all a non-judgmental tutor.
This is what I was using them for even,
you know, in
uh mid-2023, 3 years ago. I would say
they were already useful for this.
They've read all the textbooks, uh and
you can just talk to them and they will
explain stuff to you, stuff they've read
in the textbooks. If it's not in the
textbook and not in a large number of
papers, they may struggle, but if it's
standard textbook thing, even a very
advanced textbook, they will not only
tell you what the right answer is,
that's what a textbook will do, too,
they will debug your misunderstandings
about wrong things. As a physicist, I'm
slightly embarrassed that there are a
number of topics I feel I should
understand but don't.
And if at 3:00 in the morning I want to
understand them, I either need to find a
world expert, wake them up, and have
them not be mad at me, or I can just
talk to a large language model, which is
always there, always waiting for me, and
doesn't judge. And it will debug my
misunderstandings. This is greatly
accelerating my understanding of
theoretical physics. I think it's
greatly accelerating the understanding of
all students who use it correctly.
Uh coding assistant. This is I mean it's
almost insulting nowadays to call them a
coding assistant. Those who have
followed the progress of these models
over the last 6 months have seen them
become expert coders, going from
essentially auto-complete for your code all
the way through to you just tell
them what kind of thing you want and
they go away for 10 minutes or an hour
or more and come back to you with a
fully developed Python code set. Uh code
over this year has been becoming free.
And once code is free, we will discover
that many problems, including physics
problems that we previously thought were
not coding problems, can in fact be cast
as coding problems.
Semantic literature search. It can just
understand what's in the literature. You
give it your paper, you say does this
idea exist in the literature? It's read
the entire literature and it understands
the entire literature. It'll answer that
for you. Super useful tools. Uh
brainstorming partners, super creative,
in many ways too creative, um
uh
and very confident in itself
unfortunately sometimes, um proving
lemmas as I as I described, and you know
more generally it is fast, it is broad,
it is tireless, and it is clever.
Took me decades to get good at physics.
It takes every other student decades to
get good at physics. It is very
expensive to train a student. It is also
very expensive to train a large language
model. The great advantage of a large
language model is that once you train it
once, you can then serve it, you can
make infinite instances of it, uh huge
numbers of instances, and have many of
them running in parallel.
Yeah. Even with no further progress,
large language models are going to have an
absolutely huge impact on this subject,
even if progress stopped today. I would
say that was true a year ago.
It was certainly true six months ago.
And uh today, it's completely
out of the bag. Even if
we have no further progress whatsoever,
these things are going to revolutionize
the conduct of physics. Even if for some
crazy reason all the chip fabs in the
world blow up tomorrow and we uh are not
allowed to train any more models, the
models we have are enough to
revolutionize physics.
Um but I don't think there'll be no
further progress, and let me tell you
why.
Um I would say the outside view is that,
you know, lines are going up on all
these graphs. Uh there's of course no
law that says they need to go up
forever, but there's also no law that
they need to stop now. Uh why would they
stop right now? Uh perhaps the more
compelling inside view
uh is that there is a lot of
algorithmic low-hanging fruit.
The ways we make large language models
today, if you can see how the sausage is
made is
not particularly impressive. We just do
the obvious thing and it works pretty
well.
There are many more obvious ideas that
you could write down that we have simply
not tried yet or not tried at the
appropriate scale. And when we do try
them, surely many of them will work. Uh
there are many inefficiencies in the way
we make large language models, and we
fully anticipate that large language
models will continue to improve.
And then add on top of that, the huge
number of people
and the huge number of chips that are
just in the process of arriving in the
study of large language models.
You know, a common pessimistic view
um that has been repeated, though the
goalposts keep moving for what it
implies, is that large language models
can only pattern match and not generate
new ideas. Or or perhaps they can only
interpolate but not extrapolate. Or
perhaps we need fundamentally new ideas
to reach AGI.
That is not the consensus in San
Francisco, and it's not my belief. I
think the ideas we have, and indeed
probably the chips we have today, are
all sufficient to reach AGI. Maybe there
will be new ideas, certainly there will
be new chips, but what we have already
is enough that if we just keep going and
just scale up and refine what we have,
we will reach artificial general
intelligence.
Um similarly, I think the answer is that
okay, maybe large language models are
just pattern matching. What we've
learned about the nature of intelligence
is that in some sense everything is
pattern matching at a sufficiently high
level of abstraction. Even things that
look like uh big breakthroughs, if you
look sufficiently abstractly, are really
just uh pattern matching in some
abstract space.
Um
yeah, and this is kind of making the
same point. Things just keep working. Uh
the slogan that people say is the models
just want to learn. We keep finding that
you make worst-case analyses of how
large language models are going to
behave and they learn much better than
the analyses said they would. People have
all sorts of theoretical reasons why it
shouldn't work and yet it does.
Okay, and then I should just return to
this point. So, as I was saying, you
know, as of last week, there were no
major breakthroughs made by large
language models. That is not true
anymore.
Uh 2026 has been a crazy year for code.
Large language models have got extremely
good at code. It has also been a crazy
year for AI mathematics. AI for the
conduct of research mathematics, you know,
uh distinct from physics, has
greatly improved over this
year. The large language models have
been getting
stronger and stronger and stronger. We
had a result a couple of weeks ago that
I think counts as the first major result
from a large language model.
Um this was a result uh solving uh
there's a famous Hungarian mathematician
called Erdős. One of his favorite
problems was the unit distance
conjecture. This was proved uh more or
less autonomously by OpenAI's
large language model. That has then been
uh
reproduced by other large language
models since then.
Um it was a it was not one of these
problems that somebody just came up
with. It wasn't a problem that was
formally unsolved in the literature, but
people just haven't tried very hard.
People had tried uh extremely hard.
Uh, so Tim Gowers is a famous
mathematician who has a Fields Medal.
Um,
AI has now solved a major open problem.
One of Erdős's famous questions and one
that many mathematicians had tried.
There is no doubt that the solution to
the unit distance problem is a milestone
in AI mathematics. If a human had
written a paper and submitted it to the
Annals of Mathematics, the highest
status journal in mathematics, and and
I'd been asked for a quick opinion, I
would have recommended acceptance
without any hesitation.
Um, no previous AI-generated proof has
come close to that.
Uh,
you know, it's happening. We now have a
major breakthrough. Do not expect this
to be the last. In fact, I expect this
to be the first of many. The floodgates
will open as the strength of these
models surpasses that which is necessary
to start making breakthroughs like this.
In retrospect, we could tell a story
about the details of this problem and
why uh and indeed the details of the
solution and why that was playing
particularly to the large language
model's strength.
But, so first it will solve particularly
friendly problems and then it'll
continue and solve less and less
friendly problems.
Okay.
Strength of large language models, I
gave you the
the pessimistic view in which it it's
already, you know, no further progress,
it's already started to solve major
problems.
Um, the optimistic view reveals that I
sort of
uh was somewhat deceptive with this line
that you probably just assumed was me
hand drawing a line going up uh with a
slightly poorly defined Y axis. Uh,
actually, I took this line from
something completely different. That
line uh I took from chess computers.
Um, the strength of the best chess
computer as a function of time,
um where the Y axis is Elo, which is
the standard way to measure the strength
of chess computers, uh and uh the X axis
is the the year.
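
For readers who have not met Elo before, here is a minimal sketch of what a rating gap means. The formula is the standard Elo expected-score formula; the function name and the example ratings are just illustrative choices and do not come from the talk's slide.

```python
def elo_expected_score(rating_a, rating_b):
    """Expected score of player A against player B under the Elo model.

    The score counts a win as 1, a draw as 0.5, and a loss as 0.
    Only the rating difference matters, on a scale of 400 points.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


# Illustrative ratings only: equal players, then a 400-point gap.
even = elo_expected_score(1500, 1500)
favorite = elo_expected_score(2800, 2400)
print(f"equal ratings: {even:.2f}, 400 points ahead: {favorite:.2f}")
```

Equal ratings give an expected score of one half, and every extra 400 points multiplies the odds by ten, so a straight line on an Elo plot means the odds against a fixed opponent are growing exponentially with time.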
Um and what we see is that that was a
straight line uh going up uh and it just
kept going up.
Um notably, um you know, it kept going
up well past past peak human. The chess
computer, there were four eras. There
was the toy era uh when it was
remarkable that you got a chess computer
to
spit out sensible moves at all. The tool
era, where you'd use them for special
purpose end games or uh remembering
openings. The centaur era, where the
best
chess entities in the universe were
combinations of grandmasters playing in
collaboration with the deep search
afforded by chess computers. And now
we're in the superhuman era, where if
you have a grandmaster playing with the
chess computer, the grandmaster should
just sit it out and let the chess
computer do its thing.
Um
there are many disanalogies between
chess and the conduct of mathematics and
physics.
Uh and indeed, mathematics and physics
is harder than than chess, has a more uh
expansive and endless list of
possibilities. But, that's why we're
having this conversation 30 years after
we did for chess.
Um I think every single one of these
aspects of computer chess has a direct
analogy in the conduct of large
language models doing math and physics.
The first is that at fixed overall
strength, the computers are better than
humans at tactics, search, speed, and
worse at strategy or what you might call
taste.
Um that is a pattern that is definitely
true for for chess computers and is
definitely a pattern we also see
reproduced in doing science. They're
very good at running in and applying the
the standard lemmas. They're quite bad
at knowing what the overall direction to
set is, though they're getting better.
Um another feature is that training
requires many more games than humans.
When you're training uh
a neural network to play chess, by the
time it's played as many games as a
human has played,
it's still making essentially random
moves.
Uh however, it takes much less calendar
time to train. Because they can play so
fast and so tirelessly, after 4 days of
playing themselves and getting better
using reinforcement learning, these
neural networks are far superhuman. So,
it takes much less
calendar time.
And of course, you only need to train it
once. Once you train one chess bot, you
don't need to retrain it uh, unlike
humans, where you need to train
them again every time a new human
comes along. Uh, it also just blew
straight past peak human. It didn't
stop. There was nothing special about
peak human strength at chess. It just
got better and better and better.
Um, another interesting fact is that it
has made humans a little bit better at
chess. Humans playing against
chess computers have
learned from the computers. The best
chess players today are better than the
best chess players of history
in large part because the chess
computers being so strong have taught
them how to play better chess. Though,
they're still
considerably weaker than the computers.
Finally, it's it's notable that chess
has never been more popular.
Um,
Okay. So, you know, let's explore this
possibility. Um, we had toy, we had
tool. Maybe we'll see the same
uh, for large language models doing
science uh, and that we will have a, you
know, fully autonomous AI scientist and
after that an AI Einstein and after
that uh, who knows.
Um, one you know, one feature of the
last few years is it's not just been
that the smartest that the frontier of
intelligence has been getting stronger
and stronger and stronger. Another
feature is that the cost to serve the
the cost to produce and to um,
uh to
you know, elaborate on a fixed level of
intelligence has been getting cheaper
and cheaper and cheaper by many orders
of magnitude. And this this graph stops
a couple of years ago, but those trends
have continued since then.
Um, so what that means is that if you
can make one
uh, AI Einstein, uh, you can make a
billion of them. And we can have, you
know, we'll have billions of superhuman
AI
Einsteins
rampaging around
making it a a truly golden era for
physics.
What the long-term holds for physics,
it's really hard to see. In fact, I
think that's true across the world that
the improvements in artificial
intelligence are making the future
harder to predict.
But we can predict that the next few
years are going to be a golden era of
physics. We're going to take these AI
tools, we're going to put them in the
hands of human physicists, human
mathematicians, and human experts. And
together there's going to be a new
renaissance in science and mathematics.
It's going to be the most exciting time
to be a physicist and the most exciting time
to be a mathematician
in recorded history. And all of these
questions that have burned away at me
for my entire career, I anticipate being
answered in the next few years.
Thank you.
>> [applause]
