[0:01] [applause]
[0:03] >> Okay, thank you. Absolutely delighted to
[0:05] be here. We live at an extraordinary
[0:08] moment in our civilization's history.
[0:11] We have collectively figured out how to
[0:13] refine sand into silicon, then
[0:16] take that silicon and turn it into
[0:18] silicon chips, then assemble those
[0:20] silicon chips into neural networks, and
[0:23] now how to train those neural networks
[0:25] to think.
[0:27] So, I've written about 40 theoretical
[0:29] physics papers in my career so far, but
[0:32] I've stopped. And I've stopped because it
[0:35] felt like too much of a guilty
[0:36] pleasure to handwrite
[0:39] theoretical physics papers one by one,
[0:41] when what I should be doing is
[0:42] contributing to the
[0:44] production of a machine that is going to
[0:46] spew out knowledge on an industrial
[0:48] scale.
[0:49] We've of course had for many years now
[0:52] computer assistance in doing physics,
[0:56] going back to the invention of the
[0:58] pocket calculator, or perhaps even
[1:00] further back to the abacus.
[1:04] This one is different. Those
[1:06] are special purpose tools that we've
[1:08] been using for particular parts of the
[1:11] physics enterprise.
[1:14] They uh help you as one step, and you
[1:16] have to do the rest.
[1:18] What's new is something that we didn't
[1:21] know at the beginning of the decade, but
[1:24] those of us who live in San Francisco
[1:25] certainly think we know now, which is
[1:27] that we know about the large language
[1:30] model. And a large language model has
[1:33] the capability not just to be a special
[1:35] purpose tool that replaces one part
[1:39] of the stack, but in fact to do every
[1:41] single part of my job as a theoretical
[1:44] physicist. It is a general intelligence,
[1:46] and we think that large language models
[1:48] will be the substrate on which we build
[1:50] these general intelligences.
[1:53] Uh what I'm going to tell you about
[1:54] today is using large language models to
[1:57] do
[1:58] maths and physics. I'm going to tell you
[2:00] about the recent past of this process
[2:03] and the successes we've seen, the
[2:04] extraordinary progress indeed that we've
[2:06] seen over the last half decade. I'm
[2:09] going to tell you where we are today and
[2:11] I'm going to tell you a little bit about
[2:12] where I think we're going.
[2:14] Uh but first of all, uh I should remind
[2:16] you what a large language model is.
[2:19] I hope you have by now used
[2:21] one. You can just use one. You can just
[2:23] go to one of these websites and just
[2:25] start talking to them and it'll talk
[2:26] back. Uh it'll talk back in a way that
[2:29] quietly passed the Turing test a couple
[2:31] of years ago and nobody nobody really
[2:33] celebrated it. So, we have Gemini, which
[2:36] is the one that I contribute to.
[2:39] Um and also some others, ChatGPT,
[2:42] Claude, many other options out there, uh
[2:45] all pushing the frontier of machine
[2:47] intelligence.
[2:50] Um at base, a large language model is a
[2:52] kind of neural network. It is an
[2:54] artificial
[2:56] computing device
[2:58] inspired by the human brain, inspired by
[3:00] the arrangement of the neurons in the
[3:01] human brain.
[3:03] Uh and therefore quite unlike uh
[3:05] traditional computer programs.
[3:07] Um
[3:09] At the beginning of the decade, the
[3:10] largest
[3:12] uh large language models had about a
[3:14] billion parameters. That was considered
[3:15] extraordinarily large at the time, and
[3:17] they were called large language
[3:18] models on that basis. Now, we're up to a
[3:20] few trillion.
[3:22] This is still short of the 100 trillion
[3:24] synapses in the human brain, but it
[3:26] turns out it suffices.
[3:31] Um and the one thing you need to know
[3:33] about neural networks, all neural
[3:35] networks including large language
[3:36] models, is that they are not made like
[3:39] traditional computer programs. They are
[3:41] grown, not programmed.
[3:44] What you do is you start off with an
[3:45] assembly of artificial neurons connected
[3:48] with artificial synapses
[3:50] uh with essentially random weights.
[3:53] And then you ask it to start speaking.
[3:54] It'll start outputting words one after
[3:57] the other. And what you'll find is that
[3:59] those words are complete gibberish.
[4:00] It'll just be uh, totally random words
[4:03] at the beginning.
[4:05] And then you train the neural network.
[4:06] You grow, if you like, the neural
[4:08] network. By grow, I mean you don't change the
[4:09] number of neurons, but you change the
[4:12] neural pathways.
[4:13] And the way you train it is that you
[4:16] feed it some text and you encourage it
[4:18] to predict,
[4:20] given a block of text, maybe
[4:22] some section of a book
[4:24] you read on the internet,
[4:27] having seen the first 100 words,
[4:30] what the next word is likely to be. And
[4:32] as I said, it'll just guess at random to
[4:34] begin with. But every time it guesses
[4:36] right, you strengthen that
[4:39] neural pathway. And every time it
[4:41] guesses wrong, you punish that neural
[4:43] pathway. And so slowly over time, you
[4:46] build up some predictive capability for
[4:49] it to be able to predict what the next
[4:51] word is with better and better
[4:53] accuracy. And once it can predict the
[4:54] next word, you can
[4:56] then just take that predicted word,
[4:59] append it as if it were the next word, predict again, and
[5:00] it'll just start talking to you.
[5:02] And that's how the chatbots work that I
[5:04] described.
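The predict-then-append loop described above can be sketched in a few lines of Python. This is only a toy under loud assumptions: it replaces the neural network with a table of word-pair counts (a bigram model), so "strengthening a pathway" becomes incrementing a count, and the names `train_bigram` and `generate` are made up for illustration. A real large language model learns its weights by gradient descent rather than by counting.

```python
from collections import Counter, defaultdict


def train_bigram(text):
    """Count, for each word, which words were seen to follow it."""
    words = text.split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1  # toy stand-in for "strengthening" a prediction
    return counts


def generate(counts, start, max_words=5):
    """Repeatedly predict the most likely next word and append it."""
    out = [start]
    for _ in range(max_words):
        followers = counts.get(out[-1])
        if not followers:
            break  # nothing ever followed this word in the training text
        out.append(followers.most_common(1)[0][0])
    return " ".join(out)


corpus = "the cat sat on the mat and the cat slept"
model = train_bigram(corpus)
print(generate(model, "the", max_words=3))
```

The generation step is the part that carries over to real chatbots: the model's own prediction is fed back in as if it were the next word of the text.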
[5:06] It's a slow process. Once it's seen
[5:08] about a million words, you've trained it
[5:09] on a million words, it's still spewing
[5:12] stuff that's pretty much
[5:13] indistinguishable from gibberish. Once
[5:15] uh, you're up to tens of millions and
[5:16] hundreds of millions and billions, it
[5:18] can string together completely coherent
[5:20] sentences. It knows the rules of
[5:22] grammar. It puts sentences together,
[5:25] but they're not particularly refined
[5:26] sentences. And by the time it's read the
[5:28] entire internet, which is tens of
[5:30] trillions of words, it
[5:33] can converse intelligently on pretty
[5:35] much any topic.
[5:37] That's called pre-training, and
[5:39] that's most of what you do:
[5:40] just training it to predict the next
[5:42] word on the internet. There's a second
[5:44] stage to the process called
[5:46] post-training, uh, in which you
[5:48] essentially send it to finishing school.
[5:50] When it comes off pre-training, it is
[5:51] just trained to predict what the next
[5:54] word is going to be in its
[5:55] training corpus.
[5:57] Uh and it is, you know, somewhat uncouth
[6:00] and uh definitely disobedient. You need
[6:03] to send it to finishing school called
[6:05] post-training where you train it to
[6:08] be polite and you train it to try
[6:10] and be helpful to the user rather than
[6:12] just predict what the next word would
[6:13] be. That's called post-training.
[6:16] Um and that in brief is how you make a
[6:18] large language model.
[6:19] And for the modern large
[6:22] language models with a few trillion
[6:23] parameters, it takes a huge amount of
[6:25] computing power to produce them. It's a
[6:28] few trillion parameters and a few tens of
[6:30] trillions of words. You multiply
[6:31] those together and you get trillions and
[6:33] trillions of flops required to make
[6:35] them.
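As a rough sketch of that multiplication: the factor of 6 floating-point operations per parameter per training word used below is a commonly quoted rule of thumb, not something stated in the talk, and the specific values of 2 trillion parameters and 20 trillion words are illustrative picks inside the "few trillion" and "few tens of trillions" ranges. The function name `training_flops` is made up for this example.

```python
def training_flops(n_params, n_words, flops_per_param_word=6):
    """Estimate total training compute as (constant) x parameters x words.

    The default constant of 6 is an assumed rule of thumb, not a figure
    from the talk.
    """
    return flops_per_param_word * n_params * n_words


n_params = 2 * 10**12   # "a few trillion parameters" (illustrative value)
n_words = 2 * 10**13    # "a few tens of trillions of words" (illustrative value)

total = training_flops(n_params, n_words)
print(f"about {total:.1e} floating-point operations")
```

Under these assumptions the total lands in the 10^26 range, which is the same order as the "moles of flops" rule of thumb that comes up later in the talk.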
[6:36] Um
[6:38] Okay. So, that's large language
[6:39] models, which is what we're going to
[6:41] be talking about. And we're going to be
[6:42] specifically talking about them doing
[6:44] theoretical science.
[6:46] Um
[6:47] Before I begin, I should explain, since
[6:48] this sounds like computer
[6:49] science, how physicists got
[6:51] involved. Well, physicists have been
[6:53] involved in every step of this
[6:54] process. But one particular,
[6:57] pretty striking way in which they were
[6:58] involved at the start of the decade, one that
[7:00] launched the entire modern LLM boom, was
[7:04] through scaling laws. So, physicists
[7:06] just love scaling laws. That's
[7:08] our bread and butter.
[7:09] Some of the scaling laws
[7:11] are simple. If you double
[7:13] Alice's height, you'll quadruple her
[7:15] area and octuple her weight. When
[7:18] it's that simple, it's called
[7:20] dimensional analysis.
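Written out, the Alice example is the following, assuming her shape and density stay fixed as she grows. The symbols h (height), A (surface area) and W (weight) are introduced here for illustration.

```latex
A \propto h^{2}, \qquad W \propto h^{3}
\quad\Longrightarrow\quad
h \to 2h:\;\; A \to 2^{2}A = 4A, \qquad W \to 2^{3}W = 8W .
```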
[7:22] But not all scaling laws are that
[7:23] simple.
[7:25] So, an empirical scaling law that was
[7:27] discovered almost 100 years
[7:30] ago relates the mass of an animal
[7:33] to its power output, to its metabolic
[7:36] rate.
[7:37] And what you find is what is typical
[7:39] in these scaling laws: you plot
[7:40] everything on a log-log plot and you
[7:44] find that over many, many orders of
[7:45] magnitude it's a straight line
[7:48] on the log-log plot, which tells you it's a
[7:49] power-law relationship, all the way from
[7:52] the tiny mouse to the mighty
[7:54] elephant.
[7:55] And this, like many of the
[7:57] scaling laws in physics,
[7:59] was first an empirical
[8:00] discovery
[8:02] in which physicists were not involved,
[8:05] and it actually has a rather curious
[8:06] feature, a curious feature
[8:08] common to many of the scaling laws
[8:10] that physicists deal with.
[8:11] Um and the curious feature is you might
[8:13] imagine the power output of an animal
[8:15] should be proportional to its mass, that
[8:17] every kilogram of your flesh
[8:19] burns metabolically at the
[8:22] same rate, but that's not true.
[8:24] Actually, the larger you are, the less
[8:26] each kilogram burns. This was an
[8:29] empirical discovery first by
[8:31] Kleiber, only much later understood by
[8:34] physicists as a consequence of the
[8:35] fractal dimension of our vascular
[8:37] system.
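A straight line on a log-log plot can be turned into an exponent with an ordinary least-squares fit to the logarithms. The sketch below does that in plain Python. The data are synthetic: they are generated from an assumed exponent of 3/4 (the value usually quoted for Kleiber's law, though the talk does not state it) and an illustrative prefactor of 3.4, so the fit simply recovers what was put in. `fit_power_law` is a hypothetical helper name.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = prefactor * x**exponent by least squares on (log x, log y)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
             / sum((a - mean_x) ** 2 for a in lx))
    intercept = mean_y - slope * mean_x
    return slope, math.exp(intercept)


# Synthetic "animals" from roughly mouse-sized to elephant-sized (kg).
masses_kg = [0.02, 1.0, 70.0, 5000.0]
# Assumed law: rate = 3.4 * mass**0.75 (illustrative numbers, not measurements).
rates = [3.4 * m ** 0.75 for m in masses_kg]

exponent, prefactor = fit_power_law(masses_kg, rates)
print(f"fitted exponent: {exponent:.3f}")
for m, r in zip(masses_kg, rates):
    print(f"{m:>8} kg burns {r / m:.2f} per kilogram")
```

An exponent below 1 is exactly the curious feature mentioned above: the per-kilogram rate printed in the loop falls as the animal gets larger.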
[8:39] All of this is mainly meant to be a
[8:40] warm-up to the idea that
[8:44] we love scaling laws, and so what we
[8:46] did when we found large language models
[8:48] was to make scaling laws for them. And
[8:50] this scaling law is the most famous
[8:54] contribution of theoretical physicists
[8:57] to computer science, and it also started
[8:59] the modern LLM boom.
[9:01] And the scaling law says, if you make a
[9:04] bigger neural network, or more
[9:06] precisely, if you spend more compute
[9:08] training a neural network, and you scale
[9:11] it appropriately in size and training
[9:13] length,
[9:14] how much better performance do you
[9:16] get? So, better performance is down, and
[9:18] what was empirically discovered is
[9:21] that this is linear on a
[9:24] log-log plot like this. There's no
[9:26] law of
[9:28] nature that said it had to be like this,
[9:30] but empirically, this is what it turns
[9:32] out to be.
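In symbols, a straight line on a log-log plot is a power law. The notation below is an assumption made for illustration (L for the prediction error being plotted, C for training compute, a and α for fitted constants); the talk gives no numerical values for them.

```latex
L(C) = a\,C^{-\alpha}
\quad\Longleftrightarrow\quad
\log L = \log a - \alpha \log C ,
\qquad \alpha > 0 .
```

Each tenfold increase in C then lowers log L by the same fixed amount α log 10, which is why the plot looks like a straight line sloping downward.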
[9:34] It was discovered by some physicists in
[9:36] 2020. And this is great. This
[9:40] plot is so simple that even a venture
[9:43] capitalist can understand it.
[9:45] And it told them that if they poured in
[9:47] compute, as in money, they would get
[9:50] better performance, for some
[9:51] definition of performance, which is
[9:53] basically accuracy of predicting the
[9:55] next word on the internet.
[9:57] And this original scaling
[10:00] law was over eight orders of
[10:03] magnitude. We've now extended it eight
[10:05] further orders of magnitude out
[10:08] to the right,
[10:09] and it pretty much continues to
[10:11] hold.
[10:12] Um
[10:13] Okay. Large language models get
[10:15] predictably better with scale.
[10:18] This led to what's called the scaling
[10:19] era, where we've been scaling up neural
[10:21] networks, large language models,
[10:23] furiously ever since,
[10:27] and that has characterized the last 6
[10:28] years.
[10:29] I'm going to show you a lot
[10:32] of straight lines on graphs,
[10:35] lines that really have no
[10:37] business being straight but
[10:39] nevertheless are, and invite you to
[10:41] imagine what will happen if those lines
[10:43] continue to be straight for just a
[10:45] little bit longer. That's going to be
[10:48] part of my talk.
[10:51] The original straight line on a graph,
[10:54] of course, was Moore's law. Moore's
[10:57] law says that over time (it's
[11:00] slightly cheating because the x-axis
[11:01] isn't some physical
[11:04] parameter, it's just date), over
[11:07] a century now, the price of compute
[11:09] has been dropping
[11:11] exponentially,
[11:12] making a straight line on a
[11:14] logarithmic plot.
[11:16] And there's really no reason why
[11:18] that should be so, and yet that has been
[11:20] a feature of our world for the last
[11:22] century. I'm mainly showing
[11:24] this to you to emphasize how little this
[11:28] has to do with the subject of today's
[11:29] talk. The entire
[11:32] era that today's talk is going to
[11:35] focus on is just the time
[11:37] over the last 5 years. That's just
[11:40] the very right-hand edge of this.
[11:42] Compute has really not improved that
[11:43] much in terms of
[11:49] the orders of magnitude we're going
[11:50] to see. It is only the third, and a minor,
[11:53] driver of the progress that I'm going to
[11:54] describe over the last few
[11:57] years.
[11:57] A much larger driver
[11:59] of progress has been merely
[12:01] that we're willing to take the same
[12:03] chips and just buy many, many more of
[12:06] them, assemble them all together in a
[12:09] massive data center, and apply them to
[12:11] the business of training large language
[12:13] models. And as you can see, the amount
[12:15] of flops going to training frontier AI
[12:17] models has increased by a factor of four
[12:20] uh every year since 2010.
[12:24] Similarly, the amount of money that
[12:25] we have devoted to training these
[12:27] has been going up and up
[12:29] and up. It's been going up in this
[12:31] graph at 2.7x per year over the
[12:34] last decade. It is an exponentially
[12:37] growing amount of resources we are
[12:39] throwing at the business of training
[12:41] large language models.
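To get a feel for what "4x per year" and "2.7x per year" compound to, here is a small sketch. The ten-year window is an illustrative choice, and the function names `total_growth` and `doubling_time_years` are made up for this example.

```python
import math


def total_growth(factor_per_year, years):
    """Overall multiplier after compounding a yearly growth factor."""
    return factor_per_year ** years


def doubling_time_years(factor_per_year):
    """Years needed to double at a constant yearly growth factor."""
    return math.log(2) / math.log(factor_per_year)


for label, factor in [("training compute", 4.0), ("training spend", 2.7)]:
    growth = total_growth(factor, 10)
    months = doubling_time_years(factor) * 12
    print(f"{label}: x{growth:,.0f} over 10 years, doubling every {months:.1f} months")
```

At 4x per year the quantity doubles every six months, which is the kind of pace the straight lines on these plots represent.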
[12:42] Even that is actually only the second
[12:44] most important contributor to the
[12:46] progress in large language models. The
[12:48] number one most important contributor to
[12:50] the growth of large language
[12:52] models, and to the improvement of large
[12:53] language models over the last decade, has
[12:55] been algorithmic progress. It's been
[12:57] human ingenuity at figuring out how to
[13:00] build these machines better than we
[13:01] previously knew how to build these
[13:03] machines. Huge amounts of thought have
[13:06] gone into shearing away all the
[13:09] inefficiencies in the way we train them,
[13:12] to understand these systems better, and
[13:14] so to improve them more rapidly.
[13:16] Um
[13:19] And then what this plot is meant to
[13:20] persuade you of is that there's no reason
[13:23] to stop. At least there's no reason to
[13:24] stop based on uh economics or on chips.
[13:29] So,
[13:31] these things take so many flops, so much
[13:34] computation to run, that
[13:36] you have to measure
[13:37] them in Avogadro's numbers,
[13:40] moles' worth of flops. So, some
[13:43] ginormous number. A good rule of thumb
[13:45] is that an Avogadro flop costs about a
[13:46] million dollars in today's
[13:49] money to train.
[13:51] And as we see, since 2020 the size of
[13:53] these training runs has been growing, as
[13:54] I said, exponentially, up from about half
[13:57] a million dollars
[13:58] in 2020 to about a third of a billion dollars
[14:01] last year.
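The rule of thumb converts directly between dollars and operations. The sketch below assumes "an Avogadro flop" means Avogadro's number (about 6.022 x 10^23) of floating-point operations, priced at one million dollars, as stated above. The two budgets are the round figures from the talk, and `flops_for_budget` is a hypothetical helper name.

```python
AVOGADRO = 6.02214076e23            # Avogadro's number
DOLLARS_PER_AVOGADRO_FLOP = 1_000_000  # the talk's rule of thumb


def flops_for_budget(dollars):
    """Floating-point operations a training budget buys under the rule of thumb."""
    return dollars / DOLLARS_PER_AVOGADRO_FLOP * AVOGADRO


for label, dollars in [("2020 run", 0.5e6), ("last year's run", 333e6)]:
    moles = dollars / DOLLARS_PER_AVOGADRO_FLOP
    print(f"{label}: {moles:g} moles of flops, about {flops_for_budget(dollars):.1e} operations")
```

A third of a billion dollars comes out at a few hundred moles of flops, roughly 2 x 10^26 operations.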
[14:03] Uh
[14:04] The point really (and
[14:06] this on the left is
[14:09] a graph from people who investigated
[14:11] this closely) is that those numbers are
[14:12] still pretty small. US GDP is
[14:15] approaching 30 trillion dollars per
[14:17] year. Global GDP is even bigger. We have
[14:19] a very long runway to go before we are
[14:21] converting most of our GDP into training
[14:24] runs. We can scale up many, many more
[14:26] decades and we will but we only will if
[14:31] it's worth it. No one's going to
[14:33] give us trillions of dollars to train
[14:35] ginormous large language models if the
[14:38] only thing we're doing is getting better
[14:39] at predicting the next word on the
[14:41] internet. We need performance. So, what
[14:43] does this buy us?
[14:46] Um
[14:48] So, I'm going to drag us back to ancient
[14:51] history, which is 5 years ago. In this
[14:54] world, that is
[14:56] absolutely the Stone Age. In fact, if we
[14:58] go all the way back to 2019, well,
[15:00] there's different ways to define the
[15:02] strength of a scientist,
[15:03] but by pretty much any one of those
[15:05] ways, if you go back to 2019 the
[15:08] strength of a large language model was
[15:10] no better than a preschooler. It
[15:12] really couldn't string together a coherent
[15:15] sentence, much less combine those
[15:17] sentences into ideas.
[15:19] And we measured progress in those days
[15:22] by performance on
[15:24] benchmarks. We still do;
[15:27] it's just that the benchmarks have changed, as I
[15:28] will describe.
[15:30] A famous and influential early
[15:31] benchmark was called MATH. It was a high
[15:33] school math benchmark. And the
[15:37] creators of this benchmark, who are
[15:38] these uh fellows on the right, uh just
[15:43] went and scraped all sorts of high
[15:46] school math problems from the internet.
[15:49] Uh and then gave them to the large
[15:51] language model and said, "Large language
[15:53] model, are you able to solve these
[15:54] problems?" And here's a sampling of
[15:56] them. Level one: 11%
[16:00] of what number is 77?
[16:03] I think I could do that one.
[16:07] And so on, all the way up to reasonably
[16:08] challenging
[16:11] level five problems.
[16:13] And before they gave them to large
[16:14] language models, they first gave them to
[16:15] humans.
[16:17] "We evaluated humans on MATH and found
[16:18] that a computer science PhD student who
[16:20] did not especially like mathematics
[16:22] attained approximately 40%,
[16:24] while a three-time International Math
[16:26] Olympiad gold medalist attained 90%,
[16:28] indicating that MATH can be challenging
[16:30] even for humans."
[16:32] So there we are. Peak human
[16:35] about 90%; a lazy graduate student, who
[16:37] should be somewhat ashamed of
[16:41] him or herself, got about 40%. But
[16:43] that was still considerably better than
[16:45] the state of the art exactly 4 years ago
[16:48] today. Uh the state of the art 4 years
[16:50] ago today was that large language models
[16:53] could get
[16:54] 6%. Now, of course, this just shows where the
[16:57] difficulty is here.
[16:59] Computers have been able to calculate
[17:00] 11% of what number is 77 for a very long
[17:03] time.
[17:04] I mean,
[17:06] there are many problems, but the
[17:07] problem in those
[17:09] days was just parsing it. What
[17:10] does this even mean?
[17:12] What are these sentences? It's not
[17:13] constructed as something you type
[17:15] into a pocket calculator. There's a step
[17:17] where they need to understand what is
[17:18] being asked and then do it.
[17:20] And large language models were so bad at
[17:22] that that they could
[17:23] barely do better than just random
[17:25] guessing.
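To make the point concrete, here is a sketch of the traditional-program approach to the level one example. The arithmetic is one line, but the program only works because a human has already decided what the sentence means and hard-coded that one phrasing. The pattern and the function name `solve_percent_question` are made up for illustration.

```python
import re
from fractions import Fraction

# Matches exactly one phrasing: "<p>% of what number is <r>?"
PATTERN = re.compile(
    r"(\d+(?:\.\d+)?)% of what number is (\d+(?:\.\d+)?)\??",
    re.IGNORECASE,
)


def solve_percent_question(question):
    """Solve 'p% of what number is r?' using exact rational arithmetic."""
    match = PATTERN.search(question)
    if match is None:
        raise ValueError("question is not in the one phrasing this parser knows")
    percent = Fraction(match.group(1))
    result = Fraction(match.group(2))
    return result / (percent / 100)


print(solve_percent_question("11% of what number is 77?"))
```

Rephrase the question even slightly ("77 is 11% of which number?") and this program raises an error, which is the parsing step the benchmark was really testing.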
[17:27] So they were pretty bad at the
[17:28] time. I'm going to bring you
[17:30] on this journey, which is a journey of
[17:32] what it felt like to be working on large
[17:34] language models from 4 years ago up to
[17:35] the present day, and how
[17:39] expectations have consistently
[17:41] been beaten again and again and again.
[17:44] Um so, you know, you may ask sort of
[17:46] what did people think was going to
[17:48] happen? Uh and actually conveniently you
[17:50] don't have to ask because the people who
[17:52] made the benchmark also made a
[17:55] prediction market for how well people
[17:57] would do on the benchmark in the future.
[17:58] And this was what the prediction market
[18:00] said. It said, you know, 6% uh in 2021
[18:04] uh and then it would slowly increase
[18:07] uh year after year after year and by
[18:08] 2025 we'd be getting 50%.
[18:11] And the people who made the benchmark
[18:12] were just utterly incredulous at this uh
[18:15] and said, "Forecasters predict more than
[18:17] 50% accuracy by 2025. If I imagine an ML
[18:20] system getting more than half of these
[18:22] questions right, I'll be pretty
[18:23] impressed. This still just seems wild
[18:25] to me and I'm really curious how the
[18:27] forecasters are reasoning about this."
[18:29] I think, for a Bay Area
[18:30] rationalist, that is as close as he
[18:33] comes to doubting the efficient market
[18:35] hypothesis. He just can't believe this
[18:37] prediction that it's going to be 50%.
[18:40] And then what we did is we got 50%
[18:43] almost immediately thereafter, with the
[18:45] system we called Minerva.
[18:47] And then by mid-2024 we'd made Max
[18:50] Math, which is a system built on large
[18:52] language models that got 90%. In fact, it
[18:55] beat what was taken to
[18:57] be peak human. We were extremely
[19:00] pleased with ourselves for getting 90%.
[19:02] We celebrated by going out to a '90s
[19:04] roller disco,
[19:07] and we were just
[19:11] unbelievably smug. And then such is the
[19:13] cruelty of this field that 6 months
[19:15] later just the off-the-shelf large
[19:17] language models got it almost perfectly
[19:19] right. This is, you know, a very
[19:20] depressing aspect of working in this
[19:22] field that you work extremely hard and
[19:24] then the next generation of models comes
[19:26] along and just basically one-shots it.
[19:29] Um
[19:31] Okay, and that's it. It's dead. The math
[19:33] benchmark is dead. Uh this is the sad
[19:35] fate of a benchmark in
[19:38] today's LLM era, that it goes in
[19:41] pretty short order from being way too
[19:43] hard to be a useful marker of progress
[19:46] to being way too easy to be a useful
[19:47] marker of progress.
[19:50] Um so here we go. We can sort of draw a
[19:51] little line on this plot here as we
[19:53] zoomed from preschool to elementary
[19:55] school to high school uh over the course
[19:57] of those years. Uh a good rule of thumb
[19:59] is that we're moving about four times as
[20:01] fast as a human student moves. For every
[20:05] year that passes, we advance 4 years
[20:07] into the future.
[20:09] Okay. So that's
[20:11] that. Let's go harder. Well, first of
[20:12] all, let's just look at the hardest
[20:14] tranche of these math problems, the
[20:16] hardest 20% of them.
[20:18] And you can plot the same thing
[20:20] there as well. On the very hardest of these
[20:22] math problems, the so-called level five
[20:24] math problems, again, just 3
[20:27] years ago we were really not doing well at
[20:29] all. It was pretty close to
[20:30] random guessing. And over the course
[20:33] of 2 and 1/2 years went from not much
[20:36] better than random guessing to
[20:38] essentially saturated. Uh the benchmark
[20:41] is now dead.
[20:44] Okay. I'm going to tell you then a
[20:45] little bit about some of the
[20:50] techniques we use to make these
[20:51] systems better at math and reasoning.
[20:55] This is just a snapshot, really,
[20:57] and there are new tools and new
[21:00] ways being developed all the time, but
[21:02] I'll give you an idea of
[21:05] some of the
[21:07] things that go into it. The main reason
[21:09] I'm doing this is just to convince you
[21:12] that it's not
[21:13] that impressive. Like a lot of the
[21:15] things that we do here are just kind of
[21:17] the obvious thing to do. Somebody tried
[21:19] them, it turned out they worked, and we
[21:20] started to do it.
[21:22] And therefore, hopefully, to give you
[21:25] a belief that we will continue to find
[21:28] lots of low-hanging fruit for how to
[21:29] make these models better, and convince
[21:31] you that these models are going to
[21:32] continue to get better in the near
[21:33] future. So, the
[21:36] biggest reason these models
[21:38] are getting better is what's
[21:39] sometimes called the bitter lesson. It's
[21:41] scale. You just scale these systems up.
[21:45] Um you take a bigger neural network, or
[21:47] you take the same-size neural network,
[21:48] and train it for longer.
[21:50] And you find a way to pour more compute
[21:53] into training these neural networks.
[21:55] This was called the bitter
[21:57] lesson by Rich Sutton, who is a famous
[21:59] Canadian computer scientist. And it's
[22:02] bitter,
[22:03] obviously not if you're a large language
[22:05] model, because then it's great: you get
[22:06] stronger. It's bitter if you're a human
[22:08] who
[22:09] really likes to design very clever
[22:11] systems to do things. You build some
[22:13] super clever way, like we did with
[22:15] our Max Math result, to
[22:17] eke out some particular result, and
[22:19] then all of your human cleverness is
[22:22] just washed away next time you scale up
[22:23] the model, and the model figures out all
[22:26] your clever tricks for itself. And you
[22:28] might as well have just worked on
[22:31] scaling up the model. This
[22:33] is a big recurring theme: that each new
[22:35] generation of model is better than even
[22:37] the special purpose models of the
[22:39] previous generation.
[22:42] Again, more and better data. Here's one,
[22:45] just real low-hanging fruit. What's
[22:47] called chain of thought, or asking
[22:48] nicely. And what you do is instead of
[22:50] asking the question,
[22:53] you ask the question, and then before
[22:55] you press enter to send it to
[22:57] the chatbot for the chatbot to answer,
[22:59] you say,
[23:00] "But please be careful and think
[23:03] step-by-step."
[23:05] And that sounds just totally insane,
[23:07] that that would improve the performance
[23:10] of the model. And certainly, if you
[23:12] grew up using conventional
[23:13] computer programs, you don't just say
[23:16] to Mathematica or your pocket
[23:18] calculator, "Please be careful," before
[23:20] pressing enter. Or you can if
[23:22] you like, but it will not improve the
[23:23] performance. These large language
[23:24] models are a very alien kind of
[23:26] intelligence compared to the traditional
[23:28] programmed computer programs of my
[23:31] youth. They are ones with which you
[23:34] can converse, you can plead, and if you
[23:36] tell them to think step-by-step, they
[23:37] will think step-by-step
[23:39] and they will perform better.
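In code, the whole technique is a string concatenation. The sketch below only builds the prompt text; it deliberately does not call any real chatbot API, and `build_prompt` and `STEP_BY_STEP_SUFFIX` are hypothetical names. The suffix wording follows the phrasing quoted above.

```python
STEP_BY_STEP_SUFFIX = "Please be careful and think step-by-step."


def build_prompt(question, chain_of_thought=True):
    """Return the text to send to a chatbot, optionally asking it to reason first."""
    question = question.strip()
    if not chain_of_thought:
        return question
    return f"{question}\n\n{STEP_BY_STEP_SUFFIX}"


print(build_prompt("11% of what number is 77?"))
```

Nothing about the model changes. The only difference is the extra sentence in the input, which is why the effect on performance is so surprising to anyone used to conventional programs.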
[23:42] Just as an anecdote, of course,
[23:44] people then soon iterated over every
[23:46] possible thing you could tell it
[23:49] to do before it
[23:51] started. And "Think step-by-step" was
[23:54] found to be the best. The one that was
[23:56] found to be the worst was in fact
[23:59] "Come on, kid, you can do it. Don't
[24:01] think, just do."
[24:04] That will in fact degrade performance
[24:06] by about 20 percentage points on the
[24:09] question it's about to
[24:10] attempt.
[24:13] Okay, another one. Thinking for a long
[24:14] time. That
[24:16] sounds obvious, but we needed
[24:19] to carefully train these things with
[24:21] reinforcement learning not just to
[24:22] blurt out the answer.
[24:24] Asking them to think step-by-step will
[24:26] make them think for dozens of words
[24:28] rather than just blurting out their best
[24:30] guess. But, we then needed to carefully
[24:32] train them to think for thousands of
[24:33] words before uh putting out their
[24:35] answer. If you remember, in late 2024
[24:38] there was
[24:39] a model called Strawberry that
[24:41] massively improved performance, and then
[24:42] everybody else caught up pretty quickly.
[24:45] That was exactly this, training these
[24:47] models to think for a very long time.
[24:49] Then there's reinforcement learning.
[24:51] I didn't go into that, but
[24:53] you train them
[24:56] to do what you
[24:58] want them to do and to try and be more
[24:59] accurate. And nowadays, over the
[25:01] last year, a big technique has been
[25:03] conversations between multiple LLMs. If
[25:07] you've ever used a large language model
[25:10] and had it solve
[25:13] a long and difficult problem, sometimes
[25:15] you find you need to babysit it. You
[25:17] just need to say, "Okay, that's your
[25:19] best guess so far. Can you review your
[25:21] guess and just keep going and try
[25:22] again?"
[25:26] So people automate that. They have
[25:28] large language models babysit large
[25:29] language models, where one just keeps
[25:31] saying, "Keep trying. Keep trying. Keep
[25:32] trying." Or maybe, beyond that, you
[25:35] then get more sophisticated and you have
[25:37] a whole
[25:39] conversation amongst a group of
[25:41] large language models, all of which have
[25:42] different roles. One is there to be
[25:44] creative. One is there to come up with a
[25:45] master plan. One is there to take the
[25:47] others' ideas and try and integrate it.
[25:49] One is there to be a skeptic who pushes
[25:51] back on what people are saying.
[25:54] This is also found to greatly
[25:55] improve the performance of large
[25:56] language models: spending more
[25:59] compute at test time in order to improve
[26:01] performance.
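The structure of such a multi-role conversation can be sketched as a loop over roles. Everything here is an assumption made for illustration: the role instructions paraphrase the four roles listed above, `run_discussion` is a made-up name, and the model is passed in as a plain function so that no real chatbot API is implied. The `stub_model` at the bottom exists only so the sketch runs on its own.

```python
from typing import Callable, List, Tuple

# Roles paraphrased from the talk; the exact wording is illustrative.
ROLES = {
    "creative": "Propose new ideas.",
    "planner": "Come up with a master plan.",
    "integrator": "Take the others' ideas and combine them.",
    "skeptic": "Push back on what the others are saying.",
}


def run_discussion(problem: str, ask: Callable[[str], str],
                   rounds: int = 2) -> List[Tuple[str, str]]:
    """Let each role speak in turn, showing it everything said so far."""
    transcript: List[Tuple[str, str]] = []
    for _ in range(rounds):
        for role, instruction in ROLES.items():
            history = "\n".join(f"{r}: {m}" for r, m in transcript)
            prompt = (f"Problem: {problem}\n"
                      f"Your role: {instruction}\n"
                      f"Discussion so far:\n{history}")
            transcript.append((role, ask(prompt)))
    return transcript


def stub_model(prompt: str) -> str:
    """Placeholder for a real model call; just reports the prompt length."""
    return f"(reply to a prompt of {len(prompt)} characters)"


discussion = run_discussion("Prove the claim.", stub_model, rounds=2)
for role, message in discussion:
    print(f"{role}: {message}")
```

Each round costs one model call per role, so more rounds means more compute spent at test time, which is the trade described above.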
[26:03] Okay, that's some of
[26:05] the ideas that we've been using,
[26:06] and there are many others that I
[26:07] could describe.
[26:09] Um,
[26:11] I talked about high school maths. Now,
[26:12] let's talk about graduate science. This
[26:14] is a considerably
[26:16] trickier
[26:17] benchmark. Uh, a benchmark made later
[26:20] and solved later.
[26:21] It's called GPQA. This is meant to
[26:24] imitate the kind of problems
[26:26] you would face as a first-year graduate
[26:29] student working towards your PhD, where you
[26:31] take some exams at the end of your first
[26:32] year to ensure that you have mastery of
[26:34] your subject.
[26:38] PhD-level experts scored about 70%.
[26:44] Here are some examples of the
[26:45] problems. We're not in high school
[26:47] maths world anymore. Uh, the universe is
[26:49] filled with cosmic microwave background
[26:51] and then it asks you some problem. The
[26:53] idea is that if you were in an adjacent
[26:54] field, you don't know how to answer
[26:55] that. Now, I actually happen to be a
[26:57] physicist, so I do, you know, given a
[27:00] few quiet moments, maybe not on this
[27:03] stage, but in the green room, um
[27:05] I could have told you that this was the
[27:06] answer. But, you show me the chemistry
[27:08] version of this problem,
[27:10] uh and I have absolutely no idea. Um
[27:14] arrow appears.
[27:16] Um
[27:17] and uh yeah, GPQA, uh a much harder
[27:20] benchmark. Um and uh correspondingly,
[27:23] this graph is shifted about a year
[27:25] compared to the math benchmark I was
[27:27] describing before. We were essentially
[27:29] random guessing until about the
[27:30] beginning of 2024.
[27:33] Uh and then over the course of 2024 and
[27:36] 2025, we went from random guessing past
[27:39] expert human level, and now they achieve
[27:41] an essentially perfect score.
[27:44] I mean, that's it. GPQA is dead. It has
[27:47] once again suffered the fate of all
[27:49] benchmarks,
[27:51] uh and it is no longer useful cuz it is
[27:53] too easy.
[27:54] Now, you might be skeptical of these
[27:58] results. You might think
[28:00] um
[28:01] okay, they can answer these questions
[28:03] correctly, but they can answer these
[28:04] questions correctly not because they
[28:06] have learned how to do maths or learned
[28:08] how to do science. You might think that
[28:10] the reason that
[28:11] they can answer these questions
[28:12] correctly is they have simply memorized
[28:14] the answers to these questions. These
[28:16] questions are on the internet, the
[28:18] answers are on the internet, and
[28:19] they've read the entire
[28:21] internet, and they've memorized the
[28:23] answers.
[28:25] We do not believe that that is what is
[28:26] happening. In fact, there is good
[28:27] evidence that that is not what is
[28:28] happening. The main way you test that is
[28:31] you make look-alike problems. You make
[28:32] problems that are like the ones, you
[28:35] know, seemingly drawn from the same
[28:36] distribution as the ones in the math
[28:39] data set or the GPQA data set, but are
[28:41] not in the GPQA data set, and you see
[28:43] how well they do on those. You make new
[28:45] problems. You give those new problems to
[28:47] the large language models, and for
[28:50] reputable large language models, you see
[28:53] little difference in performance between
[28:55] how they do on the established test set
[28:57] and how they do on this held-out test
[28:59] set. So, we really think that these
[29:02] systems really are learning how to do
[29:04] maths and physics.
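The look-alike check described above amounts to comparing accuracy on the public test set with accuracy on a fresh, held-out set drawn from the same kind of problems. A minimal sketch in Python, assuming multiple-choice answers; the function names and every data point below are made up for illustration and are not results from the talk:

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the reference answers."""
    if len(predictions) != len(answers) or not answers:
        raise ValueError("need two non-empty sequences of equal length")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def memorization_gap(
    public_preds: Sequence[str],
    public_answers: Sequence[str],
    heldout_preds: Sequence[str],
    heldout_answers: Sequence[str],
) -> float:
    """Public-set accuracy minus held-out accuracy.

    A large positive gap is a warning sign of memorization; a gap near
    zero is what you would expect if the model learned the skill.
    """
    return accuracy(public_preds, public_answers) - accuracy(
        heldout_preds, heldout_answers
    )


# Toy, invented data: 4 of 5 correct on the public set, 3 of 4 on the held-out set.
public_preds = ["A", "C", "B", "D", "A"]
public_answers = ["A", "C", "B", "D", "B"]
heldout_preds = ["B", "B", "C", "A"]
heldout_answers = ["B", "D", "C", "A"]

gap = memorization_gap(public_preds, public_answers, heldout_preds, heldout_answers)
print(f"gap = {gap:.2f}")
```

With real benchmarks the sets are much larger, and one would also want an uncertainty estimate on the gap before reading anything into it.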
[29:07] Um but just to be sure,
[29:09] um I made my own private test set.
[29:12] Uh exams that I'd given my class about
[29:15] general relativity or quantum mechanics,
[29:17] graduate exams in graduate classes at
[29:19] Stanford.
[29:20] Um never on the internet. I would say
[29:22] they're pretty easy-ish for
[29:24] first-year graduate exams.
[29:27] Um and I hand graded them. So, you know,
[29:29] you don't want to be concerned that
[29:31] there's some problem with your
[29:32] computer grading system. I just hand
[29:34] graded the performance of all these
[29:35] models.
[29:36] Uh and what I found is that from late
[29:38] 2023,
[29:40] over the following 18 months, these
[29:42] models got to 100% accuracy.
[29:45] Uh and that's it. My benchmark, sadly,
[29:49] dead.
[29:52] Um okay. So, you know, here we can
[29:54] plot it forward a few more
[29:56] years, as they've, at least as far
[29:58] as exam taking goes, accelerated from
[30:00] preschool, past elementary school, high
[30:03] school, college, and now operating at
[30:05] the PhD level.
[30:08] Uh and then it became very popular to
[30:10] just put out benchmarks.
[30:12] Um you know, this one I think was called
[30:14] Humanity's Last Stand before they changed
[30:16] it to Humanity's Last Exam, but uh it's
[30:18] a great uh
[30:20] you know, activity trying to
[30:23] uh measure the performance of these
[30:26] large language models and how they do.
[30:28] But every single one of them suffers
[30:30] the same fate: going from too hard to be
[30:33] interesting to too easy to be
[30:34] interesting over the course of a year
[30:36] and a half or 2 years.
[30:39] Um
[30:44] The next thing to fall was the
[30:47] International Maths Olympiad.
[30:50] Um I was actually giving a version of
[30:51] this talk uh in New York just over a
[30:53] year ago, and a famous computer
[30:55] scientist who'd won a Turing Award told
[30:57] me that this was all very well, but this
[30:58] was just memorization and retrieval.
[31:01] A large language model would never do
[31:03] something creative like be able to solve
[31:05] an International Maths Olympiad problem
[31:07] it had not seen before.
[31:09] Um this is just over a year ago. Uh
[31:11] International Maths Olympiad, if you
[31:12] don't know what it is, well, it is
[31:15] high school maths, but it is the hardest
[31:17] high school maths in the known universe.
[31:19] Uh, here are some of the problems
[31:21] on it. This is problem three from this
[31:23] year. Um I have some game, I would say,
[31:27] and I have no idea how to begin
[31:30] answering that question.
[31:32] Um and uh you know, the the smartest
[31:35] 18-year-olds in the world go and uh
[31:37] train for a year or two, uh and then go
[31:39] to compete in this competition, and
[31:41] they're given six problems. Um, this is,
[31:44] you know, different from the other
[31:45] ones because it is
[31:47] considered to require real creativity to
[31:49] solve them.
[31:50] There's no way to just look up the
[31:51] answer, or even just to
[31:54] follow an established algorithm. It
[31:56] requires real creativity.
[31:58] Um That's why, you know, we were told
[32:00] that the International Maths Olympiad
[32:02] was a threshold that large language
[32:04] models would never pass.
[32:06] Um, and then last summer, we
[32:10] passed it. In fact, we got five of the
[32:11] six problems exactly correct. Um We got
[32:14] a gold medal in the International Maths
[32:16] Olympiad. Um
[32:18] And you know, there are only a very
[32:19] small number of humans in the world now
[32:21] who are better uh than the large
[32:25] language models at doing this. There were a
[32:26] very limited number of humans who got
[32:27] six out of six uh correct.
[32:30] Um, and there's a sort of pleasing
[32:32] aspect to this as well, which is it's
[32:35] not just answering these
[32:37] questions by dumping some inscrutable
[32:40] billion-line
[32:42] tangle of formal mathematics that
[32:46] gives you no idea why it's correct.
[32:48] Here's what the president of the
[32:49] International Math Olympiad had to say.
[32:51] We can confirm that Google DeepMind has
[32:53] reached the much-desired
[32:54] milestone, earning 35 points, five
[32:57] out of six correct, a gold medal score.
[32:59] This is the pleasing bit. Their
[33:01] solutions were astonishing in many
[33:02] respects. International Math Olympiad
[33:04] graders found them to be clear, precise,
[33:07] and most of them easy to follow. So,
[33:10] these systems are not just dumping
[33:13] inscrutable solutions. They are in some
[33:15] sense thinking similar to how a human
[33:17] thinks or at any rate outputting answers
[33:20] similar to how a human outputs answers,
[33:23] elegant and using many of the same
[33:25] abstractions.
[33:27] Um
[33:28] you know, LLMs can be very clever, as
[33:31] I think I've described to you.
[33:33] But you know, it's always worth sort of
[33:35] pausing for a moment just to see this.
[33:38] So, this is a classic way to torture a
[33:39] large language model.
[33:41] Um
[33:42] Uh
[33:43] oh, I hope you can read that.
[33:45] A boy and his father are in a car
[33:46] accident and the father is sadly killed.
[33:49] The boy is rushed to the hospital where
[33:50] he's taken to the operating room. Upon
[33:52] seeing him the surgeon exclaims, "I
[33:54] can't operate on him. He's my son." How
[33:56] is this possible?
[33:57] Um so, what's the answer?
[34:00] Um of course this is an incredibly
[34:01] classic problem that the large language
[34:03] model has read perhaps a million times
[34:06] on the internet, and it answers it
[34:08] very well. The classic answer is that
[34:10] the surgeon is the boy's mother. The
[34:12] father was killed in the accident but
[34:14] the boy has two parents, etc. etc.
[34:16] etc. The riddle became famous because
[34:18] many people unconsciously assume the
[34:19] surgeon was male even though nothing in
[34:21] the story says that. Okay, that's the
[34:23] model being clever but it's not that
[34:26] impressive because it has seen this
[34:28] version
[34:29] thousands of times before in its
[34:30] training set.
[34:31] So, then you ask it
[34:33] a variant of that
[34:35] problem.
[34:36] A boy and his mother are in a car accident
[34:38] and the mother is sadly killed. The boy
[34:40] is rushed to the hospital where he is
[34:41] taken to the operating room. Upon seeing
[34:43] him, the surgeon (open parenthesis, who
[34:46] is the boy's father, close parenthesis)
[34:48] exclaims, "I can't operate on this
[34:50] child. He's my son." How is this
[34:52] possible?
[34:53] Uh and the large language model says,
[34:55] "The surgeon is the boy's mother, his
[34:56] other parent." The riddle plays on the
[34:58] assumption that surgeons are
[34:59] typically male, which leads people to
[35:00] overlook the possibility that the
[35:02] surgeon is the boy's mother. And of
[35:03] course, the reason is telling you
[35:05] something about the way these things are
[35:06] trained. It has seen uh this standard
[35:09] version of it thousands of times in its
[35:11] training set. Uh, until somebody
[35:13] invented this to torture large language
[35:15] models, it has probably never seen this
[35:16] version. And so, it just sort of snaps
[35:19] to the standard version.
[35:21] Uh this is not an insuperable weakness of
[35:24] large language models, but it is a
[35:25] signature of how they're trained. Um
[35:28] you occasionally run into uh strange uh
[35:31] weaknesses like this.
[35:34] Okay, enough of large language models
[35:36] being uh stupid. Let's get back to them
[35:39] being clever. We'd reached about a year
[35:40] ago in the story.
[35:42] Uh a year ago when we got gold or just
[35:45] 10 months ago when we got gold at the
[35:46] International Math Olympiad. Uh progress
[35:48] has very much not stopped since then.
[35:52] Uh now I'm going to tell you about a
[35:53] result that my group did
[35:55] at the end of last year.
[35:58] Um and this is novel mathematical
[35:59] research. Up to now, everything I've
[36:01] been describing we already knew the
[36:02] answer before we started or at least
[36:04] somebody did. Somebody invented the
[36:05] International Math Olympiad problem and
[36:07] knew the answer when they wrote it down.
[36:10] Uh what I'm going to describe to you
[36:11] is novel mathematical research. And
[36:14] uh you know
[36:16] Uh this was Centaur-style mathematical
[36:19] research. Centaur, a mythical beast,
[36:21] half human, half not human.
[36:24] Uh, in the
[36:27] classic mythological centaur, the
[36:29] non-human part was a horse.
[36:32] The non-human half I'm going to be
[36:33] describing today is a large language
[36:35] model. And so, what centaur means is
[36:37] that you have a human working
[36:38] collaboratively with a large language
[36:39] model to try and do new mathematical
[36:42] research.
[36:44] Um and we started this last September
[36:47] working together with some
[36:49] uh
[36:50] professional mathematicians.
[36:52] Um and the output was this new result, which
[36:55] I think at the time we put it out was
[36:57] the most impressive thing uh that had
[36:59] yet been done with large language models
[37:01] in maths. It is very far today from the
[37:04] most impressive thing, as I
[37:05] will describe, but this is the state of
[37:07] the art as it was late last year.
[37:10] Um and one of the authors, one of our
[37:12] co-authors is a
[37:13] Stanford University professor and
[37:14] president of the American Mathematical
[37:15] Society. And since I won't explain the
[37:17] mathematics to you, I'll just give you
[37:19] his testimonial. Um which was that
[37:22] uh
[37:23] we found that Gemini's argument was no
[37:25] mere repackaging of existing proofs. It
[37:27] was the kind of insight I would have
[37:28] been proud to have produced myself. So,
[37:30] this is sort of the nature of the
[37:33] uh state of the art as of late last
[37:35] year: the large language models
[37:36] for the first time are coming up with
[37:38] entirely novel arguments that were
[37:41] uh the kind to which very
[37:43] well-respected mathematicians were
[37:45] willing to put their name as
[37:46] co-authors. Uh, it was not entirely
[37:48] done by the large language model. There
[37:50] was an interplay, a conversation in
[37:52] which the large language models came up
[37:53] with candidate proofs and the human
[37:55] experts studied those proofs, tried to
[37:57] discern good from bad, and tried to
[37:59] encourage the large language models to
[38:00] focus on what was good. But, eventually
[38:02] the entire proof was put together under
[38:04] human guidance by the large language
[38:06] model.
[38:08] Um okay. So, uh here we are, one
[38:12] more year into the future
[38:13] uh as we approach the beginning of
[38:15] 2026.
[38:17] Um and so,
[38:19] you know,
[38:20] the natural question is what's next?
[38:23] What comes next? And let's just talk
[38:25] about two possibilities. Um
[38:28] you know,
[38:29] it is very difficult to predict the
[38:31] future of
[38:32] uh
[38:33] how AI is going to go. As this uh
[38:36] absolutely insane plot from the
[38:38] Financial Times shows,
[38:40] uh trying to track real GDP
[38:43] um
[38:44] and having extremely high variance in
[38:46] its projected outcomes over the
[38:49] coming decade.
[38:50] Um but let's try. We're going to do it
[38:52] anyway. Um and so
[38:54] uh one possibility, you know, as
[38:56] we've seen, we've moved
[39:00] from toy to tool. One possibility is
[39:02] that we essentially stop there.
[39:05] Um if I track my own strength as a
[39:07] scientist uh over my life,
[39:10] uh I was, you know, absolutely crushing
[39:12] it in preschool uh and continued to get
[39:15] uh you know, better and better and
[39:16] better as I went through high school,
[39:18] college, PhD, uh this is the Perimeter
[39:20] Institute.
[39:21] Um but then uh you know, eventually I
[39:23] stopped getting better
[39:24] uh and I sort of plateaued and maybe if
[39:26] I'm being a little bit honest with
[39:27] myself, started a very gradual decline.
[39:30] Um
[39:31] And now it's unlikely these machines are
[39:32] actually going to decline given that we
[39:34] can just save them to disk, but uh it's
[39:36] certainly, you know, one logical
[39:38] possibility that we're going to make no
[39:39] further progress
[39:42] uh and that we stay where we are. I
[39:44] don't think that's what's going to
[39:44] happen, but let's explore that
[39:45] possibility. So, where would we be
[39:48] if we made no further progress?
[39:50] Um well, here's what doesn't work. Um
[39:53] what doesn't work is just taking your
[39:55] favorite large language model and
[39:56] saying, "Please invent a novel theory of
[39:57] quantum gravity for me." It will output
[39:59] an answer. That answer will simply
[40:01] not be worth your time reading. It will be
[40:03] AI slop. Uh if you read it, it uh will
[40:06] probably bore you. It may drive you
[40:07] insane.
[40:08] But it's not going to enlighten you
[40:10] about quantum gravity. Um more
[40:12] generally, the symptoms are that large
[40:14] language models are low agency, they are
[40:16] slow learners, they are poor at
[40:17] planning, and they're poor at error
[40:19] correction. Every single one of those
[40:21] four problems we're working on, every
[40:24] single one of those four problems has
[40:25] got much better over the course of the
[40:26] last year, but every single one of those
[40:28] problems is still there.
[40:31] Um
[40:32] Now, when I
[40:33] first turned my mind to putting the
[40:35] slides together for this talk, um
[40:38] I also included this bullet point,
[40:40] which is if large language models are so
[40:42] clever,
[40:43] um how come they haven't made any major
[40:45] breakthroughs yet? Certainly if I had a
[40:47] human student who could ace every
[40:51] graduate exam in every subject uh from
[40:54] chemistry to physics to ancient Sanskrit
[40:58] uh all the way through, I would
[40:59] certainly have expected them to have
[41:00] made brilliant contributions by now. Why
[41:03] have large language models not done the
[41:05] same?
[41:06] Um and in the spirit of intellectual
[41:08] honesty, since I plotted everything else
[41:10] as a straight line on a graph, I felt
[41:12] compelled to also plot uh this as a
[41:14] straight line on a graph uh
[41:17] showing no major breakthroughs. Uh and
[41:19] included the question mark there uh to
[41:21] say that, you know, okay, sure, trust
[41:23] straight lines on graphs, but maybe you
[41:25] don't trust this straight line on a
[41:26] graph, um and maybe by the end of 2026
[41:29] we'll be quibbling about what the word
[41:31] major means.
[41:33] Um
[41:33] Well, let's come back to that in a
[41:34] little bit.
[41:36] That's what doesn't work yet. What
[41:38] already works, and in fact most of the
[41:40] stuff has worked for a while now, is
[41:42] first of all a non-judgmental tutor.
[41:44] This is what I was using them for even,
[41:46] you know, in
[41:47] uh mid-2023, 3 years ago. I would say
[41:50] they were already useful for this.
[41:51] They've read all the textbooks, uh and
[41:53] you can just talk to them and they will
[41:54] explain stuff to you, stuff they've read
[41:56] in the textbooks. If it's not in the
[41:57] textbook and not in a large number of
[42:00] papers, they may struggle, but if it's
[42:01] a standard textbook thing, even a very
[42:03] advanced textbook, they will not only
[42:05] tell you what the right answer is,
[42:06] that's what a textbook will do, too,
[42:08] they will debug your misunderstandings
[42:10] about wrong things. As a physicist, I'm
[42:12] slightly embarrassed that there are a
[42:14] number of topics I feel I should
[42:15] understand but don't.
[42:16] And if at 3:00 in the morning I want to
[42:18] understand them, I either need to find a
[42:20] world expert, wake them up, and have
[42:22] them not be mad at me, or I can just
[42:24] talk to a large language model, which is
[42:26] always there, always waiting for me, and
[42:28] doesn't judge. And it will debug my
[42:30] misunderstandings. This is greatly
[42:32] accelerating my understanding of
[42:33] theoretical physics. I think it's
[42:35] greatly accelerating the understanding of all students who
[42:36] use it correctly.
[42:40] Uh, coding assistant. I mean, it's
[42:41] almost insulting nowadays to call them a
[42:43] coding assistant. Those who have
[42:44] followed the progress of these models
[42:46] over the last 6 months have seen them
[42:48] become expert coders, going from
[42:50] essentially auto-complete for your code all
[42:52] the way through to where you just tell
[42:54] them what kind of thing you want and
[42:55] they go away for 10 minutes or an hour
[42:58] or more and come back to you with a
[42:59] fully developed Python code set. Uh code
[43:02] over this year has been becoming free.
[43:04] And once code is free, we will discover
[43:05] that many problems, including physics
[43:07] problems that we previously thought were
[43:08] not coding problems, can in fact be cast
[43:10] as coding problems.
[43:12] Semantic literature search. It can just
[43:14] understand what's in the literature. You
[43:15] give it your paper, you say does this
[43:17] idea exist in the literature? It's read
[43:18] the entire literature and it understands
[43:20] the entire literature. It'll answer that
[43:21] for you. Super useful tools. Uh
[43:23] brainstorming partners, super creative,
[43:25] in many ways too creative, um
[43:27] uh
[43:28] and very confident in itself,
[43:29] unfortunately, sometimes, um proving
[43:31] lemmas, as I described, and you know
[43:34] more generally it is fast, it is broad,
[43:38] it is tireless, and it is clever.
[43:40] Took me decades to get good at physics.
[43:43] It takes every other student decades to
[43:45] get good at physics. It is very
[43:47] expensive to train a student. It is also
[43:49] very expensive to train a large language
[43:50] model. The great advantage of a large
[43:52] language model is that once you train it
[43:54] once, you can then serve it, you can
[43:56] make infinite instances of it, uh huge
[43:58] numbers of instances, and have many of
[44:00] them running in parallel.
[44:04] Yeah. Even with no further progress,
[44:07] large language models are going to have an
[44:08] absolutely huge impact on this subject,
[44:10] even if progress stopped today. I would
[44:12] say that was true a year ago.
[44:14] It was certainly true six months ago.
[44:16] And uh today, it's completely
[44:20] out of the bag. Even if
[44:21] we have no further progress whatsoever,
[44:22] these things are going to revolutionize
[44:24] the conduct of physics. Even if for some
[44:26] crazy reason all the chip fabs in the
[44:28] world blow up tomorrow and we uh are not
[44:31] allowed to train any more models, the
[44:32] models we have are enough to
[44:34] revolutionize physics.
[44:36] Um but I don't think there'll be no
[44:38] further progress, and let me tell you
[44:39] why.
[44:40] Um I would say the outside view is that,
[44:43] you know, lines are going up on all
[44:44] these graphs. Uh there's of course no
[44:46] law that says they need to go up
[44:47] forever, but there's also no law that
[44:49] they need to stop now. Uh why would they
[44:51] stop right now? Uh perhaps the more
[44:53] compelling inside view
[44:55] uh is that there is a lot of
[44:57] algorithmic low-hanging fruit.
[44:59] The ways we make large language models
[45:01] today, if you could see how the sausage is
[45:02] made, are
[45:05] not particularly impressive. We just do
[45:07] the obvious thing and it works pretty
[45:09] well.
[45:11] There are many more obvious ideas that
[45:13] you could write down that we have simply
[45:14] not tried yet or not tried at the
[45:16] appropriate scale. And when we do try
[45:18] them, surely many of them will work. Uh
[45:21] there are many inefficiencies in the way
[45:22] we make large language models, and we
[45:25] fully anticipate that large language
[45:26] models will continue to improve.
[45:28] And then add on top of that, the huge
[45:29] number of people
[45:31] and the huge number of chips that are
[45:32] just in the process of arriving in the
[45:34] study of large language models.
[45:36] You know, a common pessimistic view
[45:39] um that has been repeated, though the
[45:41] goalposts keep moving for what it
[45:42] implies, is that large language models
[45:44] can only pattern match and not generate
[45:46] new ideas. Or or perhaps they can only
[45:48] interpolate but not extrapolate. Or
[45:50] perhaps we need fundamentally new ideas
[45:51] to reach AGI.
[45:53] That is not the consensus in San
[45:54] Francisco, and it's not my belief. I
[45:56] think the ideas we have, and indeed
[45:58] probably the chips we have today, are
[46:00] all sufficient to reach AGI. Maybe there
[46:02] will be new ideas, certainly there will
[46:04] be new chips, but what we have already
[46:06] is enough that if we just keep going and
[46:08] just scale up and refine what we have,
[46:10] we will reach artificial general
[46:12] intelligence.
[46:14] Um similarly, I think the point is that,
[46:16] okay, maybe large language models are
[46:18] just pattern matching. What we've
[46:19] learned about the nature of intelligence
[46:21] is that in some sense everything is
[46:23] pattern matching at a sufficiently high
[46:25] level of abstraction. Even things that
[46:27] look like uh big breakthroughs, if you
[46:30] look sufficiently abstractly, are really
[46:32] just uh pattern matching in some
[46:34] abstract space.
[46:37] Um
[46:38] yeah, and this is kind of making the
[46:39] same point. Things just keep working. Uh
[46:42] the slogan that people say is the models
[46:44] just want to learn. We keep finding that
[46:46] you make worst-case analyses of how
[46:48] large language models are going to
[46:49] behave and they learn much better than
[46:51] the analyses predicted. People have
[46:53] all sorts of theoretical reasons why it
[46:55] shouldn't work and yet it does.
[46:58] Okay, and then I should just return to
[46:59] this point. So, as I was saying, you
[47:01] know, as of last week, there were no
[47:02] major breakthroughs made by large
[47:04] language models. That is not true
[47:05] anymore.
[47:06] Uh 2026 has been a crazy year for code.
[47:09] Large language models have got extremely
[47:11] good at code. It has also been a crazy
[47:13] year for AI mathematics. AI, you know,
[47:16] uh as distinct from physics, for the
[47:18] conduct of research mathematics has
[47:20] greatly improved over this
[47:22] year. The large language models have
[47:23] been getting
[47:26] stronger and stronger and stronger. We
[47:28] had a result a couple of weeks ago that
[47:30] I think counts as the first major result
[47:32] from a large language model.
[47:34] Um this was a result uh solving uh
[47:37] there's a famous Hungarian mathematician
[47:39] called Erdős. One of his favorite
[47:40] problems was the unit distance
[47:43] conjecture. This was proved uh more or
[47:45] less autonomously by OpenAI's
[47:48] large language model. That has then been
[47:51] uh
[47:52] reproduced by other large language
[47:53] models since then.
[47:55] Um, it was not one of these
[47:57] problems that somebody just came up
[47:59] with. It wasn't a problem that was
[48:00] formally unsolved in the literature, but
[48:02] people just hadn't tried very hard.
[48:04] People had tried uh extremely hard.
[48:06] Uh, so Tim Gowers is a famous
[48:08] mathematician who has a Fields Medal. He said:
[48:10] Um,
[48:11] AI has now solved a major open problem.
[48:14] One of Erdős's famous questions and one
[48:16] that many mathematicians had tried.
[48:18] There is no doubt that the solution to
[48:20] the unit distance problem is a milestone
[48:21] in AI mathematics. If a human had
[48:23] written a paper and submitted to the
[48:24] Annals of Mathematics, the highest
[48:26] status journal in mathematics, and and
[48:28] I'd been asked for a quick opinion, I
[48:30] would have recommended acceptance
[48:31] without any hesitation.
[48:33] Um, no previous AI-generated proof has
[48:35] come close to that.
[48:38] Uh,
[48:39] you know, it's happening. We now have a
[48:40] major breakthrough. Do not expect this
[48:42] to be the last. In fact, I expect this
[48:44] to be the first of many. The floodgates
[48:46] will open as the strength of these
[48:47] models surpasses that which is necessary
[48:50] to start making breakthroughs like this.
[48:52] In retrospect, we could tell a story
[48:54] about the details of this problem,
[48:56] uh and indeed the details of the
[48:59] solution, and why that was playing
[49:00] particularly to the large language
[49:01] model's strengths.
[49:03] So first it will solve particularly
[49:05] friendly problems and then it'll
[49:06] continue and solve less and less
[49:07] friendly problems.
[49:10] Okay.
[49:11] Strength of large language models, I
[49:13] gave you
[49:14] the pessimistic view, in which, even with,
[49:16] you know, no further progress,
[49:17] it's already started to solve major
[49:19] problems.
[49:20] Um, the optimistic view reveals that I
[49:22] sort of
[49:23] uh was somewhat deceptive with this line
[49:25] that you probably just assumed was me
[49:27] hand drawing a line going up uh with a
[49:30] slightly poorly defined Y axis. Uh,
[49:32] actually, I took this line from
[49:33] something completely different. That
[49:35] line uh I took from chess computers.
[49:39] Um, the strength of the best chess
[49:42] computer as a function of time,
[49:44] um where the Y axis is Elo, which is
[49:47] the standard way to measure the strength
[49:48] of chess computers, uh and uh the X axis
[49:51] is the year.
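For readers unfamiliar with the Y axis: an Elo rating has a concrete meaning, because the gap between two ratings maps to an expected score (a win counts 1, a draw 0.5). The standard logistic form of that relationship is sketched below in Python; the function name and the example ratings are illustrative choices, not values taken from the slide.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B under the Elo model.

    Uses the standard logistic formula with a 400-point scale: every
    400 points of rating advantage multiplies the odds by ten.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


# Equal ratings give an even game.
print(expected_score(1500, 1500))  # 0.5

# Illustrative ratings only: a 400-point gap means roughly 10-to-1 odds.
stronger = expected_score(1900, 1500)
weaker = expected_score(1500, 1900)
print(round(stronger, 3), round(weaker, 3))
```

This is why a straight line in Elo is a strong statement: each fixed step up the axis multiplies the odds of beating the previous year's engine by a constant factor.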
[49:54] Um and what we see is that that was a
[49:56] straight line uh going up uh and it just
[49:59] kept going up.
[50:00] Um notably, um you know, it kept going
[50:03] up well past peak human. For chess
[50:05] computers, there were four eras. There
[50:06] was the toy era uh when it was
[50:08] remarkable that you got a chess computer
[50:10] to
[50:10] spit out sensible moves at all. The tool
[50:13] era, where you'd use them for special
[50:14] purpose end games or uh remembering
[50:17] openings. The centaur era, where the
[50:20] best
[50:21] chess entities in the universe were
[50:24] combinations of grandmasters playing in
[50:26] collaboration with the deep search
[50:27] afforded by chess computers. And now
[50:30] we're in the superhuman era, where if
[50:32] you have a grandmaster playing with the
[50:34] chess computer, the grandmaster should
[50:36] just sit it out and let the chess
[50:37] computer do its thing.
[50:39] Um
[50:40] there are many disanalogies between
[50:42] chess and the conduct of mathematics and
[50:44] physics.
[50:45] Uh and indeed, mathematics and physics
[50:47] is harder than chess, has a more uh
[50:49] expansive and endless list of
[50:50] possibilities. But, that's why we're
[50:51] having this conversation 30 years after
[50:53] we did for chess.
[50:55] Um I think every single one of these
[50:57] aspects
[50:59] of computer chess has a direct
[51:01] analogy in the conduct of large
[51:04] language models doing math and physics.
[51:06] The first is that at fixed overall
[51:08] strength, the computers are better than
[51:10] humans at tactics, search, speed, and
[51:14] worse at strategy or what you might call
[51:16] taste.
[51:18] Um that is a pattern that is definitely
[51:20] true for chess computers and is
[51:21] definitely a pattern we also see
[51:23] reproduced in doing science. They're
[51:25] very good at running in and applying the
[51:27] standard lemmas. They're quite bad
[51:29] at knowing what the overall direction to
[51:31] set is, though they're getting better.
[51:33] Um another feature is that training
[51:35] requires many more games than humans.
[51:37] When you're training uh
[51:39] a neural network to play chess, by the
[51:40] time it's played as many games as a
[51:42] human has played,
[51:43] it's still making essentially random
[51:44] moves.
[51:45] Uh however, it takes much less calendar
[51:48] time to train. Because they can play so
[51:50] fast and so tirelessly, after 4 days of
[51:54] playing themselves and getting better
[51:55] using reinforcement learning, these
[51:57] neural networks are far superhuman. So,
[51:59] it takes much less
[52:01] calendar time.
[52:03] And of course, you only need to train it
[52:04] once. Once you train one chess bot, you
[52:05] don't need to retrain it uh, unlike
[52:08] humans, where you need to train
[52:09] them again every time a new human
[52:11] comes along. Uh, it also just blew
[52:13] straight past peak human. It didn't
[52:14] stop. There was nothing special about
[52:15] peak human strength at chess. It just
[52:17] got better and better and better.
[52:19] Um, another interesting fact is that it
[52:21] has made humans a little bit better at
[52:23] chess. Humans playing against
[52:26] chess computers have
[52:28] learned from the computers. The best
[52:30] chess players today are better than the
[52:32] best chess players of history
[52:34] in large part because the chess
[52:35] computers being so strong have taught
[52:37] them how to play better chess. Though,
[52:39] they're still
[52:40] considerably weaker than the computers.
[52:42] Finally, it's notable that chess
[52:44] has never been more popular.
[52:46] Um,
[52:48] Okay. So, you know, let's explore this
[52:48] possibility. Um, we had
[52:54] tool. Maybe we'll see the same
[52:56] uh, for large language models doing
[52:59] science uh, and that we will have a, you
[53:01] know, fully autonomous AI scientist and
[53:04] after that an AI Einstein and after
[53:06] that uh, who knows.
[53:09] Um, you know, one feature of the
[53:12] last few years is it's not just been
[53:14] that the frontier of
[53:16] intelligence has been getting stronger
[53:17] and stronger and stronger. Another
[53:19] feature is that the cost to serve,
[53:22] the cost to produce and,
[53:26] uh,
[53:28] you know, elaborate on a fixed level of
[53:30] intelligence has been getting cheaper
[53:31] and cheaper and cheaper by many orders
[53:32] of magnitude. And this graph stops
[53:34] a couple of years ago, but those trends
[53:36] have continued since then.
[53:38] Um, so what that means is that if you
[53:39] can make one
[53:41] uh, AI Einstein, uh, you can make a
[53:43] billion of them. And we can have, you
[53:44] know, we'll have billions of superhuman
[53:47] AI
[53:48] Einsteins
[53:50] rampaging around
[53:51] making it a truly golden era for
[53:53] physics.
[53:55] What the long-term holds for physics,
[53:57] it's really hard to see. In fact, I
[53:59] think that's true across the world that
[54:02] the improvements in artificial
[54:03] intelligence are making the future
[54:05] harder to predict.
[54:06] But we can predict that the next few
[54:08] years are going to be a golden era of
[54:11] physics. We're going to take these AI
[54:13] tools, we're going to put them in the
[54:14] hands of human physicists, human
[54:16] mathematicians, and human experts. And
[54:19] together there's going to be a new
[54:20] renaissance in science and mathematics.
[54:23] It's going to be the most exciting time
[54:24] to be a physicist and the most exciting time
[54:26] to be a mathematician
[54:28] in recorded history. And all of these
[54:30] questions that have burned away at me
[54:32] for my entire career, I anticipate being
[54:34] answered in the next few years.
[54:36] Thank you.
[54:38] >> [applause]