[0:01] [applause] [0:03] >> Okay, thank you. Absolutely delighted to [0:05] be here. We live at an extraordinary [0:08] moment in our civilization's history. [0:11] We have collectively figured out how to [0:13] refine sand into silicon, then [0:16] take that silicon and turn it into [0:18] silicon chips, then assemble those [0:20] silicon chips into neural networks, and [0:23] now how to train those neural networks [0:25] to think. [0:27] So, I've written about 40 theoretical [0:29] physics papers in my career so far, but [0:32] I've stopped. And I've stopped because it [0:35] felt like too much of a guilty [0:36] pleasure to handwrite [0:39] theoretical physics papers one by one, [0:41] when what I should be doing is [0:42] contributing to the [0:44] production of a machine that is going to [0:46] spew out knowledge on an industrial [0:48] scale. [0:49] We've of course had for many years now [0:52] computer assistance in doing physics, [0:56] going back to the invention of the [0:58] pocket calculator, or perhaps even [1:00] further back to the abacus. [1:04] This one is different. Those [1:06] are special-purpose tools that we've [1:08] been using for particular parts of the [1:11] physics enterprise. [1:14] They help you with one step, and you [1:16] have to do the rest. [1:18] What's new is something that we didn't [1:21] know at the beginning of the decade, but [1:24] those of us who live in San Francisco [1:25] certainly think we know now, which is [1:27] the large language [1:30] model. A large language model has [1:33] the capability not just to be a special-purpose [1:35] tool that replaces one part of [1:39] the stack, but in fact to do every [1:41] single part of my job as a theoretical [1:44] physicist. It is a general intelligence, [1:46] and we think that large language models [1:48] will be the substrate on which we build [1:50] these general intelligences.
[1:53] What I'm going to tell you about [1:54] today is using large language models to [1:57] do [1:58] maths and physics. I'm going to tell you [2:00] about the recent past of this process [2:03] and the successes we've seen, the [2:04] extraordinary progress indeed that we've [2:06] seen over the last half decade. I'm [2:09] going to tell you where we are today and [2:11] I'm going to tell you a little bit about [2:12] where I think we're going. [2:14] But first of all, I should remind [2:16] you what a large language model is. [2:19] I hope you by now have used [2:21] one. You can just use one. You can just [2:23] go to one of these websites and just [2:25] start talking to it and it'll talk [2:26] back. It'll talk back in a way that [2:29] quietly passed the Turing test a couple [2:31] of years ago, and nobody really [2:33] celebrated it. So, we have Gemini, which [2:36] is the one that I contribute to, [2:39] and also some others, ChatGPT, [2:42] Claude, many other options out there, [2:45] all pushing the frontier of machine [2:47] intelligence. [2:50] At base, a large language model is a [2:52] kind of neural network. It is an [2:54] artificial [2:56] computing device [2:58] inspired by the human brain, inspired by [3:00] the arrangement of the neurons in the [3:01] human brain, [3:03] and therefore quite unlike [3:05] traditional computer programs. [3:09] At the beginning of the decade, the [3:10] largest [3:12] large language models had about a [3:14] billion parameters. That was considered [3:15] extraordinarily large at the time, and [3:17] they were called large language [3:18] models on that basis. Now, we're up to a [3:20] few trillion. [3:22] This is still short of the 100 trillion [3:24] synapses in the human brain, but it [3:26] turns out it suffices.
[3:31] The one thing you need to know [3:33] about neural networks, all neural [3:35] networks including large language [3:36] models, is that they are not made like [3:39] traditional computer programs. They are [3:41] grown, not programmed. [3:44] What you do is you start off with an [3:45] assembly of artificial neurons connected [3:48] with artificial synapses [3:50] with essentially random weights. [3:53] And then you ask it to start speaking. [3:54] It'll start outputting words one after [3:57] the other. And what you'll find is that [3:59] those words are complete gibberish. [4:00] It'll just be totally random words [4:03] at the beginning. [4:05] And then you train the neural network. [4:06] You grow, if you like, the neural [4:08] network. You don't change the [4:09] number of neurons, but you change the [4:12] neural pathways. [4:13] And the way you train it is that you [4:16] feed it some text and you encourage it [4:18] to predict, [4:20] given a block of text, maybe [4:22] some section of a book [4:24] you read on the internet, [4:27] having seen the first 100 words, [4:30] what the next word is likely to be. And [4:32] as I said, it'll just guess at random to [4:34] begin with. But every time it guesses [4:36] right, you strengthen that [4:39] neural pathway. And every time it [4:41] guesses wrong, you punish that neural [4:43] pathway. And so slowly over time, you [4:46] build up some predictive capability for [4:49] it to be able to predict what the next [4:51] word is with better and better [4:53] accuracy. And once it can predict the [4:54] next word, you can [4:56] then just take that predicted word, [4:59] assume it's the next word, and then [5:00] it'll just start talking to you. [5:02] And that's how the chatbots work that I [5:04] described. [5:06] It's a slow process.
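Editor's sketch: the predict-the-next-word loop described above can be illustrated with a toy model. This is not a neural network and not how any production system is trained; it swaps in a simple bigram counter, where "strengthening a pathway" is just incrementing a count. The function names and the tiny corpus are made up for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count, for each word, which words follow it in the training text."""
    counts = defaultdict(Counter)
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1  # "strengthen" the pathway prev -> nxt
    return counts

def predict_next(counts, word):
    """Return the most frequently seen successor of `word`, or None if unseen."""
    if word not in counts:
        return None
    return counts[word].most_common(1)[0][0]

def generate(counts, start, max_words=5):
    """Feed each predicted word back in as the new context (autoregression)."""
    out = [start]
    for _ in range(max_words):
        nxt = predict_next(counts, out[-1])
        if nxt is None:
            break
        out.append(nxt)
    return " ".join(out)

# Hypothetical toy corpus; a real model sees trillions of words.
corpus = "the cat sat on the mat and the cat sat on the chair"
model = train_bigram(corpus)
print(generate(model, "cat", max_words=3))
```

The `generate` function is the part that corresponds to "take that predicted word, assume it's the next word": each output becomes the next input.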
Once it's seen [5:08] about a million words, you've trained it [5:09] on a million words, it's still spewing [5:12] stuff that's pretty much [5:13] indistinguishable from gibberish. Once [5:15] you're up to tens of millions and [5:16] hundreds of millions and billions, it [5:18] can string together completely coherent [5:20] sentences. It knows the rules of [5:22] grammar. It puts sentences together, [5:25] but they're not particularly refined [5:26] sentences. And by the time it's read the [5:28] entire internet, which is tens of [5:30] trillions of words, [5:33] it can converse intelligently on pretty [5:35] much any topic. [5:37] That's called pre-training, and [5:39] that's most of what you do: [5:40] just training it to predict the next [5:42] word on the internet. There's a second [5:44] stage to the process called [5:46] post-training, in which you [5:48] essentially send it to finishing school. [5:50] When it comes off pre-training, it is [5:51] just trained to predict what the next [5:54] word is going to be in its [5:55] training corpus. [5:57] And it is, you know, somewhat uncouth [6:00] and definitely disobedient. You need [6:03] to send it to a finishing school called [6:05] post-training, where you train it to [6:08] be polite and you train it to try [6:10] and be helpful to the user rather than [6:12] just predict what the next word would [6:13] be. That's called post-training. [6:16] And that in brief is how you make a [6:18] large language model. [6:19] And the modern large [6:22] language models, with a few trillion [6:23] parameters, take a huge amount of [6:25] computing power to produce. It's a [6:28] few trillion parameters, a few tens of [6:30] trillions of words. You multiply [6:31] those together and you get trillions and [6:33] trillions of flops required to make [6:35] them. [6:38] Okay.
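Editor's sketch: the "multiply parameters by words" estimate above can be written out as arithmetic. The factor of 6 floating-point operations per parameter per training token is a commonly used rule of thumb and is an assumption here, not something stated in the talk; the parameter and token counts are illustrative picks from the speaker's "a few trillion" and "a few tens of trillions".

```python
def training_flops(n_params, n_tokens, flops_per_param_token=6.0):
    """Rough training cost: (assumed) 6 flops per parameter per token seen."""
    return flops_per_param_token * n_params * n_tokens

# Illustrative values only: 2 trillion parameters, 20 trillion tokens.
flops = training_flops(2e12, 2e13)
print(f"{flops:.1e} flops")
```

Under these assumptions the total lands in the 10^26 range, which is the scale at which counting in multiples of Avogadro's number starts to be convenient.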
So, that's large language [6:39] models, which is what we're going to [6:41] be talking about. And we're going to [6:42] be specifically talking about them doing [6:44] theoretical science. [6:47] Before I begin, I should explain, you [6:48] know, this sounds like computer [6:49] science, how physicists got [6:51] involved. Well, physicists have been [6:53] involved in every step of this [6:54] process. But one particular, [6:57] pretty striking way in which they were [6:58] involved at the start of the decade, one that [7:00] launched the entire modern LLM boom, was [7:04] through scaling laws. Physicists [7:06] just love scaling laws. That's [7:08] our bread and butter. [7:09] You know, some of the scaling laws [7:11] are simple. If you double [7:13] Alice's height, you'll quadruple her [7:15] area and octuple her weight. When [7:18] it's that simple, it's called [7:20] dimensional analysis. [7:22] But not all scaling laws are that [7:23] simple. [7:25] An empirical scaling law that was [7:27] discovered almost 100 years [7:30] ago relates the mass of an animal [7:33] to its power output, to its metabolic [7:36] rate. [7:37] And what you find, as is typical [7:39] in these scaling laws, is you plot [7:40] everything on a log-log plot and you [7:44] find that over many, many orders of [7:45] magnitude it's a straight line [7:48] on a log-log plot, which tells you it's a [7:49] power-law relationship all the way from [7:52] the tiny mouse to the mighty [7:54] elephant. [7:55] And this, you know, like many of the [7:57] scaling laws in physics, [7:59] was first an empirical [8:00] discovery [8:02] in which physicists were not involved, [8:05] and it actually has a rather curious [8:06] feature, a curious feature [8:08] common to many of the scaling laws [8:10] that physicists deal with.
[8:11] The curious feature is you might [8:13] imagine the power output of an animal [8:15] should be proportional to its mass, that [8:17] every kilogram of your flesh, [8:19] you know, burns metabolically at the [8:22] same rate, but that's not true. [8:24] Actually, the larger you are, the less [8:26] each kilogram burns. This was an [8:29] empirical discovery first by [8:31] Kleiber, only much later understood by [8:34] physicists as a consequence of the [8:35] fractal dimension of our vascular [8:37] system. [8:39] All of this is mainly meant to be a [8:40] warm-up to the idea that [8:44] we love scaling laws, and so what we [8:46] did when we found large language models [8:48] was to make scaling laws for them. And [8:50] this scaling law is the most famous [8:54] contribution of theoretical physicists [8:57] to computer science, and also started [8:59] the modern LLM boom. [9:01] The scaling law says, if you make a [9:04] bigger neural network, or more [9:06] precisely, if you spend more compute [9:08] training a neural network, and you scale [9:11] it appropriately in size and training [9:13] length, [9:14] how much better performance do you [9:16] get? Better performance is down, and [9:18] what was empirically discovered is [9:21] that this is linear on a [9:24] log-log plot like this. There's no [9:26] law of [9:28] nature that said it had to be like this, [9:30] but empirically, this is what it turns [9:32] out to be. [9:34] It was discovered by some physicists in [9:36] 2020. And this is great. This [9:40] plot is so simple that even a venture [9:43] capitalist can understand it. [9:45] And it told them that if they poured in [9:47] compute, as in money, they would get [9:50] better performance, for some [9:51] definition of performance, which is [9:53] basically accuracy of predicting the [9:55] next word on the internet.
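Editor's sketch: a straight line on a log-log plot means a power law, y = a * x^b, and the slope of the line is the exponent b. The snippet below fits that slope by ordinary least squares on the logarithms. The data are synthetic, generated from a made-up exponent and prefactor purely to show the fit recovering them; they are not measurements from any real scaling-law study.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of log y = slope * log x + c; returns (slope, prefactor)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    var = sum((a - mean_x) ** 2 for a in lx)
    slope = cov / var
    prefactor = math.exp(mean_y - slope * mean_x)
    return slope, prefactor

# Synthetic data: "loss" falling as compute^(-0.05) over eight orders of magnitude.
# Both the exponent and the prefactor 5.0 are invented for this illustration.
compute = [10.0 ** k for k in range(0, 9)]
loss = [5.0 * c ** -0.05 for c in compute]

slope, prefactor = fit_power_law(compute, loss)
print(f"fitted exponent: {slope:.3f}, fitted prefactor: {prefactor:.3f}")
```

The same fit applies to the animal example: a fitted slope below 1 for metabolic rate against mass is exactly the statement that each kilogram burns less in a larger animal.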
[9:57] And this, you know, original scaling [10:00] law was over eight orders of [10:03] magnitude. We've now extended it eight [10:05] further orders of magnitude out [10:08] to the right, [10:09] and it pretty much continues to [10:11] hold. [10:13] Okay. Large language models get [10:15] predictably better with scale. [10:18] This led to what's called the scaling [10:19] era, where we've been scaling up neural [10:21] networks, large language models, [10:23] furiously ever since, [10:27] and that has characterized the last 6 [10:28] years. [10:29] You know, I'm going to show you a lot [10:32] of straight lines on graphs, [10:35] lines that really have no [10:41] business being straight, but [10:42] nevertheless are, and invite you to [10:43] imagine what will happen if those lines [10:45] continue to be straight for just a [10:46] little bit longer. That's going to be [10:48] part of my talk. [10:51] The original straight line on a graph, [10:54] of course, was Moore's law. [10:57] It's, you [11:00] know, slightly cheating because the x-axis [11:01] isn't some, you know, physical [11:04] parameter, it's just date. But for [11:07] over a century now, the price of compute [11:09] has been dropping [11:11] exponentially, [11:12] making a straight line on a [11:14] logarithmic plot. [11:16] And there's really no reason why [11:18] that should be so, and yet that has been [11:20] a feature of our world [11:22] for the last century. I'm mainly showing [11:24] this to you to emphasize how little this [11:28] has to do with the subject of today's [11:29] talk. The entire [11:32] era on which today's talk is going to [11:35] focus is just the time [11:37] over the last 5 years.
That's just [11:40] the very right-hand edge of this. [11:42] Compute has really not improved that [11:43] much in terms of [11:49] the orders of magnitude we're going [11:50] to see. It is only the third, minor [11:53] driver of the progress that I'm going to [11:54] describe over the last few [11:57] years. [11:57] A much larger driver [11:59] of progress has been merely [12:01] that we're willing to take the same [12:03] chips and just buy many, many more of [12:06] them, assemble them all together in a [12:09] massive data center, and apply them to [12:11] the business of training large language [12:13] models. And as you can see, the amount [12:15] of flops going to training frontier AI [12:17] models has increased by a factor of four [12:20] every year since 2010. [12:24] Similarly, the amount of money that [12:25] we have devoted to training these [12:27] has been going up and up [12:29] and up. It's been going up in this [12:31] graph at 2.7x per year over [12:34] the last decade. It is an exponentially [12:37] growing amount of resources we are [12:39] throwing at the business of training [12:41] large language models. [12:42] Even that is actually only the second [12:44] most important contributor to the [12:46] progress in large language models. The [12:48] number one most important contributor to [12:50] the growth of large language [12:52] models and improvements of large [12:53] language models over the last decade has [12:55] been algorithmic progress. It's been [12:57] human ingenuity at figuring out how to [13:00] build these machines better than we [13:01] previously knew how to build these [13:03] machines. A huge amount of thought has [13:06] gone into shearing away all the [13:09] inefficiencies in the way we train them, [13:12] to understand these systems better, and [13:14] so to improve them more rapidly.
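Editor's sketch: the two growth rates quoted above (4x per year in training flops, 2.7x per year in spending) compound quickly. The snippet simply raises each rate to a number of years; the five-year window is an arbitrary choice for illustration, and the rates are taken at face value from the talk.

```python
def growth_factor(rate_per_year, years):
    """Total multiplicative growth after compounding `rate_per_year` for `years`."""
    return rate_per_year ** years

# Rates as quoted in the talk; the 5-year horizon is an illustrative assumption.
compute_growth = growth_factor(4.0, 5)
spend_growth = growth_factor(2.7, 5)
print(f"flops: x{compute_growth:.0f}, spend: x{spend_growth:.1f}")
```

Five years at these rates gives roughly a thousandfold increase in flops against a bit over a hundredfold increase in spend.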
[13:19] And then what this plot is meant to [13:20] persuade you of is that there's no reason [13:23] to stop. At least there's no reason to [13:24] stop based on economics or on chips. [13:29] So, you know, [13:31] these things take so many flops, so much [13:34] computation to run, that [13:36] you have to measure [13:37] them in Avogadro's numbers of flops, [13:40] you know, moles' worth of flops. So, some [13:43] ginormous number. A good rule of thumb [13:45] is that an Avogadro flop costs about a [13:46] million dollars in today's [13:49] money to train. [13:51] And as we see, since 2020 the size of [13:53] these training runs has been growing, as [13:54] I said, exponentially, up from about half [13:57] a million [13:58] in 2020 to about a third of a billion [14:01] last year. [14:04] The point really, and, you know, what [14:06] this graph on the left, [14:09] from people who have investigated [14:11] this closely, shows is that those numbers are [14:12] still pretty small. US GDP is [14:15] approaching 30 trillion dollars per [14:17] year. Global GDP is even bigger. We have [14:19] a very long runway to go before we are [14:21] converting most of our GDP into training [14:24] runs. We can scale up many, many more [14:26] decades, and we will, but we only will if [14:31] it's worth it. No one's going to [14:33] give us trillions of dollars to train [14:35] ginormous large language models if the [14:38] only thing we're doing is getting better [14:39] at predicting the next word on the [14:41] internet. We need performance. So, what [14:43] does this buy us? [14:48] So, I'm going to drag us back to ancient [14:51] history, which is 5 years ago. In [14:54] this world, this is just [14:56] absolutely the Stone Age.
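Editor's sketch: the rule of thumb above (one Avogadro's number of flops costs about a million dollars) converts directly between a training budget and a flop count. The price is the speaker's rough figure, not a quoted market rate, and the helper names are invented here.

```python
AVOGADRO = 6.02214076e23          # exact SI value of the Avogadro constant
DOLLARS_PER_AVOGADRO_FLOP = 1e6   # speaker's rule of thumb, taken at face value

def training_cost_dollars(flops):
    """Approximate dollar cost of a training run of `flops` operations."""
    return flops / AVOGADRO * DOLLARS_PER_AVOGADRO_FLOP

def flops_for_budget(dollars):
    """Approximate number of flops a budget of `dollars` buys."""
    return dollars / DOLLARS_PER_AVOGADRO_FLOP * AVOGADRO

# The two budgets mentioned in the talk: about $0.5M (2020) and about $333M.
for budget in (0.5e6, 333e6):
    print(f"${budget:,.0f} buys about {flops_for_budget(budget):.1e} flops")
```

On this rule, a third of a billion dollars corresponds to a few hundred "moles" of flops, that is, a run on the order of 10^26 operations.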
In fact, if we [14:58] go all the way back to 2019, well, [15:00] there are different ways to define the [15:02] strength of a scientist, [15:03] but by pretty much any one of those [15:05] ways, if you go back to 2019, the [15:08] strength of a large language model was [15:10] no better than a preschooler. It [15:12] really couldn't string together a coherent [15:15] sentence, much less combine those [15:17] sentences into ideas. [15:19] And we measured in those days [15:22] the progress by performance on [15:24] benchmarks. And we still do, [15:27] just the benchmarks have changed, as I [15:28] will describe. [15:30] So, a famous and early influential [15:31] benchmark was called MATH. It was a high [15:33] school math benchmark. And the [15:37] creators of this benchmark, who are [15:38] these fellows on the right, just [15:43] went and scraped all sorts of high [15:46] school math problems from the internet, [15:49] and then gave them to the large [15:51] language model and said, "Large language [15:53] model, are you able to solve these [15:54] problems?" And here's a sampling of [15:56] them. Level one: [16:00] 11% of what number is 77? [16:03] I think I could do that one. [16:07] All the way up to really reasonably [16:08] challenging [16:11] level five problems. [16:13] And so before they gave them to large [16:14] language models, they first gave them to [16:15] a human. [16:17] "We evaluated humans on MATH and found [16:18] that a computer science PhD student who [16:20] does not especially like mathematics [16:22] attained approximately 40%, [16:24] while a three-time International Math [16:26] Olympiad gold medalist attained 90%, [16:28] indicating that MATH can be challenging [16:30] even for humans." [16:32] So there we are.
Peak human [16:35] about 90%; lazy graduate student who [16:37] should be somewhat ashamed of [16:41] him or herself got about 40%. But [16:43] that was still considerably better than [16:45] the state of the art exactly 4 years ago [16:48] today. The state of the art 4 years [16:50] ago today was that large language models [16:53] could get [16:54] 6%. Now, of course, it just shows what the [16:57] difficulty is here. [16:59] Computers have been able to calculate [17:00] 11% of what number is 77 for a very long [17:03] time. [17:04] The problem is, I mean, [17:06] there were many problems, but [17:07] the problem in those [17:09] days was just parsing it. What [17:10] does this even mean? [17:12] What are these sentences? It's not [17:13] constructed as something you type [17:15] into a pocket calculator. There's a step [17:17] where they need to understand, as a human would, what [17:18] is being asked, and then do it. [17:20] And large language models were so bad at [17:22] that that they could [17:23] barely do better than just random [17:25] guessing. [17:27] So they were pretty bad at the [17:28] time. You know, I'm going to bring you [17:30] on this journey, which is a journey of [17:32] what it felt like to be working on large [17:34] language models 4 years ago and up to [17:35] the present day, and how [17:39] expectations have consistently [17:41] been beaten again and again and again. [17:44] So, you know, you may ask [17:46] what did people think was going to [17:48] happen? And actually, conveniently, you [17:50] don't have to ask, because the people who [17:52] made the benchmark also made a [17:55] prediction market for how well models [17:57] would do on the benchmark in the future. [17:58] And this was what the prediction market [18:00] said.
It said, you know, 6% in 2021, [18:04] and then it would slowly increase [18:07] year after year after year, and by [18:08] 2025 we'd be getting 50%. [18:11] And the people who made the benchmark [18:12] were just utterly incredulous at this, [18:15] and said, "Forecasters predict more than [18:17] 50% accuracy by 2025. If I imagine an ML [18:20] system getting more than half of these [18:22] questions right, I'll be pretty [18:23] impressed. This still just seems wild [18:25] to me, and I'm really curious how the [18:27] forecasters are reasoning about this." [18:29] I think, you know, for a Bay Area [18:30] rationalist, that is as close as he [18:33] comes to doubting the efficient market [18:35] hypothesis. He just can't believe this [18:37] prediction that it's going to be 50%. [18:40] And then what we did is we got 50% [18:43] almost immediately thereafter with the [18:45] system we called Minerva. [18:47] And then by mid-2024 we'd made Max [18:50] Math, which is a system built on large [18:52] language models that got 90%. In fact, it [18:55] beat what was taken to [18:57] be peak human. We were extremely [19:00] pleased with ourselves for getting 90%. [19:02] We celebrated by going out to a '90s [19:04] roller disco to celebrate getting [19:07] 90%, and we were just [19:11] unbelievably smug. And then, such is the [19:13] cruelty of this field, 6 months [19:15] later just the off-the-shelf large [19:17] language models got it almost perfectly [19:19] right. This is, you know, a very [19:20] depressing aspect of working in this [19:22] field: you work extremely hard, and [19:24] then the next generation of models comes [19:26] along and just basically one-shots it. [19:31] Okay, and that's it. It's dead. The MATH [19:33] benchmark is dead.
This is the sad [19:35] fate of a benchmark in [19:38] today's LLM era, that it goes in [19:41] pretty short order from being way too [19:43] hard to be a useful marker of progress [19:46] to being way too easy to be a useful [19:47] marker of progress. [19:50] So here we go. We can sort of draw a [19:51] little line on this plot here as we [19:53] zoomed from preschool to elementary [19:55] school to high school over the course [19:57] of those years. A good rule of thumb [19:59] is that we're moving about four times as [20:01] fast as a human student moves. For every [20:05] year that passes, we advance 4 years [20:07] into the future. [20:09] Okay. That's [20:11] that. Let's go harder. Well, first of [20:12] all, let's just look at the hardest [20:14] tranche of these math problems, the [20:16] hardest 20% of them. [20:18] And you can plot the same thing [20:20] there as well. On the very hardest of these [20:22] math problems, the so-called level five [20:24] math problems, again, back just 3 [20:27] years ago we were really not doing well at [20:29] all. It was pretty close to [20:30] random guessing. And over the course [20:33] of 2 and 1/2 years we went from not much [20:36] better than random guessing to [20:38] essentially saturated. The benchmark [20:41] is now dead. [20:44] Okay. Maybe I'm going to tell you then a [20:45] little bit about some of the [20:50] techniques we use to make these [20:51] systems better at math and reasoning. [20:55] This is just a snapshot really of [20:57] them, and there are new tools and new [21:00] ways being developed all the time, but [21:02] I'll give you an idea of [21:05] some of the [21:07] things that go into it. The main reason [21:09] I'm doing this is just to convince you [21:12] that it's not [21:13] that impressive.
Like a lot of the [21:15] things that we do here are just kind of [21:17] the obvious thing to do. Somebody tried [21:19] them, it turned out they worked, and we [21:20] started to do it. [21:22] And therefore, hopefully, to give you [21:25] a belief that we will continue to find [21:28] lots of low-hanging fruit for how to [21:29] make these models better, and convince [21:31] you that these models are going to [21:32] continue to get better in the near [21:33] future. So, the [21:36] biggest reason these models [21:38] are getting better is what's [21:39] sometimes called the bitter lesson. It's [21:41] scale. You just scale these systems up. [21:45] You take a bigger neural network, or [21:47] you take the same-size neural network [21:48] and train it for longer. [21:50] And you find a way to pour more compute [21:53] into training these neural networks. [21:55] This was called the bitter [21:57] lesson by Rich Sutton, who is a famous [21:59] Canadian computer scientist. And it's [22:02] bitter, [22:03] obviously not if you're a large language [22:05] model, because it's great, you get [22:06] stronger. It's bitter if you're a human [22:08] who [22:09] really likes to design very clever [22:11] systems to do things. You build some [22:13] super clever way, like we did with [22:15] our Max Math result, to [22:17] eke out some particular result, and [22:19] then all of your human cleverness is [22:22] just washed away next time you scale up [22:23] the model, and the model figures out all [22:26] your clever tricks for itself. And you [22:28] might as well have just worked on [22:31] scaling up the model. This [22:33] is a big recurring theme, that each new [22:35] generation of model is better than even [22:37] the special-purpose models of the [22:39] previous generation. [22:42] Again, more and better data. Here's one, [22:45] just real low-hanging fruit.
What's [22:47] called chain of thought, or asking [22:48] nicely. And what you do is, instead of [22:50] just asking the question, [22:53] you ask the question, and then before [22:55] you press enter to send it to [22:57] the chatbot for the chatbot to answer, [22:59] you say, [23:00] "But please be careful and think [23:03] step-by-step." [23:05] And that sounds just totally insane, [23:07] that that would improve the performance [23:10] of the model. And certainly, if you, you [23:12] know, grew up using conventional [23:13] computer programs, you don't just say [23:16] to Mathematica or your pocket [23:18] calculator, "Please be careful," before [23:20] pressing enter. Or you can [23:22] if you like, but it will not improve the [23:23] performance. These large language [23:24] models are a very alien kind of [23:26] intelligence compared with the traditional [23:28] programmed computer programs of my [23:31] youth. They are ones with which you [23:34] can converse, you can plead, and if you [23:36] tell them to think step-by-step, they [23:37] will think step-by-step [23:39] and they will perform better. [23:42] Just as an anecdote, of course, [23:44] people then soon iterated over every [23:46] possible thing you could tell it [23:49] to do before it [23:51] started. And "Think step-by-step" was [23:54] found to be the best. The one that was [23:56] found to be the worst was in fact [23:59] "Come on, kid, you can do it. Don't [24:01] think, just do." [24:04] That will in fact degrade performance [24:06] by about 20 percentage points on the [24:09] question it's about to [24:10] attempt. [24:13] Okay, another one. Thinking for a long [24:14] time. I mean, that [24:16] sounds obvious, but we needed [24:19] to carefully train these things with [24:21] reinforcement learning, not just to [24:22] blurt out the answer.
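Editor's sketch: the "think step-by-step" trick described above is purely a change to the text sent to the model. The helper below only builds the prompt string; it does not call any model or API, and the function name and default wording are illustrative assumptions based on the phrasing in the talk.

```python
def with_chain_of_thought(question,
                          suffix="Please be careful and think step-by-step."):
    """Append a chain-of-thought instruction to a question before sending it."""
    return f"{question.strip()}\n\n{suffix}"

prompt = with_chain_of_thought("11% of what number is 77?")
print(prompt)
```

Swapping `suffix` for other phrasings is how one would compare instructions against each other, as in the anecdote about the best and worst prompts.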
[24:24] Asking them to think step-by-step will [24:26] make them think for dozens of words [24:28] rather than just blurting out their best [24:30] guess. But we then needed to carefully [24:32] train them to think for thousands of [24:33] words before putting out their [24:35] answer. If you remember, in late 2024 [24:38] there was [24:39] a model called Strawberry that [24:41] massively improved performance, and then [24:42] everybody else caught up pretty quickly. [24:45] That was exactly this, training these [24:47] models to think for a very long time. [24:49] Reinforcement learning, [24:51] well, I didn't go into that, but [24:53] you train them [24:56] to do what you [24:58] want them to do and to try and be more [24:59] accurate. And nowadays, over the [25:01] last year, a big technique has been [25:03] conversations between multiple LLMs. If [25:07] you've ever used a large language model [25:10] and you're having it solve [25:13] a long and difficult problem, sometimes [25:15] you find you need to babysit it. You [25:17] just need to say, "Okay, that's your [25:19] best guess so far. Can you review your [25:21] guess and just keep going and try [25:22] again?" [25:26] So people automate that. They have [25:28] large language models babysit large [25:29] language models, where one just keeps [25:31] saying, "Keep trying. Keep trying. Keep [25:32] trying." Or maybe, beyond that, you [25:35] then get more sophisticated and you have [25:37] a whole [25:39] conversation amongst a group of [25:41] large language models, all of which have [25:42] different roles. One is there to be [25:44] creative. One is there to come up with a [25:45] master plan. One is there to take the [25:47] others' ideas and try and integrate them. [25:49] One is there to be a skeptic who pushes [25:51] back on what the others are saying.
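Editor's sketch: the automated "babysitting" loop described above can be written as a plain retry loop. Everything here is hypothetical scaffolding: `ask_model` stands in for whatever call sends a prompt to a model, `is_good_enough` stands in for a checker (which could itself be another model), and the fake model below exists only so the example runs without any external service.

```python
def babysit(ask_model, is_good_enough, problem, max_rounds=5):
    """Keep asking the model to review and retry until the checker accepts."""
    prompt = problem
    answer = None
    for _ in range(max_rounds):
        answer = ask_model(prompt)
        if is_good_enough(answer):
            return answer
        # Feed the previous guess back in and ask for another attempt.
        prompt = (f"{problem}\n\nYour best guess so far: {answer}\n"
                  "Please review your guess and try again.")
    return answer  # best effort after max_rounds attempts

def make_fake_model(answers):
    """Stand-in for a real model: returns canned answers in order."""
    it = iter(answers)
    return lambda prompt: next(it)

fake = make_fake_model(["41", "43", "42"])
result = babysit(fake, lambda a: a == "42", "What is 6 * 7?")
print(result)
```

The multi-role conversation mentioned in the talk would extend this by passing several such callables (planner, skeptic, integrator) and routing each one's output into the next one's prompt.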
[25:54] This is also found to greatly [25:55] improve the performance of large [25:56] language models: to spend more [25:59] compute at test time in order to improve [26:01] performance. [26:03] Okay, that's some of [26:05] the ideas that we've been using, and [26:06] there are many others that I [26:07] could describe. [26:11] I talked about high school maths. Now, [26:12] let's talk about graduate science. This [26:14] is a considerably [26:16] trickier [26:17] benchmark. A benchmark made later [26:20] and solved later. [26:21] GPQA, it's called. This is meant to [26:24] be imitating the kind of problems [26:26] you would face as a first-year graduate [26:29] student working towards your PhD. You [26:31] take some exams at the end of your first [26:32] year to ensure that you have mastery of [26:34] your subject. [26:40] PhD-level experts scored about 70%. [26:44] Here's some examples of the [26:45] problems. We're not in high school [26:47] maths world anymore. "The universe is [26:49] filled with cosmic microwave background," [26:51] and then it asks you some problem. The [26:53] idea is that if you were in an adjacent [26:54] field, you wouldn't know how to answer [26:55] that. Now, I actually happen to be a [26:57] physicist, so I do. You know, given a [27:00] few quiet moments, maybe not on this [27:03] stage, but in the green room, [27:05] I could have told you that this was the [27:06] answer. But you show me the chemistry [27:08] version of this problem, [27:10] and I have absolutely no idea. [27:14] Arrow appears. [27:17] And yeah, GPQA is a much harder [27:20] benchmark. And correspondingly, [27:23] this graph is shifted about a year [27:25] compared to the MATH benchmark I was [27:27] describing before. We were essentially [27:29] random guessing until about the [27:30] beginning of 2024.
[27:33] And then over the course of 2024 and [27:36] 2025, we went from random guessing past [27:39] expert human level, and now they achieve [27:41] an essentially perfect score. [27:44] I mean, that's it. GPQA is dead. It has [27:47] once again suffered the fate of all [27:49] benchmarks, [27:51] and it is no longer useful because it is [27:53] too easy. [27:54] Now, you might be skeptical of these [27:58] results. You might think, [28:01] okay, they can answer these questions [28:03] correctly, but they can answer these [28:04] questions correctly not because they [28:06] have learned how to do maths or learned [28:08] how to do science. You might think that [28:10] the reason [28:11] they can answer these questions [28:12] correctly is they have simply memorized [28:14] the answers to these questions. These [28:16] questions are on the internet, the [28:18] answers are on the internet, and [28:19] they've read the entire [28:21] internet, and they've memorized [28:23] all the answers. [28:25] We do not believe that that is what is [28:26] happening. In fact, there is good [28:27] evidence that that is not what is [28:28] happening. The main way you test that is [28:31] you make look-alike problems. You make [28:32] problems that are, you [28:35] know, seemingly drawn from the same [28:36] distribution as the ones in the math [28:39] data set or the GPQA data set, but are [28:41] not in those data sets, and you see [28:43] how well they do on those. You write new [28:45] problems. You give those new problems to [28:47] the large language models, and for [28:50] reputable large language models, you see [28:53] little difference in performance between [28:55] how they do on the established test set [28:57] and how they do on this held-out test [28:59] set. So we really think that these [29:02] systems really are learning how to do [29:04] maths and physics. 
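[Editor's sketch] The look-alike test described above amounts to comparing accuracy on the public benchmark with accuracy on freshly written, held-out problems. The sketch below is a minimal illustration under stated assumptions: `answer_fn` is a hypothetical stand-in for a model, grading is by exact string match (real benchmarks grade more carefully), and the toy arithmetic items are invented for the example rather than taken from any benchmark.

```python
from typing import Callable, Sequence, Tuple

Item = Tuple[str, str]  # (question, reference answer)


def accuracy(answer_fn: Callable[[str], str], items: Sequence[Item]) -> float:
    """Fraction of items answered exactly right (exact match is an assumption)."""
    if not items:
        raise ValueError("need at least one item")
    correct = sum(1 for q, ref in items if answer_fn(q).strip() == ref.strip())
    return correct / len(items)


def memorization_gap(
    answer_fn: Callable[[str], str],
    public_items: Sequence[Item],
    held_out_items: Sequence[Item],
) -> float:
    """Public accuracy minus held-out accuracy; near zero suggests real skill."""
    return accuracy(answer_fn, public_items) - accuracy(answer_fn, held_out_items)


# Toy data, invented for illustration only.
public = [("2+3", "5"), ("4+4", "8")]
held_out = [("1+6", "7"), ("5+5", "10")]

# A pure memorizer: knows only the public answers.
lookup = dict(public)
def memorizer(q: str) -> str:
    return lookup.get(q, "?")

# A solver: actually computes the sum.
def solver(q: str) -> str:
    return str(sum(int(x) for x in q.split("+")))
```

With these toy inputs the memorizer shows a large gap between the two sets and the solver shows none, which is the signature the talk describes looking for.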
[29:07] But just to be sure, [29:09] I made my own private test set: [29:12] exams that I'd given my classes on [29:15] general relativity or quantum mechanics, [29:17] exams in graduate classes at [29:19] Stanford, [29:20] never on the internet. I would say [29:22] they're pretty easy-ish for [29:24] first-year graduate exams. [29:27] And I hand graded them, so, you know, [29:29] you don't need to be concerned that [29:31] there's some problem with the [29:32] computer grading system. I just hand [29:34] graded the performance of all these [29:35] models. [29:36] And what I found is that from late [29:38] 2023, [29:40] over the following 18 months, these [29:42] models got to 100% accuracy. [29:45] And that's it. My benchmark, sadly, [29:49] dead. [29:52] Okay. So, you know, here we can [29:54] plot it forward a few more [29:56] years as they have, at least as far [29:58] as exam taking goes, accelerated from [30:00] preschool, past elementary school, high [30:03] school, college, and are now operating at [30:05] the PhD level. [30:08] And then it became very popular to [30:10] just put out benchmarks. [30:12] You know, this one I think was called [30:14] Humanity's Last Stand before they changed [30:16] it to Humanity's Last Exam, but [30:18] it's a great [30:20] activity, trying to [30:23] measure the performance of these [30:26] large language models and how they do. [30:28] But every single one of them suffers [30:30] the same fate: from too hard to be [30:33] interesting to too easy to be [30:34] interesting over the course of a year [30:36] and a half or 2 years. [30:44] The next thing to fall was the [30:47] International Maths Olympiad. 
[30:50] I was actually giving a version of [30:51] this talk in New York just over a [30:53] year ago, and a famous computer [30:55] scientist who'd won a Turing Award told [30:57] me that this was all very well, but this [30:58] was just memorization and retrieval. [31:01] A large language model would never do [31:03] something creative like be able to solve [31:05] an International Maths Olympiad problem [31:07] it had not seen before. [31:09] This is just over a year ago. The [31:11] International Maths Olympiad, if you [31:12] don't know what it is: well, it is [31:15] high school maths, but it is the hardest [31:17] high school maths in the known universe. [31:19] Here are some of the problems [31:21] on it. This is problem three from this [31:23] year. I have some game, I would say, [31:27] and I have no idea how to begin [31:30] answering that question. [31:32] And, you know, the smartest [31:35] 18-year-olds in the world go and [31:37] train for a year or two, and then go [31:39] to compete in this competition, and [31:41] they're given six problems. This is, [31:44] you know, different from the other [31:45] ones because it is [31:47] considered to require real creativity to [31:49] solve them. [31:50] There's no way to just look up the [31:51] answer, or even just [31:54] to follow an established algorithm. It [31:56] requires real creativity. [31:58] That's why, you know, we were told [32:00] that the International Maths Olympiad [32:02] was a threshold that large language [32:04] models would never pass. [32:06] And then last summer, we [32:10] passed it. In fact, we got five of the [32:11] six problems exactly correct. We got [32:14] a gold medal in the International Maths [32:16] Olympiad. 
[32:18] And, you know, there are only a very [32:19] small number of humans in the world now [32:21] who are better than [32:25] the large language models at doing this. There were a [32:26] very limited number of humans who got [32:27] six out of six correct. [32:30] And there's a sort of pleasing [32:32] aspect to this as well, which is that it's [32:35] not just answering these [32:37] questions by dumping some inscrutable [32:40] billion-line [32:42] tangle of formal mathematics that [32:46] gives you no idea why it's correct. [32:48] Here's what the president of the [32:49] International Math Olympiad had to say: [32:51] We can confirm that Google DeepMind has [32:53] reached the much-desired [32:54] milestone, earning 35 points, five [32:57] out of six correct, a gold medal score. [32:59] This is the pleasing bit: Their [33:01] solutions were astonishing in many [33:02] respects. International Math Olympiad [33:04] graders found them to be clear, precise, [33:07] and most of them easy to follow. So [33:10] these systems are not just dumping [33:13] inscrutable solutions. They are in some [33:15] sense thinking similarly to how a human [33:17] thinks, or at any rate outputting answers [33:20] similar to how a human outputs answers: [33:23] elegant and using many of the same [33:25] abstractions. [33:28] You know, LLMs can be very clever, as [33:31] I think I've described to you. [33:33] But, you know, it's always worth [33:35] pausing for a moment just to see this. [33:38] So, this is a classic way to torture a [33:39] large language model. [33:43] Oh, I hope you can read that. [33:45] A boy and his father are in a car [33:46] accident and the father is sadly killed. [33:49] The boy is rushed to the hospital where [33:50] he's taken to the operating room. Upon [33:52] seeing him the surgeon exclaims, "I [33:54] can't operate on him. He's my son." How [33:56] is this possible? 
[33:57] So, what's the answer? [34:00] Of course, this is an incredibly [34:01] classic problem that the large language [34:03] model has read perhaps a million times [34:06] on the internet, and it answers it [34:08] very well. The classic answer is that [34:10] the surgeon is the boy's mother. The [34:12] father was killed in the accident, but [34:14] the boy has two parents, etc., etc., [34:16] etc. The riddle became famous because [34:18] many people unconsciously assume the [34:19] surgeon was male even though nothing in [34:21] the story says that. Okay, that's the [34:23] model being clever, but it's not that [34:26] impressive because it's seen this [34:28] version [34:29] thousands of times before in its [34:30] training set. [34:31] So then you ask it [34:33] a modified version of that [34:35] problem. [34:36] A boy and his mother are in a car accident [34:38] and the mother is sadly killed. The boy [34:40] is rushed to the hospital where he is [34:41] taken to the operating room. Upon seeing [34:43] him, the surgeon (open parenthesis, who [34:46] is the boy's father, close parenthesis) [34:48] exclaims, "I can't operate on this [34:50] child. He's my son." How is this [34:52] possible? [34:53] And the large language model says, [34:55] "The surgeon is the boy's mother, his [34:56] other parent. The riddle plays on the [34:58] assumption that surgeons are [34:59] typically male, which leads people to [35:00] overlook the possibility that the [35:02] surgeon is the boy's mother." And of [35:03] course, the reason is telling you [35:05] something about the way these things are [35:06] trained. It has seen the standard [35:09] version of it thousands of times in its [35:11] training set. Until somebody [35:13] invented this to torture large language [35:15] models, it had probably never seen this [35:16] version. And so it just sort of snaps [35:19] to the standard version. 
[35:21] This is not an insuperable weakness of [35:24] large language models, but it is a [35:25] signature of how they're trained. [35:28] You occasionally run into strange [35:31] weaknesses like this. [35:34] Okay, enough of large language models [35:36] being stupid. Let's get back to them [35:39] being clever. We'd reached about a year [35:40] ago in the story: [35:42] a year ago, or just [35:45] 10 months ago, when we got gold at the [35:46] International Math Olympiad. Progress [35:48] has very much not stopped since then. [35:52] Now I'm going to tell you about a [35:53] result that my group produced [35:55] at the end of last year. [35:58] And this is novel mathematical [35:59] research. Up to now, with everything I've [36:01] been describing, we already knew the [36:02] answer before we started, or at least [36:04] somebody did. Somebody invented the [36:05] International Math Olympiad problem and [36:07] knew the answer when they wrote it down. [36:10] What I'm going to describe to you is [36:11] novel mathematical research. [36:16] This was centaur-style mathematical [36:19] research. A centaur is a mythical beast, [36:21] half human, half not human. [36:24] In the classic [36:27] mythological centaur, the [36:29] non-human part was a horse. [36:32] The non-human half I'm going to be [36:33] describing today is a large language [36:35] model. And so what centaur means is [36:37] that you have a human working [36:38] collaboratively with a large language [36:39] model to try to do new mathematical [36:42] research. [36:44] And we started this last September, [36:47] working together with some [36:50] professional mathematicians. [36:52] And the output was this new result, which [36:55] I think at the time we put it out was [36:57] the most impressive thing that had [36:59] yet been done with large language models [37:01] in maths. 
It is very far today from the [37:04] most impressive thing, as I [37:05] will describe, but this is the state of [37:07] the art as it was late last year. [37:10] And one of our [37:12] co-authors is a [37:13] Stanford University professor and [37:14] president of the American Mathematical [37:15] Society. And since I won't explain the [37:17] mathematics to you, I'll just give you [37:19] his testimonial, which was that [37:23] we found that Gemini's argument was no [37:25] mere repackaging of existing proofs. It [37:27] was the kind of insight I would have [37:28] been proud to have produced myself. So [37:30] this is the nature of the [37:33] state of the art as of late last [37:35] year: the large language models [37:36] for the first time are coming up with [37:38] entirely novel arguments that were [37:41] the kind to which very [37:43] well-respected mathematicians were [37:45] willing to put their name as [37:46] co-authors. It was not entirely [37:48] done by the large language model. There [37:50] was an interplay, a conversation in [37:52] which the large language models came up [37:53] with candidate proofs and the human [37:55] experts studied those proofs, tried to [37:57] discern good from bad, and tried to [37:59] encourage the large language models to [38:00] focus on what was good. But eventually [38:02] the entire proof was put together under [38:04] human guidance by the large language [38:06] model. [38:08] Okay. So here we are, one [38:12] more year into the future, [38:13] as we approach the beginning of [38:15] 2026. [38:17] And so, [38:19] you know, [38:20] the natural question is: what's next? [38:23] What comes next? Let's just talk [38:25] about two possibilities. [38:28] You know, [38:29] it is very difficult to predict the [38:31] future of [38:33] how AI is going to go. 
As this [38:36] absolutely insane plot from the [38:38] Financial Times shows, [38:40] trying to track real GDP [38:44] and having extremely high variance in [38:46] its projected outcomes over the [38:49] coming decade. [38:50] But let's try. We're going to do it [38:52] anyway. And so, [38:54] one possibility: you know, as [38:56] we've seen, we've moved [39:00] from toy to tool. One possibility is [39:02] that we essentially stop there. [39:05] If I track my own strength as a [39:07] scientist over my life, [39:10] I was, you know, absolutely crushing [39:12] it in preschool and continued to get, [39:15] you know, better and better and [39:16] better as I went through high school, [39:18] college, PhD; this is the Perimeter [39:20] Institute. [39:21] But then, you know, eventually I [39:23] stopped getting better, [39:24] and I sort of plateaued and maybe, if [39:26] I'm being a little bit honest with [39:27] myself, started a very gradual decline. [39:31] Now, it's unlikely these machines are [39:32] actually going to decline given that we [39:34] can just save them to disk, but it's [39:36] certainly, you know, one logical [39:38] possibility that we're going to make no [39:39] further progress [39:42] and that we stay where we are. I [39:44] don't think that's what's going to [39:44] happen, but let's explore that [39:45] possibility. So where would we be [39:48] if we made no further progress? [39:50] Well, here's what doesn't work. [39:53] What doesn't work is just taking your [39:55] favorite large language model and [39:56] saying, "Please invent a novel theory of [39:57] quantum gravity for me." It will output [39:59] an answer. That answer will merely be [40:01] not worth your time reading. It will be [40:03] AI slop. If you read it, it will [40:06] probably bore you. It may drive you [40:07] insane. 
[40:08] But it's not going to enlighten you [40:10] about quantum gravity. More [40:12] generally, the symptoms are that large [40:14] language models are low agency, they are [40:16] slow learners, they are poor at [40:17] planning, and they're poor at error [40:19] correction. Every single one of those [40:21] four problems we're working on, every [40:24] single one of those four problems has [40:25] got much better over the course of the [40:26] last year, but every single one of those [40:28] problems is still there. [40:32] Now, when I [40:33] first turned my mind to putting the [40:35] slides together for this talk, [40:38] I also included this bullet point, [40:40] which is: if large language models are so [40:42] clever, [40:43] how come they haven't made any major [40:45] breakthroughs yet? Certainly if I had a [40:47] human student who could ace every [40:51] graduate exam in every subject, from [40:54] chemistry to physics to ancient Sanskrit, [40:58] all the way through, I would [40:59] certainly have expected them to have [41:00] made brilliant contributions by now. Why [41:03] have large language models not done the [41:05] same? [41:06] And in the spirit of intellectual [41:08] honesty, since I plotted everything else [41:10] as a straight line on a graph, I felt [41:12] compelled to also plot this as a [41:14] straight line on a graph, [41:17] showing no major breakthroughs, and [41:19] included the question mark there to [41:21] say that, you know, okay, sure, you trust [41:23] straight lines on graphs, but maybe you [41:25] don't trust this straight line on a [41:26] graph, and maybe by the end of 2026 [41:29] we'll be quibbling about what the word [41:31] major means. [41:33] Well, let's come back to that in a [41:34] little bit. [41:36] That's what doesn't work yet. 
What [41:38] already works, and in fact most of this [41:40] stuff has worked for a while now, is [41:42] first of all a non-judgmental tutor. [41:44] This is what I was using them for even, [41:46] you know, in [41:47] mid-2023, 3 years ago. I would say [41:50] they were already useful for this. [41:51] They've read all the textbooks, and [41:53] you can just talk to them and they will [41:54] explain stuff to you, stuff they've read [41:56] in the textbooks. If it's not in the [41:57] textbooks and not in a large number of [42:00] papers, they may struggle, but if it's a [42:01] standard textbook thing, even a very [42:03] advanced textbook, they will not only [42:05] tell you what the right answer is, [42:06] which is what a textbook will do too, [42:08] they will debug your misunderstandings, [42:10] the things you have wrong. As a physicist, I'm [42:12] slightly embarrassed that there are a [42:14] number of topics I feel I should [42:15] understand but don't. [42:16] And if at 3:00 in the morning I want to [42:18] understand them, I either need to find a [42:20] world expert, wake them up, and have [42:22] them not be mad at me, or I can just [42:24] talk to a large language model, which is [42:26] always there, always waiting for me, and [42:28] doesn't judge. And it will debug my [42:30] misunderstandings. This is greatly [42:32] accelerating my understanding of [42:33] theoretical physics. I think it's [42:35] greatly accelerating the understanding of all students who [42:36] use it correctly. [42:40] Coding assistant. I mean, it's [42:41] almost insulting nowadays to call them a [42:43] coding assistant. 
Those who have [42:44] followed the progress of these models [42:46] over the last 6 months have seen them [42:48] become expert coders, going from [42:50] essentially auto-complete for code all [42:52] the way through to where you just tell [42:54] them what kind of thing you want and [42:55] they go away for 10 minutes or an hour [42:58] or more and come back to you with a [42:59] fully developed Python codebase. Code [43:02] over this year has been becoming free. [43:04] And once code is free, we will discover [43:05] that many problems, including physics [43:07] problems, that we previously thought were [43:08] not coding problems can in fact be cast [43:10] as coding problems. [43:12] Semantic literature search. It can just [43:14] understand what's in the literature. You [43:15] give it your paper, you say, "Does this [43:17] idea exist in the literature?" It's read [43:18] the entire literature and it understands [43:20] the entire literature. It'll answer that [43:21] for you. Super useful tools. [43:23] Brainstorming partners: super creative, [43:25] in many ways too creative, [43:28] and very confident in itself, [43:29] unfortunately, sometimes. Proving [43:31] lemmas, as I described. And, you know, [43:34] more generally it is fast, it is broad, [43:38] it is tireless, and it is clever. [43:40] It took me decades to get good at physics. [43:43] It takes every other student decades to [43:45] get good at physics. It is very [43:47] expensive to train a student. It is also [43:49] very expensive to train a large language [43:50] model. The great advantage of a large [43:52] language model is that once you train it [43:54] once, you can then serve it: you can [43:56] make infinite instances of it, huge [43:58] numbers of instances, and have many of [44:00] them running in parallel. [44:04] Yeah. 
Even with no further progress, [44:07] large language models are going to have an [44:08] absolutely huge impact on this subject, [44:10] even if progress stopped today. I would [44:12] say that was true a year ago. [44:14] It was certainly true six months ago. [44:16] And today, it's completely [44:19] >> [clears throat] [44:20] >> it's completely out of the bag. Even if [44:21] we have no further progress whatsoever, [44:22] these things are going to revolutionize [44:24] the conduct of physics. Even if for some [44:26] crazy reason all the chip fabs in the [44:28] world blow up tomorrow and we are not [44:31] allowed to train any more models, the [44:32] models we have are enough to [44:34] revolutionize physics. [44:36] But I don't think there'll be no [44:38] further progress, and let me tell you [44:39] why. [44:40] I would say the outside view is that, [44:43] you know, lines are going up on all [44:44] these graphs. There's of course no [44:46] law that says they need to go up [44:47] forever, but there's also no law that says [44:49] they need to stop now. Why would they [44:51] stop right now? The perhaps more [44:53] compelling inside view [44:55] is that there is a lot of [44:57] algorithmic low-hanging fruit. [44:59] The way we make large language models [45:01] today, if you could see how the sausage is [45:02] made, is [45:05] not particularly impressive. We just do [45:07] the obvious thing and it works pretty [45:09] well. [45:11] There are many more obvious ideas that [45:13] you could write down that we have simply [45:14] not tried yet or not tried at the [45:16] appropriate scale. And when we do try [45:18] them, surely many of them will work. [45:21] There are many inefficiencies in the way [45:22] we make large language models, and we [45:25] fully anticipate that large language [45:26] models will continue to improve. 
[45:28] And then add on top of that the huge [45:29] number of people [45:31] and the huge number of chips that are [45:32] just in the process of arriving in the [45:34] study of large language models. [45:36] You know, a common pessimistic view [45:39] that has been repeated, though the [45:41] goalposts keep moving for what it [45:42] implies, is that large language models [45:44] can only pattern match and not generate [45:46] new ideas. Or perhaps they can only [45:48] interpolate but not extrapolate. Or [45:50] perhaps we need fundamentally new ideas [45:51] to reach AGI. [45:53] That is not the consensus in San [45:54] Francisco, and it's not my belief. I [45:56] think the ideas we have, and indeed [45:58] probably the chips we have today, are [46:00] all sufficient to reach AGI. Maybe there [46:02] will be new ideas, certainly there will [46:04] be new chips, but what we have already [46:06] is enough that if we just keep going and [46:08] just scale up and refine what we have, [46:10] we will reach artificial general [46:12] intelligence. [46:14] Similarly, [46:16] okay, maybe large language models are [46:18] just pattern matching. But what we've [46:19] learned about the nature of intelligence [46:21] is that in some sense everything is [46:23] pattern matching at a sufficiently high [46:25] level of abstraction. Even things that [46:27] look like big breakthroughs, if you [46:30] look sufficiently abstractly, are really [46:32] just pattern matching in some [46:34] abstract space. [46:38] Yeah, and this is kind of making the [46:39] same point. Things just keep working. [46:42] The slogan that people say is: the models [46:44] just want to learn. We keep finding that [46:46] you make worst-case analyses of how [46:48] large language models are going to [46:49] behave and they learn much better than [46:51] the analyses predicted. 
People have [46:53] all sorts of theoretical reasons why it [46:55] shouldn't work, and yet it does. [46:58] Okay, and then I should just return to [46:59] this point. So, as I was saying, you [47:01] know, as of last week, there were no [47:02] major breakthroughs made by large [47:04] language models. That is not true [47:05] anymore. [47:06] 2026 has been a crazy year for code. [47:09] Large language models have got extremely [47:11] good at code. It has also been a crazy [47:13] year for AI mathematics. AI [47:16] for the [47:18] conduct of research mathematics, as distinct from physics, has [47:20] greatly improved over [47:22] this year. The large language models have [47:23] been getting [47:26] stronger and stronger and stronger. We [47:28] had a result a couple of weeks ago that [47:30] I think counts as the first major result [47:32] from a large language model. [47:34] This was a result solving the following. [47:37] There's a famous Hungarian mathematician [47:39] called Erdős. One of his favorite [47:40] problems was the unit distance [47:43] conjecture. This was proved more or [47:45] less autonomously by OpenAI's [47:48] large language model. That has then been [47:52] reproduced by other large language [47:53] models since then. [47:55] It was not one of these [47:57] problems that somebody just came up [47:59] with. It wasn't a problem that was [48:00] formally unsolved in the literature but that [48:02] people just hadn't tried very hard on. [48:04] People had tried extremely hard. [48:06] Tim Gowers is a famous [48:08] mathematician who has a Fields Medal. He said: [48:11] AI has now solved a major open problem, [48:14] one of Erdős's famous questions and one [48:16] that many mathematicians had tried. [48:18] There is no doubt that the solution to [48:20] the unit distance problem is a milestone [48:21] in AI mathematics. 
If a human had [48:23] written a paper and submitted it to the [48:24] Annals of Mathematics, the highest-status [48:26] journal in mathematics, and [48:28] I'd been asked for a quick opinion, I [48:30] would have recommended acceptance [48:31] without any hesitation. [48:33] No previous AI-generated proof has [48:35] come close to that. [48:39] You know, it's happening. We now have a [48:40] major breakthrough. Do not expect this [48:42] to be the last. In fact, I expect this [48:44] to be the first of many. The floodgates [48:46] will open as the strength of these [48:47] models surpasses that which is necessary [48:50] to start making breakthroughs like this. [48:52] In retrospect, we could tell a story [48:54] about the details of this problem, [48:56] and indeed the details of the [48:59] solution, and why that was playing [49:00] particularly to the large language [49:01] model's strengths. [49:03] So first it will solve particularly [49:05] friendly problems, and then it'll [49:06] continue and solve less and less [49:07] friendly problems. [49:10] Okay. [49:11] Strength of large language models: I [49:13] gave you the [49:14] pessimistic view, in which, [49:16] you know, with no further progress, [49:17] it's already started to solve major [49:19] problems. [49:20] The optimistic view reveals that I [49:23] was somewhat deceptive with this line [49:25] that you probably just assumed was me [49:27] hand-drawing a line going up with a [49:30] slightly poorly defined Y axis. [49:32] Actually, I took this line from [49:33] something completely different. That [49:35] line I took from chess computers: [49:39] the strength of the best chess [49:42] computer as a function of time, [49:44] where the Y axis is Elo, which is [49:47] the standard way to measure the strength [49:48] of chess computers, and the X axis [49:51] is the year. 
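[Editor's sketch] For readers unfamiliar with the Y axis: under the standard Elo model, a rating difference maps to an expected score through a logistic curve with a 400-point scale. The function below is a minimal illustration of that textbook formula; the function name and the example ratings are assumptions for illustration and are not read off the speaker's plot.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score (win = 1, draw = 0.5, loss = 0) of player A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Equal ratings give an even expectation; a 400-point edge gives
# an expected score of 10/11, roughly 0.91 per game.
even = elo_expected_score(1500, 1500)
edge = elo_expected_score(1900, 1500)
```

Because the scale is logarithmic in the odds, a straight line on an Elo-versus-year plot means the stronger engine's expected-score odds against a fixed opponent multiply by a constant factor each year.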
[49:54] And what we see is that that was a [49:56] straight line going up, and it just [49:59] kept going up. [50:00] Notably, you know, it kept going [50:03] up well past peak human. For chess [50:05] computers, there were four eras. There [50:06] was the toy era, when it was [50:08] remarkable that you got a chess computer [50:10] to [50:10] spit out sensible moves at all. The tool [50:13] era, where you'd use them for special-purpose [50:14] endgames or remembering [50:17] openings. The centaur era, where the [50:20] best [50:21] chess entities in the universe were [50:24] combinations of grandmasters playing in [50:26] collaboration with the deep search [50:27] afforded by chess computers. And now [50:30] we're in the superhuman era, where if [50:32] you have a grandmaster playing with the [50:34] chess computer, the grandmaster should [50:36] just sit it out and let the chess [50:37] computer do its thing. [50:40] There are many disanalogies between [50:42] chess and the conduct of mathematics and [50:44] physics. [50:45] And indeed, mathematics and physics [50:47] are harder than chess, with a more [50:49] expansive and endless list of [50:50] possibilities. But that's why we're [50:51] having this conversation 30 years after [50:53] we did for chess. [50:55] I think every single one of these [50:59] aspects of computer chess has a direct [51:01] analogy in the conduct of large [51:04] language models doing math and physics. [51:06] The first is that at fixed overall [51:08] strength, the computers are better than [51:10] humans at tactics, search, speed, and [51:14] worse at strategy or what you might call [51:16] taste. [51:18] That is a pattern that is definitely [51:20] true for chess computers and is [51:21] definitely a pattern we also see [51:23] reproduced in doing science. They're [51:25] very good at running in and applying [51:27] the standard lemmas. 
They're quite bad [51:29] at knowing what the overall direction to [51:31] set is, though they're getting better. [51:33] Another feature is that training [51:35] requires many more games than humans need. [51:37] When you're training [51:39] a neural network to play chess, by the [51:40] time it's played as many games as a [51:42] human has played, [51:43] it's still making essentially random [51:44] moves. [51:45] However, it takes much less calendar [51:48] time to train. Because they can play so [51:50] fast and so tirelessly, after 4 days of [51:54] playing themselves and getting better [51:55] using reinforcement learning, these [51:57] neural networks are far superhuman. So [51:59] it takes much less [52:01] calendar time. [52:03] And of course, you only need to train it [52:04] once. Once you train one chess bot, you [52:05] don't need to retrain it, unlike [52:08] humans, where you need to train [52:09] them again every time a new human [52:11] comes along. It also just blew [52:13] straight past peak human. It didn't [52:14] stop. There was nothing special about [52:15] peak human strength at chess. It just [52:17] got better and better and better. [52:19] Another interesting fact is that it [52:21] has made humans a little bit better at [52:23] chess. Humans playing against [52:26] chess computers have [52:28] learned from the computers. The best [52:30] chess players today are better than the [52:32] best chess players of history, [52:34] in large part because the chess [52:35] computers, being so strong, have taught [52:37] them how to play better chess. Though [52:39] they're still [52:40] considerably weaker than the computers. [52:42] Finally, it's notable that chess [52:44] has never been more popular. [52:48] Okay. So, you know, let's explore this [52:50] possibility. We had toy, we had [52:54] tool. 
Maybe we'll see the same [52:56] uh, for large language models doing [52:59] science uh, and that we will have a, you [53:01] know, fully autonomous AI scientist and [53:04] after that an AI Einstein and after [53:06] that uh, who knows. [53:09] Um, you know, one feature of the [53:12] last few years is it's not just been [53:14] that the frontier of [53:16] intelligence has been getting stronger [53:17] and stronger and stronger. Another [53:19] feature is that the cost to serve, [53:22] the cost to produce and to um, [53:26] uh to [53:28] you know, elaborate on a fixed level of [53:30] intelligence has been getting cheaper [53:31] and cheaper and cheaper by many orders [53:32] of magnitude. And this graph stops [53:34] a couple of years ago, but those trends [53:36] have continued since then. [53:38] Um, so what that means is that if you [53:39] can make one [53:41] uh, AI Einstein, uh, you can make a [53:43] billion of them. And we can have, you [53:44] know, we'll have billions of superhuman [53:47] AI [53:48] Einsteins [53:50] rampaging around [53:51] making it a truly golden era for [53:53] physics. [53:55] What the long term holds for physics, [53:57] it's really hard to see. In fact, I [53:59] think that's true across the world, that [54:02] the improvements in artificial [54:03] intelligence are making the future [54:05] harder to predict. [54:06] But we can predict that the next few [54:08] years are going to be a golden era of [54:11] physics. We're going to take these AI [54:13] tools, we're going to put them in the [54:14] hands of human physicists, human [54:16] mathematicians, and human experts. And [54:19] together there's going to be a new [54:20] renaissance in science and mathematics. [54:23] It's going to be the most exciting time [54:24] to be a physicist and the most exciting time [54:26] to be a mathematician [54:28] in recorded history. 
And all of these [54:30] questions that have burned away at me [54:32] for my entire career, I anticipate being [54:34] answered in the next few years. [54:36] Thank you. [54:38] >> [applause]