[0:00] DeepMind's new AI just did something [0:02] amazing, or did it? You see, there was a [0:05] legendary mathematician called Paul [0:07] Erdős, fellow Hungarian, who left more [0:10] than a thousand open problems to the [0:12] world to solve. Look, we Hungarians have [0:16] a lot of problems. [0:17] We got to contribute somehow, and this [0:19] is our way of doing it. Now, DeepMind's [0:22] new AI called AlphaProof Nexus tried to [0:25] solve about 350 of them and came up with [0:29] a 95.7% [0:31] failure rate. Basically, it solved nine, [0:34] and it only cost a couple hundred [0:36] dollars per problem. Is that good? Well, [0:39] I got to say, that is incredibly super [0:42] good. Why? Well, these are decades-old [0:45] problems that were not solved by anyone [0:47] yet. The other line of criticism I hear [0:49] is that this did not do fundamentally [0:52] new things. Is that a problem? [0:55] I think not. Why? Well, let's look back [0:58] to 4 years ago. GPT-3. People said, [1:01] "Well, it can't even add numbers [1:04] together reliably." Then, 2 years ago, [1:07] people said, "Well, it can't even solve [1:09] high school competition problems [1:11] reliably." Then, 1 year ago, people [1:14] said, "It can't even win the [1:16] Mathematical Olympiad gold medal [1:18] reliably." And today, they are saying, [1:20] "Well, it can't even solve 50-year-old [1:24] unsolved problems reliably." [1:27] Do you see where this is going? It is [1:29] clear as day. Please apply the first law [1:32] of papers here. It says, "Do not look at [1:35] where we are. Look at where we will be [1:37] two more papers down the line." And this [1:39] result is absolutely amazing, stunning, [1:42] even. So, how did they do it? How is [1:45] that even possible? Dear fellow [1:47] scholars, this is Two Minute Papers with [1:49] Dr. Károly Zsolnai Fehér. Normally, you [1:51] would reach out to some AI assistant to [1:54] take a crack at it, but it won't solve [1:56] it because they hallucinate and make [1:58] things up. To avoid that, they make it [2:00] use Lean, a formalized mathematical [2:03] language where it's easy to check [2:05] whether your proofs are correct. Is this [2:07] new? Not at all. [2:09] Everyone is doing that today. Okay, so [2:12] what's new here? Look, first, a [2:14] mathematician writes down the problem in [2:16] Lean and the solution. The proof is left [2:19] blank. Then, the AI agent tries to solve [2:22] it. Of course, it fails. Too hard. Then, [2:26] another AI checks it and says, "Mhm, [2:28] this is not great." But, it also says [2:31] why it's not great. But, here's the key, [2:34] this guy right here. This is a cheaper [2:37] judge AI that reads two previous [2:39] solutions and picks a winner. Both [2:42] solutions can be wrong, but it picks the [2:45] one that is a bit better. Now, this is [2:48] genius. Why? Well, hold on to your [2:51] papers, fellow scholars, because it's [2:53] kind of like a chess system where the [2:55] solutions are the players and each of [2:58] these players gets an ELO score, also [3:01] named after Arpad Elo, fellow Hungarian. [3:04] Look, sometimes we provide solutions, [3:07] too. So, each proof now has a score. And [3:10] now, we start again. But, not from [3:13] scratch. No, no, no. We start out from [3:16] the highest scoring bad solution. So, [3:19] this is now a tournament. Do this over [3:22] and over again. So cool. And now, we [3:25] keep running and running this tournament [3:27] until the validator says, "Yep, this one [3:30] checks out." And then, we have a formal [3:33] proof. Nailed it. This is incredible [3:36] because it takes an unreliable AI, runs [3:39] it over and over again, and it can lie [3:42] its rear end off as much as it wants, [3:45] and we still get a reliable system out [3:48] of this. A reliable system built out of [3:51] unreliable parts. I love that. And the [3:54] fact that they put all this research out [3:56] there in the open for free for all of [3:59] us. [4:00] Chef's kiss. Thank you so much for [4:02] everyone who worked on this. What a time [4:04] to be alive. But wait, interestingly, [4:08] the story of AI so far has been that we [4:11] make it smarter. Now, the story has [4:13] changed. We don't need to make it [4:15] smarter, we need to make the harness [4:17] around it tighter. Give it a good judge. [4:20] Let it a thousand times and it will [4:23] slowly work out the right solution to [4:26] incredibly hard problems. So here, the [4:28] intelligence is not just in the model, [4:31] but it is in the loop around it. [4:34] Everyone is experimenting with different [4:36] kinds of loops and it is super fun. I do [4:39] it too on lambda. Okay, not even this [4:41] technique is perfect. Limitations. In [4:44] other words, the stuff that you don't [4:45] hear about in mainstream media. So one, [4:49] why not test on the full 1200 Erdős [4:52] problems? Well, there is a little [4:54] selection bias here. I think they took a [4:57] subset of 350 that was easier to [5:00] formalize. Is that a problem? In my [5:03] eyes, not at all. [5:04] You got to start somewhere. Let's not be [5:06] one of those people that say, well, it [5:09] can't even solve the 50-year-old [5:11] unsolved problems reliably. What it has [5:14] achieved is incredible. Now, two, [5:17] smaller models solved zero problems. [5:19] Zero. Nothing. You still need a beefy AI [5:23] system at the core. That is an [5:25] interesting case because people keep [5:27] showing these benchmarks where the super [5:29] fast cheap model is just a couple [5:32] percentage points away from the [5:33] frontier. And whenever I try them, they [5:36] always seem a great deal weaker. This [5:39] seems to reinforce that. Also, people [5:41] will probably start thinking, do I use a [5:44] larger model with fewer tournament [5:47] rounds or do I use a smaller one with [5:50] more? Assume that they cost the same. [5:53] Interesting question. Now, where does [5:55] this put us? Well, an AI just solved [5:58] nine math problems that no human could [6:00] crack in 56 years for a couple of [6:04] hundred dollars each, and they did it by [6:06] letting an unreliable AI fail thousands [6:09] of times against a judge that cannot [6:12] lie. And we went from can't even add [6:15] numbers to solving decades-old open [6:18] problems in the span of four years. And [6:21] I think that is insane. But, limitations [6:24] apply. Also, models used to be the only [6:27] thing that matters. Now, harnesses, [6:29] loops around them, also matter. Now, I [6:32] recently talked to Pushmeet, one of the [6:34] leaders of the project, and he's [6:36] amazing. I am just a student who loves [6:39] to travel the world and tries to learn [6:41] from incredible scientists like him and [6:44] bring that knowledge to you fellow [6:45] scholars. And it [clears throat] is a [6:47] huge honor for me to be able to talk [6:49] about it to such a super smart audience [6:51] as you fellow scholars. Subscribe and [6:54] hit the bell if you feel that this is [6:56] the way of doing it. Thank you so much [6:58] for being with me all these years and [7:00] over more than a thousand videos. We [7:03] need new tools for the era of LLMs, and [7:07] Weights & Biases now has weave, a [7:09] lightweight toolkit to confidently [7:11] iterate on LLM applications. Use traces [7:14] to debug how data flows through each [7:17] step of your app, and use evaluations to [7:19] measure your progress. It is the best. [7:22] Try it out now at wnb.me/papers, [7:27] or click the link in the description [7:29] below.