[0:00] DeepMind's new AI just did something
[0:02] amazing, or did it? You see, there was a
[0:05] legendary mathematician called Paul
[0:07] Erdős, fellow Hungarian, who left more
[0:10] than a thousand open problems to the
[0:12] world to solve. Look, we Hungarians have
[0:16] a lot of problems.
[0:17] We got to contribute somehow, and this
[0:19] is our way of doing it. Now, DeepMind's
[0:22] new AI called AlphaProof Nexus tried to
[0:25] solve about 350 of them and came up with
[0:29] a 95.7%
[0:31] failure rate. Basically, it solved nine,
[0:34] and it only cost a couple hundred
[0:36] dollars per problem. Is that good? Well,
[0:39] I got to say, that is incredibly super
[0:42] good. Why? Well, these are decades-old
[0:45] problems that were not solved by anyone
[0:47] yet. The other line of criticism I hear
[0:49] is that this did not do fundamentally
[0:52] new things. Is that a problem?
[0:55] I think not. Why? Well, let's look back
[0:58] to 4 years ago. GPT-3. People said,
[1:01] "Well, it can't even add numbers
[1:04] together reliably." Then, 2 years ago,
[1:07] people said, "Well, it can't even solve
[1:09] high school competition problems
[1:11] reliably." Then, 1 year ago, people
[1:14] said, "It can't even win the
[1:16] Mathematical Olympiad gold medal
[1:18] reliably." And today, they are saying,
[1:20] "Well, it can't even solve 50-year-old
[1:24] unsolved problems reliably."
[1:27] Do you see where this is going? It is
[1:29] clear as day. Please apply the first law
[1:32] of papers here. It says, "Do not look at
[1:35] where we are. Look at where we will be
[1:37] two more papers down the line." And this
[1:39] result is absolutely amazing, stunning,
[1:42] even. So, how did they do it? How is
[1:45] that even possible? Dear fellow
[1:47] scholars, this is Two Minute Papers with
[1:49] Dr. Károly Zsolnai Fehér. Normally, you
[1:51] would reach out to some AI assistant to
[1:54] take a crack at it, but it won't solve
[1:56] it because they hallucinate and make
[1:58] things up. To avoid that, they make it
[2:00] use Lean, a formalized mathematical
[2:03] language where it's easy to check
[2:05] whether your proofs are correct. Is this
[2:07] new? Not at all.
[2:09] Everyone is doing that today. Okay, so
[2:12] what's new here? Look, first, a
[2:14] mathematician writes down the problem in
[2:16] Lean and the solution. The proof is left
[2:19] blank. Then, the AI agent tries to solve
[2:22] it. Of course, it fails. Too hard. Then,
[2:26] another AI checks it and says, "Mhm,
[2:28] this is not great." But, it also says
[2:31] why it's not great. But, here's the key,
[2:34] this guy right here. This is a cheaper
[2:37] judge AI that reads two previous
[2:39] solutions and picks a winner. Both
[2:42] solutions can be wrong, but it picks the
[2:45] one that is a bit better. Now, this is
[2:48] genius. Why? Well, hold on to your
[2:51] papers, fellow scholars, because it's
[2:53] kind of like a chess system where the
[2:55] solutions are the players and each of
[2:58] these players gets an ELO score, also
[3:01] named after Arpad Elo, fellow Hungarian.
[3:04] Look, sometimes we provide solutions,
[3:07] too. So, each proof now has a score. And
[3:10] now, we start again. But, not from
[3:13] scratch. No, no, no. We start out from
[3:16] the highest scoring bad solution. So,
[3:19] this is now a tournament. Do this over
[3:22] and over again. So cool. And now, we
[3:25] keep running and running this tournament
[3:27] until the validator says, "Yep, this one
[3:30] checks out." And then, we have a formal
[3:33] proof. Nailed it. This is incredible
[3:36] because it takes an unreliable AI, runs
[3:39] it over and over again, and it can lie
[3:42] its rear end off as much as it wants,
[3:45] and we still get a reliable system out
[3:48] of this. A reliable system built out of
[3:51] unreliable parts. I love that. And the
[3:54] fact that they put all this research out
[3:56] there in the open for free for all of
[3:59] us.
[4:00] Chef's kiss. Thank you so much for
[4:02] everyone who worked on this. What a time
[4:04] to be alive. But wait, interestingly,
[4:08] the story of AI so far has been that we
[4:11] make it smarter. Now, the story has
[4:13] changed. We don't need to make it
[4:15] smarter, we need to make the harness
[4:17] around it tighter. Give it a good judge.
[4:20] Let it a thousand times and it will
[4:23] slowly work out the right solution to
[4:26] incredibly hard problems. So here, the
[4:28] intelligence is not just in the model,
[4:31] but it is in the loop around it.
[4:34] Everyone is experimenting with different
[4:36] kinds of loops and it is super fun. I do
[4:39] it too on lambda. Okay, not even this
[4:41] technique is perfect. Limitations. In
[4:44] other words, the stuff that you don't
[4:45] hear about in mainstream media. So one,
[4:49] why not test on the full 1200 Erdős
[4:52] problems? Well, there is a little
[4:54] selection bias here. I think they took a
[4:57] subset of 350 that was easier to
[5:00] formalize. Is that a problem? In my
[5:03] eyes, not at all.
[5:04] You got to start somewhere. Let's not be
[5:06] one of those people that say, well, it
[5:09] can't even solve the 50-year-old
[5:11] unsolved problems reliably. What it has
[5:14] achieved is incredible. Now, two,
[5:17] smaller models solved zero problems.
[5:19] Zero. Nothing. You still need a beefy AI
[5:23] system at the core. That is an
[5:25] interesting case because people keep
[5:27] showing these benchmarks where the super
[5:29] fast cheap model is just a couple
[5:32] percentage points away from the
[5:33] frontier. And whenever I try them, they
[5:36] always seem a great deal weaker. This
[5:39] seems to reinforce that. Also, people
[5:41] will probably start thinking, do I use a
[5:44] larger model with fewer tournament
[5:47] rounds or do I use a smaller one with
[5:50] more? Assume that they cost the same.
[5:53] Interesting question. Now, where does
[5:55] this put us? Well, an AI just solved
[5:58] nine math problems that no human could
[6:00] crack in 56 years for a couple of
[6:04] hundred dollars each, and they did it by
[6:06] letting an unreliable AI fail thousands
[6:09] of times against a judge that cannot
[6:12] lie. And we went from can't even add
[6:15] numbers to solving decades-old open
[6:18] problems in the span of four years. And
[6:21] I think that is insane. But, limitations
[6:24] apply. Also, models used to be the only
[6:27] thing that matters. Now, harnesses,
[6:29] loops around them, also matter. Now, I
[6:32] recently talked to Pushmeet, one of the
[6:34] leaders of the project, and he's
[6:36] amazing. I am just a student who loves
[6:39] to travel the world and tries to learn
[6:41] from incredible scientists like him and
[6:44] bring that knowledge to you fellow
[6:45] scholars. And it [clears throat] is a
[6:47] huge honor for me to be able to talk
[6:49] about it to such a super smart audience
[6:51] as you fellow scholars. Subscribe and
[6:54] hit the bell if you feel that this is
[6:56] the way of doing it. Thank you so much
[6:58] for being with me all these years and
[7:00] over more than a thousand videos. We
[7:03] need new tools for the era of LLMs, and
[7:07] Weights & Biases now has weave, a
[7:09] lightweight toolkit to confidently
[7:11] iterate on LLM applications. Use traces
[7:14] to debug how data flows through each
[7:17] step of your app, and use evaluations to
[7:19] measure your progress. It is the best.
[7:22] Try it out now at wnb.me/papers,
[7:27] or click the link in the description
[7:29] below.