AI solves 9 math problems that went unsolved for 56 years
DeepMind's AlphaProof Nexus tackled about 350 of the open problems left by Paul Erdős and solved 9, a reported failure rate of 95.7%. Despite the low success rate, solving decades-old open problems at low cost is considered a remarkably good result.
DeepMind's AI attempted about 350 of Paul Erdős's 1,000+ open problems and solved only 9, a reported failure rate of 95.7%. The cost was a couple of hundred dollars per problem.
Past criticisms of AI (it cannot reliably add numbers, solve high school competition problems, or win an Olympiad gold medal) have been refuted one by one. The current criticism is that it cannot reliably solve 50-year-old open problems, which itself shows how fast progress has been.
The First Law of Papers encourages focusing on future potential (two more papers down the line) rather than current limitations. The present result is described as 'absolutely amazing'.
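The system works in Lean, a formal mathematical language in which proofs are easy to check, so hallucinated steps are caught. A mathematician writes down the problem and the solution in Lean and leaves the proof blank. The AI agent attempts the proof and fails; a second AI reviews the attempt and explains why it falls short. A cheaper judge AI then compares two attempts and picks the better one, which turns the attempts into a tournament with Elo-style scoring. Each new round starts from the highest-scoring attempt, and the process repeats until a formal proof is validated.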
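As a rough illustration of what "the proof is left blank" can look like, here is a minimal Lean 4 sketch. The theorem is a deliberately trivial, hypothetical stand-in and not one of the Erdős problems; the theorem names are assumptions made for this example. The point is only that the statement is fixed up front, the blank is marked with `sorry`, and an attempt counts only once Lean accepts it.

```lean
-- Hypothetical toy statement, NOT one of the Erdős problems.
-- The statement is written by a human; the proof is left blank.
-- Lean flags `sorry` with a warning, so this is not yet a valid proof.
theorem toy_statement (n : Nat) : n + 0 = n := by
  sorry

-- A filled-in attempt. It counts only because Lean checks it:
-- `n + 0` reduces to `n` by definition, so `rfl` closes the goal.
theorem toy_statement_proved (n : Nat) : n + 0 = n := by
  rfl
```

An attempt that does not type-check is rejected no matter how convincing it sounds, which is why the checker can act as the final arbiter.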
The system turns an unreliable AI into a reliable system through repeated tournament rounds and a trustworthy judge, so hard problems can be solved without the AI having to be correct every time.
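The tournament idea can be sketched in a few lines of Python. This is a toy under stated assumptions, not DeepMind's implementation: a "proof" is just an integer, `propose` stands in for the unreliable AI, `judge` for the cheap pairwise judge, and `validate` for the Lean check. All of these names, the `TARGET` value, the starting rating of 1000 and the K-factor of 32 are hypothetical choices for illustration; only the Elo formula itself is standard.

```python
import random

def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of a player rated r_a against r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(winner: float, loser: float, k: float = 32.0) -> tuple[float, float]:
    """Return the new (winner, loser) ratings after one judged comparison."""
    delta = k * (1.0 - expected_score(winner, loser))
    return winner + delta, loser - delta

def run_tournament(propose, judge, validate, seed, max_rounds=10_000):
    """Keep refining the top-rated candidate until the validator accepts one.

    Returns (accepted_candidate, rounds_used), or (None, max_rounds) on failure.
    """
    ratings = {seed: 1000.0}
    for round_no in range(1, max_rounds + 1):
        best = max(ratings, key=ratings.get)  # highest-scoring failed attempt
        candidate = propose(best)             # unreliable step
        if validate(candidate):               # trusted check (stand-in for Lean)
            return candidate, round_no
        if candidate == best:
            continue                          # nothing new to compare
        ratings.setdefault(candidate, 1000.0)
        winner = judge(best, candidate)       # cheap pairwise judge
        loser = candidate if winner == best else best
        ratings[winner], ratings[loser] = elo_update(ratings[winner], ratings[loser])
    return None, max_rounds

# Toy stand-ins (hypothetical): only TARGET passes validation.
TARGET = 20
rng = random.Random(0)

def propose(best: int) -> int:
    """Randomly perturb the current best attempt."""
    return best + rng.choice([-3, -2, -1, 1, 2, 3])

def judge(a: int, b: int) -> int:
    """Pick the attempt that is closer to correct; both may still be wrong."""
    return a if abs(a - TARGET) <= abs(b - TARGET) else b

def validate(candidate: int) -> bool:
    """Exact check that cannot be talked into accepting a wrong answer."""
    return candidate == TARGET

if __name__ == "__main__":
    proof, rounds = run_tournament(propose, judge, validate, seed=0)
    print(f"validated {proof} after {rounds} rounds")
```

The proposer here is wrong almost every round, yet the loop only ever returns something the validator has accepted, which is the "reliable system from unreliable parts" idea in miniature.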
Current paradigm: enhancing the 'harness' (loop) around AI, not just making AI smarter. The intelligence is in the loop, not solely in the model.
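The subset of about 350 problems may have been chosen because it was easier to formalize. Smaller AI models solved 0 problems, which indicates that a large model is still needed at the core. An open trade-off: a larger model with fewer tournament rounds versus a smaller model with more rounds at the same cost.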
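One way to think about that trade-off is a back-of-the-envelope calculation. The sketch below assumes, purely for illustration, that attempts are independent and each succeeds with a fixed probability; the real tournament rounds build on each other, so this is a simplification, and every number here is hypothetical rather than taken from the video.

```python
def p_any_success(p_single: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent tries succeeds."""
    return 1.0 - (1.0 - p_single) ** attempts

# Hypothetical equal-cost scenario: the large model affords 100 attempts,
# the small model affords 1,000 attempts for the same money.
large = p_any_success(p_single=0.01, attempts=100)
small = p_any_success(p_single=0.0, attempts=1000)

print(f"large model: {large:.3f}")
print(f"small model: {small:.3f}")
```

Under these assumptions, a model whose per-attempt success chance is effectively zero gains nothing from extra rounds, which is consistent with the smaller models solving zero problems.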
In four years, AI went from being unable to add numbers reliably to solving decades-old open problems. Harnesses and loops now matter alongside models.
DeepMind's AI demonstrated remarkable progress by solving unsolved math problems through an innovative tournament system that harnesses an unreliable AI with a trustworthy judge, highlighting a shift from improving model intelligence to optimizing the surrounding harness.
Title accurately reflects that the AI 'found a strange new way' (tournament-based loop), though 'found' may imply a discovery rather than a designed approach.
How many of the 350 Erdős problems did DeepMind's AI solve?
9 problems.
0:34
What is the First Law of Papers?
Do not look at where we are; look at where we will be two more papers down the line.
1:32
What formal language does the AI use to avoid hallucination?
Lean.
2:03
How does the system convert an unreliable AI into a reliable one?
By running it repeatedly in a tournament against a judge that cannot lie, iteratively improving solutions until validated.
3:36
What does the system give each proposed proof?
An Elo score, like in chess.
3:00
True or false: Smaller AI models solved some problems in this study.
False (solved zero problems).
5:17
What is the 'harness' concept introduced in the video?
The intelligence is not just in the model but in the loop around it; we need to make the harness tighter rather than just the model smarter.
4:08
95.7% Failure Rate but Still Successful
Demonstrates that solving 9 out of 350 unsolved problems is considered a breakthrough because each problem was previously unsolved for decades.
0:29
First Law of Papers
Provides a mental model for evaluating AI progress, focusing on future potential rather than current flaws.
1:29
Reliable System from Unreliable Parts
Exemplifies a fundamental engineering principle that can be applied to many AI systems.
3:36
Shift from Smarter AI to Better Harness
Highlights a major paradigm shift in AI research that affects how systems are designed and evaluated.
4:08
Rapid Progress in Four Years
Shows exponential improvement in AI's mathematical capabilities, from simple arithmetic to decades-old open problems.
6:21
[00:00] DeepMind's new AI just did something
[00:02] amazing, or did it? You see, there was a
[00:05] legendary mathematician called Paul
[00:07] Erdős, fellow Hungarian, who left more
[00:10] than a thousand open problems to the
[00:12] world to solve. Look, we Hungarians have
[00:16] a lot of problems.
[00:17] We got to contribute somehow, and this
[00:19] is our way of doing it. Now, DeepMind's
[00:22] new AI called AlphaProof Nexus tried to
[00:25] solve about 350 of them and came up with
[00:29] a 95.7%
[00:31] failure rate. Basically, it solved nine,
[00:34] and it only cost a couple hundred
[00:36] dollars per problem. Is that good? Well,
[00:39] I got to say, that is incredibly super
[00:42] good. Why? Well, these are decades-old
[00:45] problems that were not solved by anyone
[00:47] yet. The other line of criticism I hear
[00:49] is that this did not do fundamentally
[00:52] new things. Is that a problem?
[00:55] I think not. Why? Well, let's look back
[00:58] to 4 years ago. GPT-3. People said,
[01:01] "Well, it can't even add numbers
[01:04] together reliably." Then, 2 years ago,
[01:07] people said, "Well, it can't even solve
[01:09] high school competition problems
[01:11] reliably." Then, 1 year ago, people
[01:14] said, "It can't even win the
[01:16] Mathematical Olympiad gold medal
[01:18] reliably." And today, they are saying,
[01:20] "Well, it can't even solve 50-year-old
[01:24] unsolved problems reliably."
[01:27] Do you see where this is going? It is
[01:29] clear as day. Please apply the first law
[01:32] of papers here. It says, "Do not look at
[01:35] where we are. Look at where we will be
[01:37] two more papers down the line." And this
[01:39] result is absolutely amazing, stunning,
[01:42] even. So, how did they do it? How is
[01:45] that even possible? Dear fellow
[01:47] scholars, this is Two Minute Papers with
[01:49] Dr. Károly Zsolnai-Fehér. Normally, you
[01:51] would reach out to some AI assistant to
[01:54] take a crack at it, but it won't solve
[01:56] it because they hallucinate and make
[01:58] things up. To avoid that, they make it
[02:00] use Lean, a formalized mathematical
[02:03] language where it's easy to check
[02:05] whether your proofs are correct. Is this
[02:07] new? Not at all.
[02:09] Everyone is doing that today. Okay, so
[02:12] what's new here? Look, first, a
[02:14] mathematician writes down the problem in
[02:16] Lean and the solution. The proof is left
[02:19] blank. Then, the AI agent tries to solve
[02:22] it. Of course, it fails. Too hard. Then,
[02:26] another AI checks it and says, "Mhm,
[02:28] this is not great." But, it also says
[02:31] why it's not great. But, here's the key,
[02:34] this guy right here. This is a cheaper
[02:37] judge AI that reads two previous
[02:39] solutions and picks a winner. Both
[02:42] solutions can be wrong, but it picks the
[02:45] one that is a bit better. Now, this is
[02:48] genius. Why? Well, hold on to your
[02:51] papers, fellow scholars, because it's
[02:53] kind of like a chess system where the
[02:55] solutions are the players and each of
[02:58] these players gets an Elo score, also
[03:01] named after Arpad Elo, fellow Hungarian.
[03:04] Look, sometimes we provide solutions,
[03:07] too. So, each proof now has a score. And
[03:10] now, we start again. But, not from
[03:13] scratch. No, no, no. We start out from
[03:16] the highest scoring bad solution. So,
[03:19] this is now a tournament. Do this over
[03:22] and over again. So cool. And now, we
[03:25] keep running and running this tournament
[03:27] until the validator says, "Yep, this one
[03:30] checks out." And then, we have a formal
[03:33] proof. Nailed it. This is incredible
[03:36] because it takes an unreliable AI, runs
[03:39] it over and over again, and it can lie
[03:42] its rear end off as much as it wants,
[03:45] and we still get a reliable system out
[03:48] of this. A reliable system built out of
[03:51] unreliable parts. I love that. And the
[03:54] fact that they put all this research out
[03:56] there in the open for free for all of
[03:59] us.
[04:00] Chef's kiss. Thank you so much for
[04:02] everyone who worked on this. What a time
[04:04] to be alive. But wait, interestingly,
[04:08] the story of AI so far has been that we
[04:11] make it smarter. Now, the story has
[04:13] changed. We don't need to make it
[04:15] smarter, we need to make the harness
[04:17] around it tighter. Give it a good judge.
[04:20] Let it try a thousand times and it will
[04:23] slowly work out the right solution to
[04:26] incredibly hard problems. So here, the
[04:28] intelligence is not just in the model,
[04:31] but it is in the loop around it.
[04:34] Everyone is experimenting with different
[04:36] kinds of loops and it is super fun. I do
[04:39] it too on Lambda. Okay, not even this
[04:41] technique is perfect. Limitations. In
[04:44] other words, the stuff that you don't
[04:45] hear about in mainstream media. So one,
[04:49] why not test on the full 1200 Erdős
[04:52] problems? Well, there is a little
[04:54] selection bias here. I think they took a
[04:57] subset of 350 that was easier to
[05:00] formalize. Is that a problem? In my
[05:03] eyes, not at all.
[05:04] You got to start somewhere. Let's not be
[05:06] one of those people that say, well, it
[05:09] can't even solve the 50-year-old
[05:11] unsolved problems reliably. What it has
[05:14] achieved is incredible. Now, two,
[05:17] smaller models solved zero problems.
[05:19] Zero. Nothing. You still need a beefy AI
[05:23] system at the core. That is an
[05:25] interesting case because people keep
[05:27] showing these benchmarks where the super
[05:29] fast cheap model is just a couple
[05:32] percentage points away from the
[05:33] frontier. And whenever I try them, they
[05:36] always seem a great deal weaker. This
[05:39] seems to reinforce that. Also, people
[05:41] will probably start thinking, do I use a
[05:44] larger model with fewer tournament
[05:47] rounds or do I use a smaller one with
[05:50] more? Assume that they cost the same.
[05:53] Interesting question. Now, where does
[05:55] this put us? Well, an AI just solved
[05:58] nine math problems that no human could
[06:00] crack in 56 years for a couple of
[06:04] hundred dollars each, and they did it by
[06:06] letting an unreliable AI fail thousands
[06:09] of times against a judge that cannot
[06:12] lie. And we went from can't even add
[06:15] numbers to solving decades-old open
[06:18] problems in the span of four years. And
[06:21] I think that is insane. But, limitations
[06:24] apply. Also, models used to be the only
[06:27] thing that matters. Now, harnesses,
[06:29] loops around them, also matter. Now, I
[06:32] recently talked to Pushmeet, one of the
[06:34] leaders of the project, and he's
[06:36] amazing. I am just a student who loves
[06:39] to travel the world and tries to learn
[06:41] from incredible scientists like him and
[06:44] bring that knowledge to you fellow
[06:45] scholars. And it is a
[06:47] huge honor for me to be able to talk
[06:49] about it to such a super smart audience
[06:51] as you fellow scholars. Subscribe and
[06:54] hit the bell if you feel that this is
[06:56] the way of doing it. Thank you so much
[06:58] for being with me all these years and
[07:00] over more than a thousand videos. We
[07:03] need new tools for the era of LLMs, and
[07:07] Weights & Biases now has Weave, a
[07:09] lightweight toolkit to confidently
[07:11] iterate on LLM applications. Use traces
[07:14] to debug how data flows through each
[07:17] step of your app, and use evaluations to
[07:19] measure your progress. It is the best.
[07:22] Try it out now at wandb.me/papers,
[07:27] or click the link in the description
[07:29] below.