---
title: 'DeepMind’s New AI Found A Strange New Way To Think'
source: 'https://youtube.com/watch?v=Dkqzqw8rxXI'
video_id: 'Dkqzqw8rxXI'
date: 2026-06-28
duration_sec: 0
---

# DeepMind’s New AI Found A Strange New Way To Think

> Source: [DeepMind’s New AI Found A Strange New Way To Think](https://youtube.com/watch?v=Dkqzqw8rxXI)

## Summary

DeepMind's AlphaProof Nexus attempted about 350 open math problems posed by Paul Erdős and solved 9, which the video describes as a 95.7% failure rate. Even so, solving decades-old open problems for a couple of hundred dollars each is presented as a remarkably good result.
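As a quick check of the arithmetic behind these figures, here is a small sketch. It assumes exactly 350 attempts, whereas the video only says "about 350", so the numbers are indicative rather than exact.

```python
# Figures quoted in the video; 350 is treated as exact here (an assumption).
attempted = 350
solved = 9

success_rate = solved / attempted
failure_rate = 1 - success_rate
print(f"success: {success_rate:.1%}, failure: {failure_rate:.1%}")

# How many solved problems a 95.7% failure rate would imply over 350 attempts.
implied_solved = round((1 - 0.957) * attempted)
print(f"95.7% failure over {attempted} attempts implies ~{implied_solved} solved")
```

With these rounded inputs the failure rate comes out near 97.4% rather than 95.7%, so the quoted percentage was presumably computed over a somewhat different count than the rounded numbers given in the video.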

### Key Points

- **AI's Performance on Unsolved Problems** [0:00] — DeepMind's AlphaProof Nexus attempted about 350 of Paul Erdős's 1,000+ open problems and solved 9, which the video describes as a 95.7% failure rate. The cost was a couple of hundred dollars per problem.
- **Criticism and Progress Over Time** [0:44] — Earlier criticisms of AI (that it could not reliably add numbers, solve high school competition problems, or win a Mathematical Olympiad gold medal) have each been overtaken in turn. Today's criticism is that it cannot reliably solve 50-year-old open problems, which itself shows how quickly the bar has moved.
- **First Law of Papers** [1:29] — Encourages focusing on future potential (two more papers down the line) rather than current limitations. Present result is described as 'absolutely amazing'.
- **How the System Works** [2:00] — Proofs are written in Lean, a formal mathematical language in which correctness is easy to check, so hallucinated steps are caught. A mathematician writes the problem and its solution in Lean and leaves the proof blank. An AI agent attempts the proof and usually fails; a second AI reviews the attempt and explains what is wrong with it. A cheaper judge AI then compares two previous attempts and picks the better one, even if both are wrong, which gives every attempt an Elo-style score as in a chess tournament. Each new round starts from the highest-scoring attempt, and the process repeats until the validator accepts a formal proof.
- **Unreliable AI, Reliable System** [3:36] — Repeated tournament rounds combined with a validator that cannot be fooled turn an unreliable AI into a reliable system. The model can be wrong most of the time and hard problems still get solved.
- **Shift from Smarter AI to Better Harness** [4:08] — Current paradigm: enhancing the 'harness' (loop) around AI, not just making AI smarter. The intelligence is in the loop, not solely in the model.
- **Limitations: Selection Bias and Model Size** [4:44] — The roughly 350 problems attempted (out of about 1,200) may have been chosen because they were easier to formalize. Smaller AI models solved 0 problems, so a large model is still needed at the core. This raises an open trade-off: at equal cost, use a larger model with fewer tournament rounds or a smaller model with more.
- **Rapid Progress and New Focus** [6:21] — In four years, AI went from being unable to add numbers reliably to solving decades-old open problems. Harnesses and loops now matter alongside models.
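
To make the "blank proof" starting point from the list above concrete, here is a minimal Lean 4 sketch. The statement is a hypothetical toy example chosen only for illustration; it is not one of the Erdős problems and does not come from the video. `sorry` is Lean's placeholder for a missing proof.

```lean
-- Hypothetical toy statement, for illustration only.
-- Starting point: the statement is formalized, the proof is left blank.
theorem toy_add_zero (n : Nat) : n + 0 = n := by
  sorry

-- A completed attempt that Lean's checker accepts in full.
theorem toy_add_zero_done (n : Nat) : n + 0 = n := by
  rfl
```

Lean compiles the first declaration only with a warning that it uses `sorry`, so it does not count as proved, while the second is checked completely. That mechanical check is what allows the surrounding system to tolerate a model that is wrong most of the time.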

### Conclusion

DeepMind's AI demonstrated remarkable progress by solving unsolved math problems through an innovative tournament system that harnesses an unreliable AI with a trustworthy judge, highlighting a shift from improving model intelligence to optimizing the surrounding harness.
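
The tournament loop can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not DeepMind's implementation: the function names (`run_tournament`, `propose`, `judge`, `validate`), the starting rating, the step size `K`, and the use of the standard Elo update are all assumptions, since the video only mentions an Elo-style score. Attempts are plain integers standing in for proof texts, and the reviewer that explains why an attempt failed is left out.

```python
import random

K = 32.0        # Elo step size; a common chess default, assumed here
START = 1000.0  # rating given to every new attempt (assumption)


def expected_score(r_a, r_b):
    """Standard Elo expected score of a player rated r_a against r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_elo(ratings, winner, loser):
    """Move rating points from loser to winner (zero-sum update)."""
    delta = K * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += delta
    ratings[loser] -= delta


def run_tournament(propose, judge, validate, seed_attempt, max_rounds=10_000):
    """Refine attempts until `validate` accepts one or the budget runs out.

    propose(best) -> new attempt derived from the top-rated one (unreliable)
    judge(a, b)   -> whichever of a, b looks better (cheap, may be wrong)
    validate(a)   -> True only for a correct attempt (trusted checker)
    """
    ratings = {seed_attempt: START}
    for round_no in range(1, max_rounds + 1):
        best = max(ratings, key=ratings.get)  # restart from the top-rated attempt
        candidate = propose(best)
        if validate(candidate):
            return candidate, round_no
        ratings.setdefault(candidate, START)
        if candidate != best:
            winner = judge(candidate, best)
            loser = best if winner == candidate else candidate
            update_elo(ratings, winner, loser)
    return None, max_rounds


# Toy stand-in: attempts are integers and the hidden correct answer is 42.
rng = random.Random(0)
TARGET = 42


def noisy_judge(a, b):
    closer, farther = sorted((a, b), key=lambda x: abs(x - TARGET))
    return closer if rng.random() < 0.8 else farther  # wrong 20% of the time


result, rounds = run_tournament(
    propose=lambda best: best + rng.randint(-3, 3),
    judge=noisy_judge,
    validate=lambda a: a == TARGET,
    seed_attempt=0,
)
print(result, rounds)
```

The point of the design is that only `validate` needs to be trustworthy. The proposer and the judge can both make mistakes, because a wrong attempt can never be returned as a result; errors only cost extra rounds. In the system described in the video, the attempts would be Lean proofs, the proposer and judge would be language models, and the validator would be the Lean checker.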

## Transcript

DeepMind's new AI just did something
amazing, or did it? You see, there was a
legendary mathematician called Paul
Erdős, fellow Hungarian, who left more
than a thousand open problems to the
world to solve. Look, we Hungarians have
a lot of problems.
We got to contribute somehow, and this
is our way of doing it. Now, DeepMind's
new AI called AlphaProof Nexus tried to
solve about 350 of them and came up with
a 95.7%
failure rate. Basically, it solved nine,
and it only cost a couple hundred
dollars per problem. Is that good? Well,
I got to say, that is incredibly super
good. Why? Well, these are decades-old
problems that were not solved by anyone
yet. The other line of criticism I hear
is that this did not do fundamentally
new things. Is that a problem?
I think not. Why? Well, let's look back
to 4 years ago. GPT-3. People said,
"Well, it can't even add numbers
together reliably." Then, 2 years ago,
people said, "Well, it can't even solve
high school competition problems
reliably." Then, 1 year ago, people
said, "It can't even win the
Mathematical Olympiad gold medal
reliably." And today, they are saying,
"Well, it can't even solve 50-year-old
unsolved problems reliably."
Do you see where this is going? It is
clear as day. Please apply the first law
of papers here. It says, "Do not look at
where we are. Look at where we will be
two more papers down the line." And this
result is absolutely amazing, stunning,
even. So, how did they do it? How is
that even possible? Dear fellow
scholars, this is Two Minute Papers with
Dr. Károly Zsolnai-Fehér. Normally, you
would reach out to some AI assistant to
take a crack at it, but it won't solve
it because they hallucinate and make
things up. To avoid that, they make it
use Lean, a formalized mathematical
language where it's easy to check
whether your proofs are correct. Is this
new? Not at all.
Everyone is doing that today. Okay, so
what's new here? Look, first, a
mathematician writes down the problem in
Lean and the solution. The proof is left
blank. Then, the AI agent tries to solve
it. Of course, it fails. Too hard. Then,
another AI checks it and says, "Mhm,
this is not great." But, it also says
why it's not great. But, here's the key,
this guy right here. This is a cheaper
judge AI that reads two previous
solutions and picks a winner. Both
solutions can be wrong, but it picks the
one that is a bit better. Now, this is
genius. Why? Well, hold on to your
papers, fellow scholars, because it's
kind of like a chess system where the
solutions are the players and each of
these players gets an Elo score, also
named after Arpad Elo, fellow Hungarian.
Look, sometimes we provide solutions,
too. So, each proof now has a score. And
now, we start again. But, not from
scratch. No, no, no. We start out from
the highest scoring bad solution. So,
this is now a tournament. Do this over
and over again. So cool. And now, we
keep running and running this tournament
until the validator says, "Yep, this one
checks out." And then, we have a formal
proof. Nailed it. This is incredible
because it takes an unreliable AI, runs
it over and over again, and it can lie
its rear end off as much as it wants,
and we still get a reliable system out
of this. A reliable system built out of
unreliable parts. I love that. And the
fact that they put all this research out
there in the open for free for all of
us.
Chef's kiss. Thank you so much for
everyone who worked on this. What a time
to be alive. But wait, interestingly,
the story of AI so far has been that we
make it smarter. Now, the story has
changed. We don't need to make it
smarter, we need to make the harness
around it tighter. Give it a good judge.
Let it try a thousand times and it will
slowly work out the right solution to
incredibly hard problems. So here, the
intelligence is not just in the model,
but it is in the loop around it.
Everyone is experimenting with different
kinds of loops and it is super fun. I do
it too on Lambda. Okay, not even this
technique is perfect. Limitations. In
other words, the stuff that you don't
hear about in mainstream media. So one,
why not test on the full 1200 Erdős
problems? Well, there is a little
selection bias here. I think they took a
subset of 350 that was easier to
formalize. Is that a problem? In my
eyes, not at all.
You got to start somewhere. Let's not be
one of those people that say, well, it
can't even solve the 50-year-old
unsolved problems reliably. What it has
achieved is incredible. Now, two,
smaller models solved zero problems.
Zero. Nothing. You still need a beefy AI
system at the core. That is an
interesting case because people keep
showing these benchmarks where the super
fast cheap model is just a couple
percentage points away from the
frontier. And whenever I try them, they
always seem a great deal weaker. This
seems to reinforce that. Also, people
will probably start thinking, do I use a
larger model with fewer tournament
rounds or do I use a smaller one with
more? Assume that they cost the same.
Interesting question. Now, where does
this put us? Well, an AI just solved
nine math problems that no human could
crack in 56 years for a couple of
hundred dollars each, and they did it by
letting an unreliable AI fail thousands
of times against a judge that cannot
lie. And we went from can't even add
numbers to solving decades-old open
problems in the span of four years. And
I think that is insane. But, limitations
apply. Also, models used to be the only
thing that matters. Now, harnesses,
loops around them, also matter. Now, I
recently talked to Pushmeet, one of the
leaders of the project, and he's
amazing. I am just a student who loves
to travel the world and tries to learn
from incredible scientists like him and
bring that knowledge to you fellow
scholars. And it is a
huge honor for me to be able to talk
about it to such a super smart audience
as you fellow scholars. Subscribe and
hit the bell if you feel that this is
the way of doing it. Thank you so much
for being with me all these years and
over more than a thousand videos. We
need new tools for the era of LLMs, and
Weights & Biases now has Weave, a
lightweight toolkit to confidently
iterate on LLM applications. Use traces
to debug how data flows through each
step of your app, and use evaluations to
measure your progress. It is the best.
Try it out now at wandb.me/papers,
or click the link in the description
below.
