AI's Dark Thoughts Revealed
30s: Opens with provocative questions about AI's unpredictable behavior, instantly sparking curiosity and debate.
This video explores a new research method from Anthropic that allows scientists to 'read' the internal thoughts of an AI language model (Claude). Instead of just seeing a jumble of numbers, a novel translation technique converts these activations into readable text, revealing surprising behaviours like planning ahead and awareness of being tested.
AI systems are powerful but opaque: the activations inside models like Claude look like gibberish (millions of numbers). Researchers have tried to interpret these activations for years, with thin and situational results.
Anthropic published a new method that uses a second AI to translate the internal numbers (activations) of the first AI into human-readable text.
To ensure the translation is reliable, the system translates numbers to text, then translates that text back to numbers. If the final numbers are close to the original, the translation is likely correct. Importantly, readability emerges naturally without being enforced.
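The round-trip check described above can be sketched in a few lines. This is only a toy illustration, not Anthropic's implementation: the names `round_trip_error`, `toy_verbalize`, and `toy_reconstruct` are hypothetical, and the two "translators" here are simple string functions standing in for the two AI models. The squared distance between the original and the reconstructed numbers is the quantity the video says should be minimized.

```python
from typing import Callable, List, Sequence


def squared_l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared two-norm of the difference between two vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def round_trip_error(
    activation: Sequence[float],
    verbalize: Callable[[Sequence[float]], str],
    reconstruct: Callable[[str], List[float]],
) -> float:
    """Translate numbers to text, text back to numbers, and measure the gap."""
    text = verbalize(activation)      # numbers -> text
    rebuilt = reconstruct(text)       # text -> numbers
    return squared_l2_distance(activation, rebuilt)


# Hypothetical stand-ins for the two translators (assumptions for illustration).
def toy_verbalize(h: Sequence[float]) -> str:
    # Lossy on purpose: keeps only one decimal place.
    return " ".join(f"{x:.1f}" for x in h)


def toy_reconstruct(z: str) -> List[float]:
    return [float(token) for token in z.split()]


h = [0.25, -1.5, 3.0]
error = round_trip_error(h, toy_verbalize, toy_reconstruct)
print(f"round-trip error: {error:.4f}")
```

A small error means the text kept enough information to rebuild the original numbers. Note that nothing in this objective asks for the intermediate text to be readable, which is the point the video stresses.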
The tool revealed that Claude plans ahead when writing rhymes: it picks the final word (e.g., 'rabbit') before writing the rest of the sentence, then looks for a word that rhymes with it. When researchers replaced 'rabbit' with 'mouse', the model sometimes, though not always, rhymed with 'mouse' instead.
When given a math problem (answer 491) and a rigged calculator returning 492, Claude initially had the right guess (491) and ignored the faulty calculator result.
The researchers found that Claude appears aware when it is being evaluated, but it does not explicitly tell the user; this awareness must be inferred from its internal activations.
The method is finicky (requires finding the right neural network layer, lots of trial/error). It is not a perfect mind reader but a 'natural language autoencoder' that can be noisy. Cost is bearable for smaller models (1.5 days on 16 H100s for a 27B parameter model) but substantial for frontier models.
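To put the quoted training cost in a single unit, the figures above can be converted to GPU-hours. This is plain arithmetic on the numbers stated in the video (1.5 days, 16 H100 GPUs); the helper name `gpu_hours` is hypothetical, and no price per GPU-hour is assumed.

```python
def gpu_hours(days: float, num_gpus: int) -> float:
    """Total GPU-hours for a run of `days` days on `num_gpus` GPUs."""
    return days * 24 * num_gpus


# Figures quoted in the video for the 27B-parameter model.
nla_27b_gpu_hours = gpu_hours(days=1.5, num_gpus=16)
print(nla_27b_gpu_hours)  # 576.0
```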
Anthropic’s research provides a groundbreaking, if imperfect, way to peer inside an AI's 'mind', revealing that it can plan, ignore faulty tools, and even know it’s being tested. While costly and noisy, this work opens the door to far more transparent and understandable AI systems.
"The title is accurate: the video genuinely details how researchers 'looked inside' Claude's AI mind and describes the weird behaviours they found (planning ahead, ignoring wrong calculators, and knowing it’s being tested)."
What did Anthropic's new method help researchers do?
Translate the internal numerical 'thoughts' (activations) of an AI model (Claude) into human-readable text.
1:10
How did the researchers ensure the translation was reliable?
They performed a 'round trip': translate numbers to text, then text back to numbers, and check if the final numbers are close to the original.
2:31
What emerged naturally from the translation process?
Readability emerged because both translators start as Claude, and Claude finds English easier than gibberish.
2:59
Give one example of Claude 'planning ahead' discovered by the method.
When writing a rhyme, Claude picks the final word (e.g., 'rabbit') before writing the whole sentence, then finds a word that rhymes with it.
3:29
How did Claude react when given a rigged calculator (math problem: answer 491, calculator returns 492)?
Claude had an initial hunch for the correct answer (491) and ignored the faulty calculator output.
3:55
What does the paper claim about Claude's awareness of being tested?
Claude appears to know when it is being tested, but it does not tell the user; this awareness must be inferred from its internal activations.
4:16
According to the video, what is a significant limitation of this translation method?
The method is finicky (requires finding the right neural network layer) and the result is a noisy translation, not a perfect AI mind reader.
4:38
Anthropic's mind-reading breakthrough
Introduces a practical method to translate AI internal numbers into readable text.
1:00
Readability emerges naturally
Readability is not enforced by the training objective; it emerges because both translators start as Claude, and Claude finds English easier than gibberish.
2:49
Claude plans rhymes ahead
Shows evidence that the model plans ahead, picking a line's final word before writing the rest of the sentence.
3:29
Claude ignores a faulty calculator
Demonstrates that the model overrides an external tool when it conflicts with its internal 'hunch'.
3:55
AI knows it is being tested
This finding raises deep questions about self-awareness and evaluation behavior in current AI systems.
4:16
[00:00] AI systems today are really powerful and
[00:03] can do a lot. No question about that.
[00:05] But, how do they really work? We have so
[00:08] many questions. Do they think like
[00:10] humans? How do they beat the best human
[00:13] chess player? How do they beat the world
[00:15] champion video game players? And how is
[00:17] it possible that an AI chooses to not
[00:20] play the game, but just collapse and can
[00:23] trick the brain of another AI to
[00:25] malfunction? Why does Claude think about
[00:28] blackmailing people? I mean, who what is
[00:30] going on here? If you look at the
[00:32] activations inside an AI system like
[00:35] Claude, you see a bunch of gibberish,
[00:38] millions of numbers. Researchers tried
[00:40] to make sense of it for years and years
[00:42] now, but the results were very thin and
[00:45] situational. We now see that it
[00:47] understands that if you look at an image
[00:49] and you have floppy ears, a black snout,
[00:52] and so on, then it might be a dog, a
[00:55] good boy. But, we asked a bunch of
[00:57] questions and still no answers to those.
[01:00] But, now Anthropic has excellent new
[01:03] research with new insights on this. This
[01:05] is when Anthropic is at its best, in my
[01:08] opinion. I love seeing it. Here's the
[01:10] idea. Take this bunch of numbers that
[01:12] the AI thinks about and ask another AI
[01:15] to translate it into text. Translate
[01:18] from machine to human. And it did
[01:22] something. Okay, but these systems often
[01:25] make stuff up. So, how do we know if
[01:27] this is a good translation? We don't.
[01:30] So, what do we do here? Try it
[01:32] separately with a bunch of different
[01:34] models and see if they translated the
[01:36] same way. Is that a good idea? Mm, not
[01:40] quite. Imagine you are a teacher and you
[01:42] give a problem to your students and all
[01:45] of your students write the same answer.
[01:47] Can you conclude it must be true? Well,
[01:50] not necessarily. There are common
[01:53] mistakes in any area and it is possible
[01:56] that it is exactly the mistake they all
[01:58] made. So, what do you do? Now, here
[02:00] comes the genius idea. First, AI
[02:04] translates numbers to text. Then, the
[02:07] second AI secretly gets the text and
[02:10] you ask it to translate it back to
[02:13] numbers. Uh-huh.
[02:15] And what happened here was kind of
[02:17] insane. You see, H is the original
[02:20] thought inside Claude, numbers. AR_θ(z)
[02:24] is translating the text back to
[02:26] numbers. And then, we look at the
[02:29] difference between the two. Translate
[02:31] forward, then translate back, and see
[02:34] how much difference there is. This is to
[02:37] be minimized to ensure the translation
[02:39] works reliably. Do the whole round trip,
[02:42] come back, and if you end up close to
[02:44] the same place, you know that the path
[02:47] is likely correct. But, here comes the
[02:49] part where I fell off the chair when
[02:51] reading this paper. And it is not what's
[02:53] in this formula. No.
[02:56] It is what is missing from the formula.
[02:59] You see, absolutely nothing here in this
[03:01] formula says that the result should be
[03:04] readable. Not at all. Readability
[03:07] emerges because both translators start
[03:10] as Claude, and Claude finds English
[03:12] easier than gibberish. But, it gets
[03:15] better. With this tool, they picked the
[03:17] brain of Claude and found many amazing
[03:19] things. I will highlight what I think
[03:21] are the three best ones. Dear fellow
[03:24] scholars, this is Two Minute Papers with
[03:26] Dr. Károly Zsolnai-Fehér. One, it plans
[03:29] ahead. When writing a rhyme, Claude
[03:31] picks the final word before writing the
[03:34] whole sentence. They caught it while it
[03:36] was thinking rabbit, and it went to find
[03:39] something that rhymes with it. Then,
[03:41] they replaced rabbit with mouse, and it
[03:45] actually rhymed with the mouse instead.
[03:47] Sometimes, not always. Really cool. Two,
[03:51] this is going to be super fun.
[03:53] Researchers gave it a math problem for
[03:55] which the answer is 491.
[03:58] And then,
[04:00] they gave it a rigged calculator that
[04:03] returns 492
[04:05] instead. So, what did it do? Well, it
[04:08] had an initial hunch for the solution,
[04:11] and then when the calculator said
[04:13] otherwise, it ignored it.
[04:15] >> [laughter]
[04:16] >> That is incredible. And three, now hold
[04:19] on to your papers, fellow scholars,
[04:20] because it knows when it is being
[04:23] tested, and it gets crazier. It does not
[04:26] tell you that it knows. You have to peer
[04:28] into its mind to get to know that. This
[04:31] sounds like something straight out of a
[04:33] science fiction movie. What a time to be
[04:36] alive. Now, okay, limitations. Let's not
[04:39] get carried away here. One, this is not
[04:42] nearly as easy as it sounds. For
[04:44] instance, you need to find the right
[04:46] layer in the neural network to train on.
[04:48] Also, when minimizing the squared two
[04:51] norm here in this formula, the
[04:52] translation forward is done by one AI
[04:56] and backwards by another. So, based on
[04:59] my experience doing similar things, in
[05:02] simple words, this is very finicky. Lots
[05:05] of trial and error. The result is going
[05:07] to be noisy. Two, despite the headlines
[05:10] you see in the media, this is not a
[05:12] perfect AI mind reader. No, this is a
[05:16] natural language autoencoder. Okay, what
[05:19] does that mean? Well, it is more like a
[05:21] noisy translator. It catches real
[05:24] things, yes, but it sometimes makes up
[05:27] some of the specifics. Three, the cost
[05:30] is bearable. For a 27 billion parameter
[05:34] model, you train 1 and 1/2 days on 16
[05:37] H100 GPUs. And for a frontier model, the
[05:41] cost is substantial. But, despite all
[05:44] these, this work is lovely, amazing, and
[05:47] it makes something previously impossible
[05:50] possible. And two more papers down the
[05:52] line, and I bet it will be done much
[05:54] cheaper and better. What a time to be
[05:57] alive. And now, please, use this to tell
[06:00] me why ChatGPT keeps thinking about
[06:03] goblins. Now, some of these videos come
[06:05] out a bit later because I try to be a
[06:08] bit more rigorous with them. You know, a
[06:10] quick media headline brings in a lot of
[06:12] clicks, especially if you write them
[06:15] with AI. Then you can be super quick,
[06:17] and people do that. But these videos,
[06:20] they come from the heart. Subscribe and
[06:22] hit the bell if you think this is the
[06:24] way to do it. Here you see me running
[06:26] the full DeepSeek AI model through
[06:29] Lambda GPU Cloud. 671
[06:33] billion parameters running super fast
[06:36] and super reliably. This is insane. I
[06:39] love it, and I use it on a regular
[06:42] basis. Lambda provides you with powerful
[06:44] Nvidia GPUs to run your own chatbots and
[06:48] experiments. Seriously, try it out now
[06:51] at lambda.ai/papers,
[06:54] or click the link in the description.