---
title: 'They Looked Inside Claude’s AI''s Mind. It Got Weird'
source: 'https://youtube.com/watch?v=l72ufA-4SzE'
video_id: 'l72ufA-4SzE'
date: 2026-07-01
duration_sec: 416
---

# They Looked Inside Claude’s AI's Mind. It Got Weird

> Source: [They Looked Inside Claude’s AI's Mind. It Got Weird](https://youtube.com/watch?v=l72ufA-4SzE)

## Summary

This video explores a new research method from Anthropic that lets researchers 'read' the internal activations of an AI language model (Claude). Rather than leaving them as a jumble of numbers, the technique uses a second AI to translate these activations into readable text, revealing surprising behaviours such as planning ahead and awareness of being tested.

### Key Points

- **The black box of AI** [0:00] — AI systems are powerful but opaque – we see gibberish (millions of numbers) inside models like Claude. Researchers have struggled to interpret these activations.
- **Anthropic's new research** [1:00] — Anthropic published a new method that uses a second AI to translate the internal numbers (activations) of the first AI into human-readable text.
- **The 'round-trip' idea** [2:00] — To check that the translation is reliable, one AI translates the activations to text, then a second AI translates that text back to activations. If the reconstructed numbers are close to the originals, the translation is likely correct. Notably, nothing in the training objective requires the text to be readable; readability emerges because both translators start as Claude, which finds English easier than gibberish.
- **Finding #1: Planning ahead** [3:29] — The tool revealed that Claude plans ahead when writing rhymes: it picks the final word before writing the rest of the sentence. Researchers caught it thinking 'rabbit' while working toward a rhyme; when they replaced 'rabbit' with 'mouse', the model sometimes rhymed with 'mouse' instead.
- **Finding #2: Ignoring wrong answers** [3:55] — When given a math problem (answer 491) and a rigged calculator returning 492, Claude initially had the right guess (491) and ignored the faulty calculator result.
- **Finding #3: Knowing it's being tested** [4:16] — The researchers found that Claude appears aware when it is being evaluated, but it does not explicitly tell the user; this awareness must be inferred from its internal activations.
- **Limitations** [4:38] — The method is finicky (it requires finding the right neural network layer and a lot of trial and error). It is not a perfect mind reader but a 'natural language autoencoder': a noisy translator that catches real things but sometimes makes up specifics. Cost is bearable for smaller models (1.5 days on 16 H100s, about 576 GPU-hours, for a 27B parameter model) but substantial for frontier models.
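
The round-trip check described above can be illustrated with a small sketch. This is not Anthropic's implementation: `toy_verbalize` and `toy_reconstruct` are hypothetical stand-ins for the two translator models, and they simply round numbers to text and parse them back. Only the scoring step, the squared 2-norm of the difference between the original activation `h` and its reconstruction, mirrors the quantity the video says is minimized.

```python
from typing import Callable, List, Sequence


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared 2-norm of the difference between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b))


def round_trip_error(
    h: Sequence[float],
    verbalize: Callable[[Sequence[float]], str],
    reconstruct: Callable[[str], List[float]],
) -> float:
    """Translate h to text, translate the text back, and score the gap."""
    z = verbalize(h)          # numbers -> text
    h_hat = reconstruct(z)    # text -> numbers
    return squared_l2(h, h_hat)


# Hypothetical toy translators (assumption: real ones are language models).
def toy_verbalize(h: Sequence[float]) -> str:
    # "Text" here is just the values rounded to one decimal place.
    return " ".join(f"{x:.1f}" for x in h)


def toy_reconstruct(z: str) -> List[float]:
    return [float(token) for token in z.split()]


def bad_reconstruct(z: str) -> List[float]:
    # Ignores the content of the text entirely; only keeps its length.
    return [0.0 for _ in z.split()]


h = [0.12, -1.5, 2.0]
good_error = round_trip_error(h, toy_verbalize, toy_reconstruct)
bad_error = round_trip_error(h, toy_verbalize, bad_reconstruct)
print(f"faithful round trip: {good_error:.4f}")
print(f"uninformative round trip: {bad_error:.4f}")
```

A small error means the text carried enough information to recover the activation; a large one means the translation lost or invented something. Note that the score itself never looks at whether the text is readable, which matches the point made in the video.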

### Conclusion

Anthropic’s research provides a groundbreaking, if imperfect, way to peer inside an AI's 'mind', revealing that it can plan, ignore faulty tools, and even know it’s being tested. While costly and noisy, this work opens the door to far more transparent and understandable AI systems.

## Transcript

AI systems today are really powerful and
can do a lot. No question about that.
But, how do they really work? We have so
many questions. Do they think like
humans? How do they beat the best human
chess player? How do they beat the world
champion video game players? And how is
it possible that an AI chooses to not
play the game, but just collapse and can
trick the brain of another AI to
malfunction? Why does Claude think about
blackmailing people? I mean, what is
going on here? If you look at the
activations inside an AI system like
Claude, you see a bunch of gibberish,
millions of numbers. Researchers tried
to make sense of it for years and years
now, but the results were very thin and
situational. We now see that it
understands that if you look at an image
and you have floppy ears, a black snout,
and so on, then it might be a dog, a
good boy. But, we asked a bunch of
questions and still no answers to those.
But, now Anthropic has excellent new
research with new insights on this. This
is when Anthropic is at its best, in my
opinion. I love seeing it. Here's the
idea. Take this bunch of numbers that
the AI thinks about and ask another AI
to translate it into text. Translate
from machine to human. And it did
something. Okay, but these systems often
make stuff up. So, how do we know if
this is a good translation? We don't.
So, what do we do here? Try it
separately with a bunch of different
models and see if they translated the
same way. Is that a good idea? Mm, not
quite. Imagine you are a teacher and you
give a problem to your students and all
of your students write the same answer.
Can you conclude it must be true? Well,
not necessarily. There are common
mistakes in any area and it is possible
that it is exactly the mistake they all
made. So, what do you do? Now, here
comes the genius idea. The first AI
translates numbers to text. Then, the
second AI secretly gets the text and
you ask it to translate it back to
numbers. Uh-huh.
And what happened here was kind of
insane. You see, h is the original
thought inside Claude, as numbers. AR_θ(z)
is translating the text back to
numbers. And then, we look at the
difference between the two. Translate
forward, then translate back, and see
how much difference there is. This is to
be minimized to ensure the translation
works reliably. Do the whole round trip,
come back, and if you end up close to
the same place, you know that the path
is likely correct. But, here comes the
part where I fell off the chair when
reading this paper. And it is not what's
in this formula. No.
It is what is missing from the formula.
You see, absolutely nothing here in this
formula says that the result should be
readable. Not at all. Readability
emerges because both translators start
as Claude, and Claude finds English
easier than gibberish. But, it gets
better. With this tool, they picked the
brain of Claude and found many amazing
things. I will highlight what I think
are the three best ones. Dear fellow
scholars, this is Two Minute Papers with
Dr. Károly Zsolnai-Fehér. One, it plans
ahead. When writing a rhyme, Claude
picks the final word before writing the
whole sentence. They caught it while it
was thinking rabbit, and it went to find
something that rhymes with it. Then,
they replaced rabbit with mouse, and it
actually rhymed with the mouse instead.
Sometimes, not always. Really cool. Two,
this is going to be super fun.
Researchers gave it a math problem for
which the answer is 491.
And then,
they gave it a rigged calculator that
returns 492
instead. So, what did it do? Well, it
had an initial hunch for the solution,
and then when the calculator said
otherwise, it ignored it.
[laughter]
That is incredible. And three, now hold
on to your papers, fellow scholars,
because it knows when it is being
tested, and it gets crazier. It does not
tell you that it knows. You have to peer
into its mind to get to know that. This
sounds like something straight out of a
science fiction movie. What a time to be
alive. Now, okay, limitations. Let's not
get carried away here. One, this is not
nearly as easy as it sounds. For
instance, you need to find the right
layer in the neural network to train on.
Also, when minimizing the squared 2-norm
here in this formula, the
translation forward is done by one AI
and backwards by another. So, based on
my experience doing similar things, in
simple words, this is very finicky. Lots
of trial and error. The result is going
to be noisy. Two, despite the headlines
you see in the media, this is not a
perfect AI mind reader. No, this is a
natural language autoencoder. Okay, what
does that mean? Well, it is more like a
noisy translator. It catches real
things, yes, but it sometimes makes up
some of the specifics. Three, the cost
is bearable. For a 27 billion parameter
model, you train 1 and 1/2 days on 16
H100 GPUs. And for a frontier model, the
cost is substantial. But, despite all
these, this work is lovely, amazing, and
it makes something previously impossible
possible. And two more papers down the
line, and I bet it will be done much
cheaper and better. What a time to be
alive. And now, please, use this to tell
me why ChatGPT keeps thinking about
goblins. Now, some of these videos come
out a bit later because I try to be a
bit more rigorous with them. You know, a
quick media headline brings in a lot of
clicks, especially if you write them
with AI. Then you can be super quick,
and people do that. But these videos,
they come from the heart. Subscribe and
hit the bell if you think this is the
way to do it. Here you see me running
the full DeepSeek AI model through
Lambda GPU Cloud. 671
billion parameters running super fast
and super reliably. This is insane. I
love it, and I use it on a regular
basis. Lambda provides you with powerful
Nvidia GPUs to run your own chatbots and
experiments. Seriously, try it out now
at lambda.ai/papers,
or click the link in the description.
