We’re Finally Starting to Understand How AI Works
A recent study by Anthropic offers a glimpse into the AI black box
Ever since I started learning about, building, and working with AI, there has always been a component we in the tech world refer to as the black box: an element that can be, to some extent, unpredictable.
Chances are, many of us have spent time analyzing outputs, tweaking training data, and digging into attention patterns. Still, a large part of the AI's decision-making process has remained hidden.
At least, that was the case until a few weeks ago.
In a recent study titled "Tracing Thoughts in Language Models," researchers at Anthropic claim they’ve caught a glimpse inside the mind of their AI, Claude, and observed it thinking. Using a technique they compare to an “AI microscope,” they were able to trace Claude’s internal reasoning steps with an unprecedented level of detail.
The findings are both fascinating and a bit unsettling.
Claude appears to break tasks down into understandable subproblems, plan its responses several words ahead, and even generate false reasoning when it feels cornered—what we commonly call hallucinations.
It’s not quite what we thought or expected.
There’s more going on behind the scenes of AI response generation than our intuition might suggest. In that sense, the study released by Anthropic suggests that these systems may have far more structured thought processes than we previously imagined.
A Universal “Language of Thought”
One of the first questions the team asked was: How is Claude so fluent across so many languages? Does it have separate “brains” for English, French, Chinese, and so on, or is there a shared core?
The evidence strongly supports the latter.

Based on their findings, Anthropic discovered that Claude activates the same internal concepts for equivalent ideas across different languages.
For instance, when asked for “the opposite of small” in multiple languages, the model didn’t take a completely different path for each language. Instead, it relied on a shared understanding of “smallness,” the concept of “opposite,” and the idea of “largeness,” before finally translating that idea into “large” in English, 大 in Chinese, or “grand” in French.
In other words, Claude appears to operate in an abstract, language-independent space, thinking in concepts first, and only then expressing the response in the target language. This suggests that large language models may be developing a kind of universal conceptual framework, almost like an interlingual mental language that bridges human languages.
What’s more, this interlingual mapping becomes even stronger in larger models. Claude 3.5, for example, showed more than twice the proportion of shared internal features between English and French compared to a smaller model.
That means as these models scale up, they increasingly converge on the same internal “language of thought,” even when dealing with completely different human languages.
Pretty amazing.
Some researchers have seen similar patterns in smaller models, but now it’s clearer than ever in Claude.
For multilingual AI applications, this is especially promising. It means that once an AI learns a concept in one language, it can apply it in another, much like a polyglot who picks up an idea and naturally expresses it in whatever language fits the context best.
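To make the idea of “shared internal features” a bit more concrete, here is a minimal toy sketch in Python. It is not Anthropic’s tooling: the activation vectors below are random placeholders rather than real model states, and the metric simply counts how many of the most strongly active features two prompts have in common.

```python
import numpy as np

def shared_feature_fraction(acts_a, acts_b, top_k=50):
    """Fraction of the top-k most active features that two activation
    vectors have in common (a crude stand-in for 'shared features')."""
    top_a = set(np.argsort(acts_a)[-top_k:])
    top_b = set(np.argsort(acts_b)[-top_k:])
    return len(top_a & top_b) / top_k

# Hypothetical feature activations for the same prompt ("the opposite of
# small") posed in English and in French. Purely illustrative numbers.
rng = np.random.default_rng(0)
english_acts = rng.random(4096)
french_acts = 0.7 * english_acts + 0.3 * rng.random(4096)  # partially shared

print(f"top-feature overlap: {shared_feature_fraction(english_acts, french_acts):.2f}")
```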
Planning Ahead: Word by Word or Sentence by Sentence?
Language models are trained to generate text one word (strictly speaking, one token) at a time, a process that might seem inherently short-sighted.
For a while, it was assumed that models like GPT-4 or Claude were mostly just “thinking” about the next word, maybe keeping track of context, but not doing any serious long-term planning.
But Anthropic’s latest research challenges that assumption.
In one example involving a rhyming couplet whose first line ended in “grab it,” researchers expected Claude to ramble through the second line and only realize at the very end, “Oh! I need a word that rhymes with grab it,” then choose something like “rabbit.”
Instead, interpretability tools revealed that Claude had settled on the rhyme “rabbit” almost immediately, before it began writing the second line.
In other words, Claude had already planned the ending in advance, then shaped the rest of the sentence to reach that target word.
That’s impressive.
Even though the model outputs one word at a time, internally it was several steps ahead, juggling rhyme and meaning simultaneously. To test this, researchers “surgically” removed the concept of “rabbit” from Claude’s active internal features midway through its response. Claude didn’t miss a beat; it smoothly shifted to a different rhyme, “habit.”
They even injected an unrelated idea, “green,” at that point, and Claude adapted, changing the direction of the verse to talk about a garden and the color green, dropping the rhyme altogether.
This suggests Claude wasn’t just copying a memorized poem or predicting the next word based on probability alone. It was actively planning, and capable of adjusting that plan in real time.
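Here is a minimal sketch of what that kind of intervention looks like in principle, assuming an internal concept can be treated as a direction in the model’s hidden state. The vectors and names below (rabbit_dir, green_dir) are illustrative placeholders, not Claude’s actual features or Anthropic’s code.

```python
import numpy as np

def ablate_feature(activation, direction):
    """Remove a concept by projecting its direction out of the hidden state."""
    unit = direction / np.linalg.norm(direction)
    return activation - np.dot(activation, unit) * unit

def inject_feature(activation, direction, strength=5.0):
    """Steer toward a concept by adding its direction to the hidden state."""
    unit = direction / np.linalg.norm(direction)
    return activation + strength * unit

# Placeholder hidden state and concept directions.
rng = np.random.default_rng(0)
hidden = rng.normal(size=768)
rabbit_dir = rng.normal(size=768)
green_dir = rng.normal(size=768)

no_rabbit = ablate_feature(hidden, rabbit_dir)   # the planned rhyme is "forgotten"
steered = inject_feature(no_rabbit, green_dir)   # push the verse toward "green"

# The rabbit component is now (numerically) gone:
print(round(float(np.dot(no_rabbit, rabbit_dir / np.linalg.norm(rabbit_dir))), 6))
```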
The research points to something important: language models may routinely plan several steps ahead to produce coherent, natural-sounding text, even if all we see is one word at a time.
Multitasking Math: Parallel Paths to Problem Solving
It’s well known that language models can perform basic arithmetic or logic tasks, but how exactly do they pull it off?
They’re not explicitly programmed with math rules, and yet Claude can correctly solve problems like 36 + 59 in its “head.”
One theory was that it simply memorized a large number of examples from its training data—basically functioning like a massive lookup table. Another theory was that it had somehow learned to replicate the standard algorithm humans use.
But the truth turned out to be something else entirely and a bit weirder.
Anthropic found that Claude actually tackles addition using multiple strategies in parallel. When solving 36 + 59, one part of the model’s network focuses on the overall magnitude (an approximate total), while another zeroes in on the final digit.
Essentially, one process estimates, “This should land somewhere in the 90s,” while another calculates, “6 + 9 ends in 5.” These separate tracks then converge to produce the correct answer: 95.
This kind of divide-and-conquer approach isn’t how we usually teach math to humans, but it works remarkably well. It’s almost like the model developed its own unique math shortcut during training.
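The Python sketch below is a deliberate caricature of that convergence, not the circuit Anthropic actually found: one function stands in for the fuzzy magnitude estimate, another for the last-digit track, and a third merges the two.

```python
def magnitude_path(a, b):
    """Fuzzy track: narrow the answer to a ten-wide window
    ("this should land somewhere in the 90s")."""
    low = (a // 10 + b // 10) * 10        # tens parts: 30 + 50 = 80
    if a % 10 + b % 10 >= 10:             # a likely carry pushes it up a decade
        low += 10                         # -> the 90s
    return range(low, low + 10)

def ones_digit_path(a, b):
    """Precise track: work out only the final digit (6 + 9 ends in 5)."""
    return (a % 10 + b % 10) % 10

def converge(a, b):
    """Merge the tracks: the one value in the fuzzy window with the right last digit."""
    ones = ones_digit_path(a, b)
    return next(n for n in magnitude_path(a, b) if n % 10 == ones)

print(converge(36, 59))  # 95
```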
What’s even more interesting is that Claude doesn’t seem to know it’s doing this. When asked, “How did you get 95?” Claude responds like a student would: “I added the ones digits.”
But internally, that’s not what happened at all.
This is a clear example of what researchers call unfaithful explanations—when a model’s stated reasoning doesn’t reflect the actual process it used.
Claude has learned to sound like it’s reasoning the way we expect (probably based on how math is explained in its training data), but under the hood, it may be doing something entirely different.
This gap between what the model is actually doing and how it explains what it’s doing is a recurring theme in advanced AI, and one that raises important questions about how we interpret these systems.
Faithful vs. Fake Reasoning: Exposing the Limits of Chain-of-Thought
Modern AI models often “think out loud” when prompted, producing a step-by-step explanation before arriving at a final answer. This technique—known as chain-of-thought prompting—can improve performance and has become a standard tool for tackling complex tasks.
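In practice, chain-of-thought prompting can be as simple as telling the model to reason before it answers. A minimal sketch, using a made-up helper rather than any particular SDK:

```python
def build_prompt(question, chain_of_thought=True):
    """Assemble a prompt that optionally asks the model to reason step by
    step before stating its final answer (chain-of-thought prompting)."""
    instruction = (
        "Think through the problem step by step, then state the final answer."
        if chain_of_thought
        else "State only the final answer."
    )
    return f"{instruction}\n\nQuestion: {question}\nAnswer:"

print(build_prompt("A train travels 60 km in 45 minutes. What is its average speed in km/h?"))
```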
But Anthropic’s research into model interpretability reveals a surprising and somewhat troubling reality: just because an AI explains its reasoning doesn’t mean that’s how it actually reached the answer.
I’ll admit—even I found this a bit shocking.
To demonstrate the issue, researchers gave Claude two types of questions. One was simple enough that the model could solve it correctly. The other was practically unsolvable, so any step-by-step explanation would have to be fabricated.
In the first case, Claude was asked to find the square root of 0.64. It answered 0.8, and its reasoning aligned with the actual math. Interpretability tools confirmed that Claude’s internal activations included the genuine intermediate step of computing the square root of 64.
But when asked to calculate the cosine of a very large number—a problem beyond the model’s real capabilities—Claude still offered a detailed explanation.
The catch? It was completely made up.
There was no evidence that the model had done any real math. Instead, it generated a plausible-sounding procedure and landed on an arbitrary answer.
In other words, the explanation sounded good, but it wasn’t real.
What’s more, this behavior gets worse when the model picks up on what the user expects to hear. In one experiment, researchers gave Claude a misleading hint for a difficult question. The model responded by reverse-engineering a justification to match the hint.
This is an example of motivated reasoning—starting with a preferred conclusion, then inventing a rationale to support it.
From a reliability standpoint, that’s concerning. AI can generate convincing, logical-sounding arguments that are, in fact, false (especially when asked to explain its reasoning).
The upside? With the right interpretability tools, we can begin to tell the difference between genuine reasoning and on-the-fly improvisation. And that might be one of the most valuable insights we have about how these systems actually work.
Explaining Hallucinations: When Knowledge Breaks Down
If you’ve ever interacted with an AI, chances are you’ve seen it hallucinate—confidently stating something that’s completely false.
But why does this happen?
Anthropic’s research uncovered what looks like an internal tug-of-war between knowing and not knowing.
It turns out Claude has a built-in “default refusal” mechanism, a kind of safety net that tells the model to respond with something like “I can’t answer that” to most questions unless it’s really sure. That’s a sensible precaution. A responsible AI shouldn’t guess unless it has solid information.
But there's another circuit that does the opposite: it kicks in when the model detects that a question involves a known topic or entity. When that happens, it overrides the refusal and allows the model to respond.
When the question is about a well-known person or a widely discussed topic, the “I know this” signal takes over, and Claude answers. When it’s about something clearly unfamiliar, the “I don’t know” signal stays active, and the model appropriately declines to respond.
Hallucinations happen in the gray area between those two extremes—when Claude recognizes just enough of the question to feel confident answering, but doesn’t actually have the underlying facts.
That misplaced confidence disables the safety mechanism, and the model fills in the blanks with something that sounds right but isn’t. Anthropic even demonstrated that it could intentionally trigger hallucinations by manually activating certain internal features, causing Claude to repeatedly give the same, clearly incorrect response.
This suggests hallucinations aren’t just random errors. They’re often predictable breakdowns in an internal check, one that’s meant to decide whether the model has enough knowledge to answer in the first place.
That aligns with findings from other studies showing that models have a kind of internal sense of what they do and don’t know. Some researchers even refer to this as knowledge awareness—the model’s ability to assess its own confidence and decide whether to respond or defer.
The problem is that this self-awareness isn’t perfect.
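As a rough mental model, and nothing more, that interplay can be caricatured like this. The two scores are stand-ins for the internal signals the researchers describe, not values you can read off a real model.

```python
def toy_answer_policy(familiarity, fact_confidence, threshold=0.6):
    """Cartoon of the push and pull described above:
    - a default-refusal circuit declines unless something overrides it;
    - a "known entity" signal (familiarity) can override that refusal;
    - hallucinations live where familiarity is high but real knowledge is not."""
    if familiarity < threshold:
        return "I don't know."                                 # refusal stays active
    if fact_confidence >= threshold:
        return "Answer grounded in actual knowledge."          # override is justified
    return "Confident-sounding guess (hallucination risk)."    # override misfires

print(toy_answer_policy(familiarity=0.9, fact_confidence=0.9))  # answers
print(toy_answer_policy(familiarity=0.2, fact_confidence=0.1))  # declines
print(toy_answer_policy(familiarity=0.8, fact_confidence=0.2))  # the gray area
```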
So when your company’s chatbot confidently makes up a fact, it may genuinely think it knows the answer, even when it doesn’t. Understanding this gives AI developers a powerful tool: the ability to improve prompts, adjust system settings, or design smarter safeguards to ensure that, when the model is unsure, it leans toward being cautious.
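One of the simplest levers is a system prompt that nudges the model toward abstaining. The sketch below uses a generic chat-message layout; adapt it to whichever SDK you actually use, and treat the wording as a starting point rather than a proven safeguard.

```python
CAUTIOUS_SYSTEM_PROMPT = (
    "You are a careful assistant. If you are not confident an answer is "
    "factually correct, say 'I'm not sure' instead of guessing, and explain "
    "what additional information you would need."
)

def build_messages(user_question):
    """Assemble a chat request that leans toward caution when knowledge is thin."""
    return [
        {"role": "system", "content": CAUTIOUS_SYSTEM_PROMPT},
        {"role": "user", "content": user_question},
    ]

print(build_messages("What did our CEO say in yesterday's meeting?"))
```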
Final Thoughts
By tracing how AI models form and process ideas, we’re stepping into a new phase—one where we view these systems less as mysterious black boxes and more as complex cognitive tools that can be studied, debugged, and, ideally, trusted.
We’ve seen that AI reasoning can sometimes mirror human thought (abstracting concepts, planning ahead) and other times feel entirely foreign, like inventing odd ways to solve problems or faking a logical explanation when none exists.
Each insight, whether it's Claude crafting a rhyme in advance or fabricating a math proof, reveals another layer of how these systems work and how much more there is to uncover.
Fascinating... and along the way, we may even figure out how we ourselves think.