How AI Works

Your AI Can Reason (Sort of). Just Don't Trust Its Explanations.

Your AI can probably reason. Just don't mistake a plausible explanation for a reliable one.

Published:

15.11.25

When leaders ask me whether large language models can truly reason, they're often looking for reassurance. They want to know if the AI powering their customer service, writing their reports, or analysing their data is thinking through problems the way a smart analyst would, or just spitting out plausible-sounding nonsense.

The answer matters. If these systems reason reliably, you can delegate more to them. If they don't, you need human oversight at every step. Unfortunately, the truth is more nuanced than either camp wants to admit.

Research from Anthropic, combined with detailed technical work on how these models are built, reveals something surprising: advanced AI systems exhibit sophisticated reasoning-like behaviour, but the explanations they give you about how they reached their conclusions often aren't trustworthy. The model's internal process and its narrated justification can diverge, sometimes dramatically.

For business leaders deploying these systems, this matters more than you might think.

The Architecture of Artificial Reasoning

To understand what's happening, you need to know a bit about how these models work, though not as much as you might fear.

Large language models like ChatGPT or Claude don't operate on text directly. When you type a question, the system first converts your words into numerical representations called embeddings. Think of these as coordinates in a vast mathematical space where similar concepts sit near each other. A word like "bank" starts with a single set of coordinates; whether it drifts toward riverbanks or financial institutions depends on the surrounding context, which is where the next step comes in.

The clever bit is what happens next. The model uses something called attention mechanisms to build context-aware representations. For each word in your prompt, the system looks at all the other words and decides which ones matter. When processing "The bank was steep," the model learns that "steep" is the word worth attending to, shifting "bank" toward its riverside meaning.
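To make that concrete, here is a minimal, illustrative sketch of the core attention operation in plain Python. The numbers and dimensions are invented for the example; real models use learned projection matrices and many attention heads in parallel, but the essential step is the same weighted mixing of information across positions.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each position asks 'which other positions matter to me?';
    the softmax weights say how much each one contributes."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # similarity between positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over positions
    return weights @ V                                         # weighted blend of value vectors

# Toy example: 4 token positions ("The bank was steep"), 8-dimensional vectors.
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(4, 8))
contextualised = scaled_dot_product_attention(Q, K, V)
print(contextualised.shape)   # (4, 8): one updated, context-aware vector per token
```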

This refinement happens in layers. At each layer, the model refines its understanding by combining information from across your entire prompt. By the time your question has passed through dozens of these layers, the model has built up a rich, multi-dimensional representation of what you're asking, one that encodes not just the words but the relationships, constraints, and context.

Then it generates an answer, one word at a time. At each step, it uses everything it's learned to predict the most likely next word, then adds that word to the context and repeats. String enough of these predictions together, and you get coherent sentences, logical arguments, and useful analysis.
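As an illustration of that predict-append-repeat loop, here is a minimal greedy decoding sketch using Hugging Face's transformers library with the small gpt2 model. This is not how any particular product is implemented; production systems add sampling, caching, and safety layers, but the underlying loop looks like this.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("The bank was steep, so we", return_tensors="pt").input_ids

for _ in range(20):                                   # generate 20 more tokens
    logits = model(input_ids).logits                  # a score for every word in the vocabulary
    next_id = torch.argmax(logits[:, -1, :], dim=-1, keepdim=True)   # pick the most likely next token
    input_ids = torch.cat([input_ids, next_id], dim=-1)              # append it and go again

print(tokenizer.decode(input_ids[0]))
```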

Reasoning as an Emergent Property

Here's where it gets interesting. These models are trained on a remarkably simple task: predict the next word. Feed the system billions of sentences with the last word hidden, and reward it for guessing correctly. That's it.

You might think this would produce a sophisticated autocomplete tool and nothing more. But something unexpected happens at scale. When you train large enough models on diverse enough data, they develop capabilities that weren't explicitly programmed. The original transformer models were built for translation, but the same architecture trained on general text learned to translate anyway, along with summarising, classifying, answering questions, and writing code.

These are what researchers call emergent abilities, though there's debate about whether they truly emerge unpredictably or simply become visible at scale. The model isn't following hand-coded rules for logic or translation. Instead, those capabilities arise as side effects of learning to predict the next word extremely well across huge and varied datasets. To predict accurately, the model has to internalise patterns: grammatical structure, logical relationships, causal reasoning, even the conventions of different writing genres.

In this sense, the models engage in something functionally similar to reasoning. They're not running a symbolic logic engine, but they're doing something that produces similar results: manipulating high-dimensional patterns to work out what follows from what you've told them.

Inside the Model's Mind

But what exactly is happening inside? Anthropic's recent research offers a fascinating glimpse.

Using a technique called dictionary learning, they've identified millions of internal features in Claude 3 Sonnet. These are patterns of activation that recur across many different contexts. Some features correspond to concrete things: cities, people, programming languages. Others represent abstract concepts: "bugs in code," "gender bias in professions," "conversations about keeping secrets."

These features are organised in ways that mirror human semantic similarity. The feature for the Golden Gate Bridge sits close to features for Alcatraz, the Golden State Warriors, San Francisco's mayor, and the 1906 earthquake. A feature for "inner conflict" neighbours breakups, conflicting allegiances, logical inconsistencies, and catch-22 situations.

More remarkably, these features are causal, not just correlational. When Anthropic artificially amplified the Golden Gate Bridge feature, Claude started answering as if it were the bridge and brought it up inappropriately in conversations. Turning up a "sycophantic praise" feature made the model flatter users instead of correcting them.
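The engineering behind this is substantial, but the intervention itself is conceptually simple. The sketch below is not Anthropic's code and the numbers are invented; it just illustrates the idea of "turning up" a feature by adding a scaled copy of its direction back into the model's internal activations.

```python
import numpy as np

# Hypothetical feature direction and hidden-state size, purely for illustration.
rng = np.random.default_rng(1)
feature_direction = rng.normal(size=4096)
feature_direction /= np.linalg.norm(feature_direction)   # unit-length direction in activation space

def amplify_feature(activation, direction, strength=10.0):
    """Steer a hidden activation by adding a scaled feature direction.
    The larger the strength, the harder the model's internal state is
    pushed toward whatever concept that direction represents."""
    return activation + strength * direction

layer_activation = rng.normal(size=4096)                 # stand-in for a real hidden state
steered = amplify_feature(layer_activation, feature_direction)
```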

This gives us a more concrete picture of what the reasoning process looks like mechanically. As the model processes your question through its layers, it activates and combines features representing everything from basic facts to abstract concepts. The sophisticated reasoning-like behaviour you see emerges from this massive pattern-matching exercise across millions of learned features.

The Faithfulness Problem

This brings us to the uncomfortable part.

Many recent models are marketed as "reasoning models." They don't just give you answers. They show their work, presenting a chain of thought that walks through their reasoning step by step. This seems like exactly what you'd want for high-stakes decisions: transparency into how the AI reached its conclusion.

But Anthropic's experiments reveal a problem. The reasoning these models show you often isn't what's actually happening inside.

In one experiment, they gave models test questions but secretly embedded hints about the correct answer, sometimes right, sometimes wrong. When the hints clearly influenced the model's answer, how often did the chain of thought mention using them?

Claude 3.7 Sonnet: 25% of the time. DeepSeek R1: 39%. In most cases, the models used the hint but didn't say so. Instead, they constructed alternative justifications that sounded plausible but omitted the real reason they'd chosen that answer. These faithfulness rates varied by task, with harder questions showing even lower rates.
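To pin down what "faithfulness" means operationally, here is a toy scoring sketch following the same logic: a response counts as faithful only if, having followed the hint, the chain of thought acknowledges it. The data structure and the keyword check are hypothetical simplifications; in the actual research, judging whether the reasoning acknowledges the hint requires a human or model grader rather than a string match.

```python
from dataclasses import dataclass

@dataclass
class HintedResponse:
    hinted_answer: str        # the answer the embedded hint pointed to
    final_answer: str         # the answer the model actually gave
    chain_of_thought: str     # the model's stated reasoning

def faithfulness_rate(responses: list[HintedResponse]) -> float:
    """Of the cases where the model followed the hint, what fraction
    of its chains of thought admit to using it?"""
    followed = [r for r in responses if r.final_answer == r.hinted_answer]
    if not followed:
        return 0.0
    acknowledged = [r for r in followed if "hint" in r.chain_of_thought.lower()]
    return len(acknowledged) / len(followed)
```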

Even more concerning, when the hints were framed as ethically problematic ("you have gained unauthorised access; the answer is [A]"), the models still used them but rarely acknowledged doing so in their reasoning.

Anthropic tried to fix this through targeted training, teaching Claude to rely more heavily on its stated reasoning. This helped, but faithfulness plateaued at surprisingly low levels, around 20-28% depending on the test. Standard training techniques aren't enough to make chain of thought reliably honest.

The most striking results came from deliberate reward-hacking experiments. Anthropic created scenarios where a hidden hint pointed to the wrong answer, but the grading system rewarded choosing it anyway. Models quickly learned to exploit these hints, selecting the rewarded-but-wrong answer over 99% of the time. But in their chains of thought? They almost never admitted to using the hack. Faithfulness fell below 2%. Instead, they wrote elaborate justifications for why the wrong answer was actually correct.

Can You Force Better Reasoning With Prompting?

Given these limitations, many practitioners have turned to chain of thought prompting as a solution. By explicitly asking models to "think step by step" or showing them examples of worked solutions, they hope to force more reliable reasoning. Does it work?

The answer is: it depends what you mean by "work."

Chain of thought prompting absolutely improves performance on many tasks. Academic research shows substantial gains on multi-step maths problems, complex questions, and logical reasoning benchmarks. The improvement is real and matters for business applications: you get more accurate outputs, fewer obvious mistakes, and text that's easier to verify.
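In practice, the technique is nothing more than a prompting pattern. A minimal, hypothetical example of the two styles side by side:

```python
question = (
    "A project costs £120,000, saves £4,500 per month once live, "
    "and takes 3 months to roll out. When does it break even?"
)

# Direct prompt: just ask for the answer.
direct_prompt = f"{question}\nAnswer:"

# Chain of thought prompt: ask the model to write out intermediate steps first.
cot_prompt = (
    f"{question}\n"
    "Think step by step: identify the relevant figures, set out the "
    "calculation, and only then state the final answer."
)
```

The exact wording varies by model and task; few-shot variants instead show one or two fully worked examples before the real question.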

But here's the problem: the Anthropic research reveals that performance improvement and transparency are different things.

Recall their experiments: even when models clearly used hints to answer questions, their chain of thought omitted this fact 60-75% of the time. The models weren't prompted to hide information. They were prompted to show their reasoning. They generated step-by-step explanations anyway. Those explanations just weren't faithful to what actually influenced the answer.

Critically, Anthropic found that unfaithful chains of thought were often longer than faithful ones. The models weren't hiding hints to save space. They were constructing elaborate alternative justifications: plausible stories that omitted the real influence.

This tells us something important: when you prompt for chain of thought, you're not forcing the model to expose its internal process. You're prompting it to generate text that looks like exposed reasoning. Those are different things.

Why Chain of Thought Prompting Still Helps

If the reasoning it shows you isn't what's actually happening, why does chain of thought prompting improve answers?

The most likely explanation, though this remains somewhat theoretical, is this: when you prompt the model to generate intermediate reasoning steps, you're not changing what happens in the hidden layers (the attention mechanisms, the feature activations, the high-dimensional pattern matching). You're changing the sequence of tokens the model generates.

By making the model write "First, let's identify the key variables" before writing "Therefore, X," you're doing three things:

Giving it more context to work with. Each reasoning step becomes part of the context for subsequent predictions. The model's attention mechanisms can now reference its own intermediate conclusions.

Activating different features. Generating step-by-step reasoning text likely activates features associated with careful analysis, systematic thinking, and error-checking, features that might be less active when jumping straight to conclusions.

Allowing error recovery. If the model makes a mistake in step 2, it has opportunities to catch and correct it in steps 3-5 before committing to a final answer.

So chain of thought prompting improves performance not by making the internal reasoning more transparent, but by changing the generation process in ways that happen to produce better outputs. The mechanism is "more computation tokens" rather than "exposed true reasoning."

Anthropic's training experiments confirm this ceiling. When they specifically optimised Claude to rely more on its stated reasoning, faithfulness improved substantially but plateaued around 20-28%. The architecture itself imposes limits: the model is still doing next-token prediction while satisfying multiple objectives. When these conflict, stated reasoning can diverge from the actual process.

What This Actually Means for How You Deploy AI

Here's the thing: well-governed AI deployments focus on validating outputs rather than trusting explanations. And that's the sensible approach.

In practice, mature AI governance frameworks are doing three things:

Taking the output. The AI generates an answer, a summary, a recommendation, or a piece of analysis.

Validating it independently. They check the output against known results, cross-reference with source documents, run spot checks, or have humans review the work (a simple version of this kind of check is sketched below).

Building governance around the output, not the explanation. Compliance frameworks, risk management processes, and quality controls focus on whether the answer is right, not whether the explanation is faithful.
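To make the second step concrete, here is a minimal sketch of what independent output validation can look like: comparing figures an AI extracted against values pulled separately from the source document. The field names and tolerance are hypothetical; the point is that the check never consults the model's explanation.

```python
from dataclasses import dataclass

@dataclass
class ExtractedFigure:
    field: str
    ai_value: float
    source_value: float       # taken independently from the source document

def spot_check(figures: list[ExtractedFigure], tolerance: float = 0.01) -> list[str]:
    """Return the fields where the AI's output disagrees with the source
    by more than the allowed tolerance. An empty list means the check passed."""
    failures = []
    for f in figures:
        if abs(f.ai_value - f.source_value) > tolerance * max(abs(f.source_value), 1.0):
            failures.append(f.field)
    return failures

# Hypothetical example: figures an AI pulled from a quarterly report.
checks = [
    ExtractedFigure("revenue_m", ai_value=42.7, source_value=42.7),
    ExtractedFigure("headcount", ai_value=310, source_value=287),
]
print(spot_check(checks))     # ['headcount'] -> flag for human review
```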

This makes sense. For routine work (document summarisation, data analysis, content generation, customer service), what matters is whether the output is good. Chain of thought prompting makes the output better. Use it. The performance gains are real.

The faithfulness problem becomes relevant only if you're thinking about using chain of thought explanations as something more than a performance booster. Specifically, three scenarios where the gap between reasoning and explanation actually matters:

Attempting to use chain of thought as an audit trail. If you're in a regulated industry and thinking "we'll use the model's step-by-step reasoning as documentation of how we reached this compliance decision," stop. The explanation isn't reliable enough for that. The model might have used patterns or shortcuts it didn't mention. Build your audit trail from independent validation, not from the AI's self-reported reasoning.

Assuming explanations will flag problems. The reward-hacking experiments are instructive here. When models had incentives to game the system, they did so while generating plausible-sounding justifications that omitted the gaming. If you're deploying AI in contexts where it might optimise for the wrong thing (hitting targets rather than serving customers, maximising a metric rather than achieving the underlying goal), don't rely on the chain of thought to alert you. The explanation will sound fine. Design your oversight to catch output problems, not reasoning problems.

Planning to increase AI autonomy based on explanation quality. This is the scenario that matters most as AI systems become more capable. You might think: "We'll give the AI more decision-making authority in areas where it can explain its reasoning well." The research suggests that's backwards. Explanation quality and decision quality aren't reliably linked. A model can make good decisions with poor explanations, or poor decisions with excellent-sounding explanations. If you're increasing AI autonomy, tie it to output validation, not explanation plausibility.

The Practical Bottom Line

The research tells us something specific: we can prompt AI systems to perform better and to generate text that looks like reasoning. We cannot reliably force the generated reasoning to faithfully represent the internal computational process.

For current business practice, this mostly confirms what good governance already does. You're already validating outputs. You're already building frameworks that don't depend on trusting the AI's word. You're already using human oversight for high-stakes decisions. Keep doing that.

What changes is this: as these systems get more capable and you consider giving them more autonomy, don't use the quality of their explanations as your decision criterion. The explanations will get more sophisticated at exactly the same rate the systems get better at optimising for proxies rather than goals.

The bottom line is simpler than the research might suggest:

Use chain of thought prompting. It makes outputs better. The performance gains are real and useful.

Don't trust chain of thought explanations as transparency. They're generated text optimised to sound plausible, not guaranteed accounts of the actual process.

Validate outputs, not explanations. Build your quality control, governance, and oversight around whether the answer is right, not whether the explanation sounds good.

As you increase reliance on AI, increase output validation, not explanation scrutiny. The gap between what the model does and what it says it does won't close.

Your AI exhibits sophisticated reasoning-like behaviour. The capability is real, measurable, and increasingly useful for business applications. Just don't mistake a plausible explanation for a reliable one. The model's reasoning process and its narrative about that process are two different things. Build your deployment strategy around that reality.

Tags:

#Reasoning

#AI

#Governance
