THE SINGULARITY | AI That Lies? Faithfulness Concerns in Chain-of-Thought Models

By Jay Liu, Science Columnist • April 17, 2025

In this installment of his column “The Singularity,” Jay Liu (CAS ’28) discusses a new feature of artificial intelligence (AI) called Chain-of-Thought, which purportedly displays models’ reasoning processes, despite mixed findings about their accuracy.

Chain-of-thought (CoT) prompting is an increasingly popular approach to artificial intelligence (AI) training that boosts models’ reasoning capabilities. The technique prompts large-language models, which use existing data to predict and generate human-like text, to show step by step how they logically reach conclusions. Instead of training models based on direct question-and-answer prompts, CoT prompting allows models to predict how a human may approach any given task, allowing it to more accurately mimic the human thought process.

For example, traditional training would teach a model multiplication by repeatedly telling it the answer to 10 times 10 is 100. On the other hand, CoT prompting might teach the model that multiplication is like asking, “How many of something — 10 times 10 — can be thought of as there being 10 groups of the number 10 and therefore 100 total.”

Training AI models with CoT prompting improves their performance in complicated tasks, such as advanced math or programming problems. It also allows users to see how an AI model reached its conclusions, enabling them to more easily spot potential misinformation or mistakes in reasoning. This not only improves models’ reasoning capabilities but also gives their operation more transparency.

However, it is unclear how well the generated CoT truly represents how a model reaches its answers. The CoT generated is not necessarily the same logical pathway the AI used to arrive at its answer; it might be a series of arbitrary steps generated to justify its answer that do not accurately describe the actual process that led to the output.

This raises the question of how trustworthy CoT responses are. Anthropic — the AI company that develops and maintains Claude, a family of large language models — published a blog post April 3 summarizing findings from a paper about demonstrated doubt in the faithfulness of CoT responses. They are, more often than not, inaccurate representations of how the models process information and produce answers.

Anthropic’s alignment science team, which evaluates and monitors AI safety, tested the faithfulness of CoT responses for two prominent models, Claude 3.7 Sonnet and DeepSeek R1. The team first gave the models questions to answer and then gave them the same questions, but this time with hints to the correct answers. The team found that, although the models changed their answers based on the hints, they rarely acknowledged the existence of the hint or discussed how it changed their decision-making in their CoT responses. Averaging their findings with different types of given hints, the team found that the Claude and R1 models mentioned the hint in their responses only 25% of the time and 39% of the time, respectively.

In a different experiment, the team offered models giveaway hints that fully revealed the correct answer but told models they had gained unauthorized access to it. In these types of situations, Claude admitted to using the information 41% of the time and R1 did so only 19% of the time, showing that these models may access such information without users’ knowledge, even when explicitly asked for a CoT explanation. It also means that, in general, CoT responses do not accurately showcase how models think, revealing persistent ethical challenges.

Many dangers could arise from this lack of faithfulness. In these tests, AI models are hiding how they generate answers, even when the hints given are labeled as information that they were not meant to access. This, by extension, means that such models may misuse critical information they find in their vast training data — all without human awareness.

In the same paper, the team also showed that AI models are vulnerable to reward hacking, the process of optimizing for positive feedback rather than for task completion. For example, when a model receives behavioral reinforcement points for giving incorrect answers, it starts to give incorrect answers despite being programmed to give correct ones. 99% of the time, the two models answered the questions with incorrect answers that rewarded them through reinforcement points. CoT analysis admitted that the reward affected decision-making less than 2% of the time.

All of these results illustrate potential dangers of AI with regard to how models use information. Frequently, they do not accurately display their reasoning process. They do not reliably disclose their use of sensitive information, nor do they acknowledge how rewards affect their decision-making. These findings highlight many ways malicious actors could exploit AI models. They also prove that, despite its potential, the CoT feature cannot act as a proxy for models’ true “thought” processes, presenting new obstacles to AI regulation.

Donate to The Hoya

Your donation will support the student journalists of Georgetown University. Your contribution will allow us to purchase equipment and cover our annual website hosting costs.