CoT is Not the Chain of Truth: An Empirical Internal Analysis of Reasoning LLMs for Fake News Generation

📅 2026-02-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Although large language models can refuse to generate fake news, their internal chain-of-thought (CoT) reasoning may still implicitly harbor and propagate unsafe narratives. This work proposes the first unified analytical framework for evaluating CoT safety in refusal scenarios. By hierarchically dissecting the reasoning process and combining Jacobian spectral metrics with attention head-level analysis, the study finds that risk concentrates within a few consecutive middle layers. It further introduces three interpretable metrics (stability, geometry, and energy) to quantify the role of individual attention heads in spurious reasoning. Experiments show that activating specific thought patterns significantly elevates generation risk, and the framework precisely pinpoints the attention heads responsible for reasoning drift, challenging the prevailing assumption that refusal implies safety.
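As a rough, hedged illustration (this page does not reproduce the authors' code), the sketch below estimates the top singular value of a single transformer layer's input-output Jacobian by power iteration, the kind of Jacobian spectral quantity the summary refers to. The model choice (`gpt2` as a lightweight stand-in), the probe prompt, and the layer index are illustrative assumptions, not the paper's setup.

```python
# A minimal sketch, assuming a Hugging Face causal LM as a stand-in model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper studies reasoning-oriented LLMs
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

inputs = tok("Example probe prompt for a refused request.", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs, output_hidden_states=True).hidden_states

layer_idx = 6                      # illustrative mid-depth layer
block = model.transformer.h[layer_idx]
h_in = hidden[layer_idx].detach()  # hidden[i] is the input to block i

def f(h):
    # GPT-2 blocks return a tuple; element 0 is the new hidden state
    return block(h)[0]

# Power iteration on J^T J to estimate the Jacobian's top singular value.
v = torch.randn_like(h_in)
v = v / v.norm()
for _ in range(8):
    jv = torch.autograd.functional.jvp(f, h_in, v)[1]     # J v
    jtjv = torch.autograd.functional.vjp(f, h_in, jv)[1]  # J^T (J v)
    v = jtjv / jtjv.norm()
sigma_max = torch.autograd.functional.jvp(f, h_in, v)[1].norm()
print(f"layer {layer_idx}: estimated sigma_max = {sigma_max.item():.3f}")
```

A layer whose estimated spectral norm spikes relative to its neighbors would be a candidate for the kind of mid-depth routing point the summary describes.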

📝 Abstract
From generating headlines to fabricating news, Large Language Models (LLMs) are typically assessed by their final outputs, under the safety assumption that a refusal response signifies safe reasoning throughout the entire process. Challenging this assumption, our study reveals that during fake news generation, even when a model rejects a harmful request, its Chain-of-Thought (CoT) reasoning may still internally contain and propagate unsafe narratives. To analyze this phenomenon, we introduce a unified safety-analysis framework that systematically deconstructs CoT generation across model layers and evaluates the role of individual attention heads through Jacobian-based spectral metrics. Within this framework, we introduce three interpretable measures (stability, geometry, and energy) to quantify how specific attention heads respond to or embed deceptive reasoning patterns. Extensive experiments on multiple reasoning-oriented LLMs show that generation risk rises significantly when the thinking mode is activated, with the critical routing decisions concentrated in only a few contiguous mid-depth layers. By precisely identifying the attention heads responsible for this divergence, our work challenges the assumption that refusal implies safety and provides a new perspective for understanding and mitigating latent reasoning risks.
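The stability, geometry, and energy measures are the authors' own constructions and are not defined on this page, so they are not reproduced here. Purely to illustrate head-level probing of the kind the abstract describes, the sketch below flags attention heads whose attention distributions have unusually low entropy on a probe prompt; the model, prompt, and threshold are arbitrary assumptions, not the paper's metrics.

```python
# A minimal sketch; the entropy statistic is a generic proxy, NOT the
# paper's stability/geometry/energy metrics, which are not defined here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder stand-in model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

inputs = tok("Example probe prompt for a refused request.", return_tensors="pt")
with torch.no_grad():
    attentions = model(**inputs, output_attentions=True).attentions

for layer_idx, attn in enumerate(attentions):  # attn: (batch, heads, q, k)
    p = attn.clamp_min(1e-12)
    # Mean entropy of each head's attention distribution over key positions.
    head_entropy = -(p * p.log()).sum(-1).mean(dim=(0, 2))
    flagged = (head_entropy < 0.5).nonzero(as_tuple=True)[0].tolist()
    print(f"layer {layer_idx:2d} low-entropy heads (threshold 0.5): {flagged}")
```

In practice, per-head statistics like these would be compared between refusal and thinking-mode generations to localize the heads driving the divergence.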
Problem

Research questions and friction points this paper is trying to address.

Chain-of-Thought
Fake News Generation
Reasoning Safety
Large Language Models
Internal Reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-Thought
attention heads
safety analysis
Jacobian spectral metrics
reasoning LLMs
Authors

Zhao Tong
Inria Sophia Antipolis
Geometry Processing
Chunlin Gong
University of Minnesota
Yiping Zhang
University of the Chinese Academy of Sciences
Qiang Liu
Institute of Automation, Chinese Academy of Sciences
Data Mining · Multimodal LLMs · AI for Science
Xingcheng Xu
Shanghai Artificial Intelligence Laboratory
Shu Wu
Institute of Automation, Chinese Academy of Sciences
Haichao Shi
Institute of Information Engineering, Chinese Academy of Sciences
Xiao-Yu Zhang
Institute of Information Engineering, Chinese Academy of Sciences