🤖 AI Summary
Although large language models can refuse to generate fake news, their internal chain-of-thought (CoT) reasoning may still implicitly harbor and propagate unsafe narratives. This work proposes the first unified analytical framework for evaluating CoT safety in refusal scenarios. By hierarchically dissecting the reasoning process and integrating Jacobian spectral metrics with attention head–level analysis, the study identifies that risk concentrates within a few consecutive middle layers. Furthermore, it introduces three interpretable metrics—stability, geometry, and energy—to quantitatively assess the role of individual attention heads in spurious reasoning. Experiments demonstrate that activating specific thought patterns significantly elevates generation risk and precisely pinpoints the critical attention heads responsible for reasoning drift, thereby challenging the prevailing assumption that “refusal implies safety.”
📝 Abstract
From generating headlines to fabricating news, Large Language Models (LLMs) are typically assessed by their final outputs, under the safety assumption that a refusal response signifies safe reasoning throughout the entire process. Challenging this assumption, our study reveals that during fake news generation, even when a model rejects a harmful request, its Chain-of-Thought (CoT) reasoning may still internally contain and propagate unsafe narratives. To analyze this phenomenon, we introduce a unified safety-analysis framework that systematically deconstructs CoT generation across model layers and evaluates the role of individual attention heads through Jacobian-based spectral metrics. Within this framework, we define three interpretable measures (stability, geometry, and energy) to quantify how specific attention heads respond to, or embed, deceptive reasoning patterns. Extensive experiments on multiple reasoning-oriented LLMs show that generation risk rises significantly when the thinking mode is activated, with the critical routing decisions concentrated in only a few contiguous mid-depth layers. By precisely identifying the attention heads responsible for this divergence, our work challenges the assumption that refusal implies safety and offers a new perspective for understanding and mitigating latent reasoning risks.
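To make the idea of Jacobian-based spectral metrics concrete, the following is a minimal, self-contained sketch (not the paper's actual implementation): it builds a toy single attention head in NumPy, estimates the Jacobian of the head's output with respect to its input by finite differences, and summarizes the Jacobian's singular values as three illustrative scores named after the abstract's "stability", "geometry", and "energy". The exact definitions used here (top singular value, condition-number-like spread, squared Frobenius norm) are hypothetical proxies chosen for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(x, Wq, Wk, Wv):
    # Toy single-head self-attention on a (seq, d) input.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = softmax(q @ k.T / np.sqrt(k.shape[1]))
    return scores @ v

def jacobian_spectral_metrics(x, Wq, Wk, Wv, eps=1e-5):
    """Finite-difference Jacobian of the head output w.r.t. its input,
    summarized by spectral quantities (singular values). The three
    metric definitions below are illustrative proxies, not the paper's."""
    y0 = attention_head(x, Wq, Wk, Wv).ravel()
    J = np.zeros((y0.size, x.size))
    for i in range(x.size):
        dx = np.zeros(x.size)
        dx[i] = eps
        y1 = attention_head(x + dx.reshape(x.shape), Wq, Wk, Wv).ravel()
        J[:, i] = (y1 - y0) / eps
    s = np.linalg.svd(J, compute_uv=False)
    return {
        "stability": float(s[0]),                  # top singular value: local Lipschitz-style bound
        "geometry": float(s[0] / (s[-1] + 1e-12)), # condition-number-like spread of directions
        "energy": float(np.sum(s**2)),             # squared Frobenius norm of the Jacobian
    }

rng = np.random.default_rng(0)
seq, d = 3, 4
x = rng.standard_normal((seq, d))
Wq, Wk, Wv = (0.5 * rng.standard_normal((d, d)) for _ in range(3))
m = jacobian_spectral_metrics(x, Wq, Wk, Wv)
print({k: round(v, 3) for k, v in m.items()})
```

Under these proxy definitions, a head whose Jacobian has a large top singular value amplifies small input perturbations strongly, which is one plausible way a single head could disproportionately steer reasoning toward an unsafe narrative.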