Does Chain-of-Thought Reasoning Really Reduce Harmfulness from Jailbreaking?

📅 2025-05-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study challenges the prevailing assumption that chain-of-thought (CoT) reasoning inherently enhances model safety and systematically investigates its dual role in jailbreaking attacks. Through a formal analysis of satisfiability conditions, it provides the first theoretical proof that CoT can either suppress or amplify jailbreaking behavior, depending on properties of the reasoning path. Building on this insight, the authors propose FicDetail, a novel jailbreaking method that exploits *detail hallucination* within CoT reasoning steps to trigger harmful content generation. Empirical evaluation on state-of-the-art reasoning models, including GPT-4o and Claude-3.5, demonstrates that FicDetail achieves significantly higher success rates than existing baselines, revealing critical safety blind spots inherent to the CoT paradigm. The work also introduces the first formal model of harmfulness propagation in CoT, enabling fine-grained analysis of how unsafe outputs emerge from intermediate reasoning steps. Collectively, it establishes reasoning-process-aware evaluation as a new dimension for large language model safety alignment.

📝 Abstract
Jailbreak attacks have been observed to largely fail against recent reasoning models enhanced by Chain-of-Thought (CoT) reasoning. However, the underlying mechanism remains underexplored, and relying solely on reasoning capacity may raise security concerns. In this paper, we try to answer the question: Does CoT reasoning really reduce harmfulness from jailbreaking? Through rigorous theoretical analysis, we demonstrate that CoT reasoning has dual effects on jailbreaking harmfulness. Based on the theoretical insights, we propose a novel jailbreak method, FicDetail, whose practical performance validates our theoretical findings.

Problem

Research questions and friction points this paper is trying to address.

Does CoT reasoning reduce jailbreak harmfulness?
Mechanism of CoT in jailbreak attacks is unclear
Whether the theoretical findings hold in practice, tested with the proposed FicDetail attack

Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzes Chain-of-Thought reasoning's dual effects
Proposes novel jailbreak method FicDetail
Validates the theoretical findings through FicDetail's practical performance

Chengda Lu
Tsinghua University
NLP
Xiaoyu Fan
IIIS, Tsinghua University; Shanghai Qi Zhi Institute
Yu Huang
Department of Statistics and Data Science, University of Pennsylvania
Rongwu Xu
University of Washington
AI, Human-AI, NLP, RL
Jijie Li
Data Agent Group, Beijing Academy of Artificial Intelligence
Wei Xu
IIIS, Tsinghua University; Shanghai Qi Zhi Institute