🤖 AI Summary
This study challenges the prevailing assumption that chain-of-thought (CoT) reasoning inherently enhances model safety by systematically investigating its dual role in jailbreaking attacks. By formally analyzing satisfiability conditions, the authors provide the first theoretical proof that CoT can either suppress or amplify jailbreaking behavior, depending on properties of the reasoning path. Leveraging this insight, they propose FicDetail, a novel jailbreaking method that exploits *detail hallucination* within CoT reasoning steps to elicit harmful content. Empirical evaluation on state-of-the-art reasoning models, including GPT-4o and Claude-3.5, demonstrates that FicDetail achieves significantly higher success rates than existing baselines, revealing critical safety blind spots inherent to the CoT paradigm. The work further introduces the first formal model of harmfulness propagation in CoT, enabling fine-grained analysis of how unsafe outputs emerge from intermediate reasoning steps. Collectively, it establishes reasoning-process-aware evaluation as a new dimension of large language model safety alignment.
📝 Abstract
Jailbreak attacks have been observed to largely fail against recent reasoning models enhanced by Chain-of-Thought (CoT) reasoning. However, the underlying mechanism remains underexplored, and relying solely on reasoning capacity for safety may itself raise security concerns. In this paper, we address the question: does CoT reasoning really reduce the harmfulness of jailbreaking? Through rigorous theoretical analysis, we demonstrate that CoT reasoning has dual effects on jailbreaking harmfulness. Building on these theoretical insights, we propose a novel jailbreak method, FicDetail, whose practical performance validates our theoretical findings.