🤖 AI Summary
This study challenges the prevailing assumption that chain-of-thought (CoT) reasoning inherently enhances model safety by systematically investigating its dual role in jailbreaking attacks. By formally analyzing satisfiability conditions, the authors provide the first theoretical proof that CoT can either suppress or amplify jailbreaking behavior, depending on properties of the reasoning path. Leveraging this insight, they propose FicDetail, a novel jailbreaking method that exploits *detail hallucination* within CoT reasoning steps to elicit harmful content. Empirical evaluation on state-of-the-art reasoning models, including GPT-4o and Claude-3.5, demonstrates that FicDetail achieves significantly higher success rates than existing baselines, revealing critical safety blind spots inherent to the CoT paradigm. The work further introduces the first formal model of harmfulness propagation in CoT, enabling fine-grained analysis of how unsafe outputs emerge from intermediate reasoning steps. Collectively, it establishes reasoning-process-aware evaluation as a new dimension of large language model safety alignment.
📝 Abstract
Jailbreak attacks have been observed to largely fail against recent reasoning models enhanced by Chain-of-Thought (CoT) reasoning. However, the underlying mechanism remains underexplored, and relying solely on reasoning capacity for safety may itself raise security concerns. In this paper, we address the question: does CoT reasoning really reduce the harmfulness of jailbreaking? Through rigorous theoretical analysis, we demonstrate that CoT reasoning has dual effects on jailbreaking harmfulness. Building on these theoretical insights, we propose a novel jailbreak method, FicDetail, whose practical performance validates our theoretical findings.