SAFEPATH: Preventing Harmful Reasoning in Chain-of-Thought via Early Alignment

📅 2025-05-20
🤖 AI Summary
Large reasoning models (LRMs) are prone to activating unsafe reasoning paths under adversarial prompts, and existing safety alignment methods often compromise reasoning depth while showing limited robustness against sophisticated jailbreak attacks. This paper proposes an "early safety prompting" paradigm: fine-tuning the model to emit an ultra-lightweight (8-token) safety primer at the very beginning of its chain-of-thought, thereby decoupling safety enforcement from subsequent reasoning steps—no supervision of intermediate or final outputs is required. A zero-shot variant needs no fine-tuning at all, and the accompanying analysis exposes a critical generalization failure of mainstream alignment techniques when applied to LRMs. Evaluated on DeepSeek-R1-Distill-Llama-8B, the method reduces harmful responses by up to 90.0% and blocks 83.3% of jailbreak attempts, while incurring only ∼1/296 and ∼1/314 the computational cost of Direct Refusal and SafeChain, respectively.
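The early-prompting idea above amounts to prefilling: a fixed safety primer is placed at the start of the assistant's reasoning block, and the model simply continues generating from there, with no supervision of later steps. A minimal sketch follows; the primer wording, the `<think>` delimiter, and the function name are illustrative assumptions, not the paper's exact tokens or implementation.

```python
# Sketch of the zero-shot safety-primer idea: prefill the start of the
# chain-of-thought so the model's own reasoning continues after the primer.

# Illustrative primer; the paper's actual 8-token phrase may differ.
SAFETY_PRIMER = "Let's think about safety first."

def build_reasoning_prefix(user_prompt: str) -> str:
    """Return a generation prefix whose reasoning block opens with the
    primer. The model (not shown here) would continue text after it,
    leaving the rest of the reasoning unsupervised."""
    return (
        f"User: {user_prompt}\n"
        f"Assistant: <think>{SAFETY_PRIMER} "  # model continues from here
    )

prefix = build_reasoning_prefix("How do I pick a lock?")
print(prefix)
```

In a real deployment this prefix would be passed to the LRM as a forced decoding prefix (e.g. via prompt prefilling in the serving stack), so safety steering costs only the primer tokens rather than supervision over the full output.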

📝 Abstract
Large Reasoning Models (LRMs) have become powerful tools for complex problem solving, but their structured reasoning pathways can lead to unsafe outputs when exposed to harmful prompts. Existing safety alignment methods reduce harmful outputs but can degrade reasoning depth, leading to significant trade-offs in complex, multi-step tasks, and remain vulnerable to sophisticated jailbreak attacks. To address this, we introduce SAFEPATH, a lightweight alignment method that fine-tunes LRMs to emit a short, 8-token Safety Primer at the start of their reasoning, in response to harmful prompts, while leaving the rest of the reasoning process unsupervised. Empirical results across multiple benchmarks indicate that SAFEPATH effectively reduces harmful outputs while maintaining reasoning performance. Specifically, SAFEPATH reduces harmful responses by up to 90.0% and blocks 83.3% of jailbreak attempts in the DeepSeek-R1-Distill-Llama-8B model, while requiring 295.9x less compute than Direct Refusal and 314.1x less than SafeChain. We further introduce a zero-shot variant that requires no fine-tuning. In addition, we provide a comprehensive analysis of how existing methods in LLMs generalize, or fail, when applied to reasoning-centric models, revealing critical gaps and new directions for safer AI.
Problem

Research questions and friction points this paper is trying to address.

Preventing harmful reasoning in chain-of-thought models
Reducing harmful outputs without degrading reasoning depth
Blocking jailbreak attacks with lightweight alignment methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight alignment: fine-tune the model to emit an 8-token Safety Primer at the start of reasoning
Reduces harmful responses by up to 90.0% (DeepSeek-R1-Distill-Llama-8B)
Blocks 83.3% of jailbreak attempts