Reasoning Structure Matters for Safety Alignment of Reasoning Models

📅 2026-04-20
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the tendency of large reasoning models (LRMs) to generate harmful content in response to malicious queries, a risk the authors trace to the reasoning structure itself. The study analyzes the relationship between reasoning structure and model safety and introduces a lightweight safety alignment approach that requires no reinforcement learning. AltTrain, a supervised fine-tuning method that uses on the order of one thousand examples, explicitly alters the reasoning structure and achieves strong safety alignment across model backbones and sizes. It also generalizes robustly across reasoning, question answering, summarization, and multilingual settings.

📝 Abstract
Large reasoning models (LRMs) achieve strong performance on complex reasoning tasks but often generate harmful responses to malicious user queries. This paper investigates the underlying cause of these safety risks and shows that the issue lies in the reasoning structure itself. Based on this insight, we claim that effective safety alignment can be achieved by altering the reasoning structure. We propose AltTrain, a simple yet effective post-training method that explicitly alters the reasoning structure of LRMs. AltTrain is both practical and generalizable: it requires no complex reinforcement learning (RL) training or reward design, only supervised fine-tuning (SFT) on a lightweight set of 1K training examples. Experiments across LRM backbones and model sizes demonstrate strong safety alignment, along with robust generalization across reasoning, QA, summarization, and multilingual settings.
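
To make the data side of "SFT with about 1K examples" concrete, here is a minimal sketch of assembling such a training set. The abstract does not say what the altered reasoning structure looks like or how the examples are built, so everything below is an assumption for illustration: the `<think>` delimiters, the function names `build_sft_example` and `sample_sft_set`, and the placeholder reasoning steps are hypothetical and are not taken from AltTrain.

```python
import random

# Assumed delimiters for the reasoning segment; the abstract does not specify them.
THINK_OPEN, THINK_CLOSE = "<think>", "</think>"


def build_sft_example(prompt, reasoning_steps, answer):
    """Format one prompt/completion pair with an explicit reasoning segment.

    `reasoning_steps` stands in for whatever restructured reasoning trace
    the training data uses; its content here is purely hypothetical.
    """
    reasoning = "\n".join(step.strip() for step in reasoning_steps)
    completion = f"{THINK_OPEN}\n{reasoning}\n{THINK_CLOSE}\n{answer.strip()}"
    return {"prompt": prompt.strip(), "completion": completion}


def sample_sft_set(examples, k=1000, seed=0):
    """Draw a reproducible subset of at most k examples (k=1000 mirrors '1K')."""
    rng = random.Random(seed)
    if len(examples) <= k:
        return list(examples)
    return rng.sample(examples, k)


# Placeholder pool of candidate examples; real data would replace this.
pool = [
    build_sft_example(f"query {i}", ["step one", "step two"], f"answer {i}")
    for i in range(5000)
]
train_set = sample_sft_set(pool)
```

The resulting `train_set` is a list of prompt/completion dictionaries that could be passed to any standard supervised fine-tuning loop; the sketch stops short of training because the source gives no hyperparameters or training details.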
Problem

Research questions and friction points this paper is trying to address.

reasoning structure
safety alignment
large reasoning models
harmful responses
malicious queries
Innovation

Methods, ideas, or system contributions that make the work stand out.

reasoning structure
safety alignment
AltTrain
supervised finetuning
large reasoning models