Reasoned Safety Alignment: Ensuring Jailbreak Defense via Answer-Then-Check

📅 2025-09-15
📈 Citations: 0 · Influential citations: 0
🤖 AI Summary
Large language models (LLMs) are vulnerable to jailbreak attacks, while conventional safety filters often over-reject benign queries. Method: This paper proposes the Answer-Then-Check (ATC) safety alignment paradigm: the model first generates an answer within its reasoning chain, then autonomously evaluates that answer's safety, and either delivers it or produces a compliant alternative response, sidestepping the limitations of post-hoc filtering. Contribution/Results: The paper introduces the Reasoned Safety Alignment (ReSA) dataset, comprising 80K samples that jointly train chain-of-thought reasoning and safety judgment. Fine-tuning on only 500 samples already achieves performance close to that of the full dataset. Crucially, the approach preserves strong performance on core benchmarks, including MMLU, MATH500, and HumanEval, while significantly improving jailbreak resistance and reducing over-rejection, thereby pushing safety and utility toward the Pareto frontier.
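
The answer-then-check flow can be pictured as a two-stage decoding loop. Below is a minimal sketch assuming a generic chat-completion interface; `generate`, the prompts, and the SAFE/UNSAFE verdict format are hypothetical illustrations, not the paper's actual implementation.

```python
# Hypothetical sketch of the Answer-Then-Check (ATC) inference flow.
# `generate` stands in for any LLM completion call; prompts and names
# are illustrative assumptions, not the paper's released code.

def generate(prompt: str) -> str:
    """Placeholder for a call to an LLM (API or local model)."""
    raise NotImplementedError

def answer_then_check(user_query: str) -> str:
    # Stage 1: draft a direct answer inside the (hidden) reasoning chain.
    draft = generate(f"Answer the following question directly:\n{user_query}")

    # Stage 2: have the model critique the safety of its own draft.
    verdict = generate(
        "Assess whether the answer below is safe to show the user. "
        "Reply with 'SAFE' or 'UNSAFE'.\n"
        f"Question: {user_query}\nDraft answer: {draft}"
    )

    if verdict.strip().upper().startswith("SAFE"):
        # The draft passed the self-check, so it becomes the final answer.
        return draft

    # Unsafe draft: instead of a bare refusal, produce a helpful,
    # policy-compliant alternative (the paper's "safe completion").
    return generate(
        "The draft answer was judged unsafe. Provide a helpful, safe "
        f"alternative response to the user's request:\n{user_query}"
    )
```

In ReSA itself, both stages live inside a single fine-tuned reasoning chain rather than two separate calls; the split above is only for readability.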

📝 Abstract
As large language models (LLMs) continue to advance in capability, ensuring their safety against jailbreak attacks remains a critical challenge. In this paper, we introduce a novel safety alignment approach called Answer-Then-Check, which enhances LLM robustness against malicious prompts by applying the model's thinking ability to mitigate jailbreaking before a final answer is produced for the user. Our method enables models to answer the question directly within their thoughts and then critically evaluate the answer's safety before deciding whether to provide it. To implement this approach, we construct the Reasoned Safety Alignment (ReSA) dataset, comprising 80K examples that teach models to reason through direct responses and then analyze their safety. Experimental results demonstrate that our approach reaches the Pareto frontier, achieving superior safety capability while decreasing refusal rates on over-refusal benchmarks. Notably, the model fine-tuned with ReSA maintains general reasoning capabilities on benchmarks such as MMLU, MATH500, and HumanEval. In addition, our method equips models with the ability to perform safe completion: unlike post-hoc methods that can only reject harmful queries, our model can provide helpful and safe alternative responses on sensitive topics (e.g., self-harm). Furthermore, we discover that training on a small subset of just 500 examples achieves performance comparable to using the full dataset, suggesting that safety alignment may require less data than previously assumed.
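
The abstract implies that each ReSA example packs the drafted answer and its safety analysis into one chain-of-thought training target. A plausible shape for such a supervised fine-tuning record is sketched below; the field names and contents are assumptions for illustration, not the released dataset schema.

```python
# Hypothetical shape of a single ReSA-style fine-tuning record.
# Field names and contents are assumptions; the released dataset may differ.
resa_example = {
    "prompt": "How do I pick a lock?",  # possibly adversarial user query
    "reasoning": (
        "Draft answer: <direct answer attempt>\n"
        "Safety analysis: the draft gives instructions that could enable "
        "illegal entry, so it should not be shown as-is."
    ),
    "final_response": (
        "I can't help with bypassing locks you don't own, but a locksmith "
        "or the lock manufacturer can assist if you're locked out."
    ),
}

# During fine-tuning the model learns to emit `reasoning` (the hidden chain
# of thought) followed by `final_response`, so the safety check happens
# before the user-visible answer rather than in a separate post-hoc filter.
```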
Problem

Research questions and friction points this paper is trying to address.

Enhancing LLM robustness against jailbreak attacks
Mitigating jailbreak attempts before the final answer is produced
Reducing over-refusal rates while maintaining safety capability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Answer-Then-Check safety alignment approach
Reasoned Safety Alignment (ReSA) dataset construction
Critical safety evaluation of the drafted answer before it reaches the user
Authors

Chentao Cao
Ph.D. student, HKBU (Machine Learning, Machine Reasoning)

Xiaojun Xu
ByteDance Seed

Bo Han
TMLR Group, Department of Computer Science, Hong Kong Baptist University

Hang Li
ByteDance Seed