🤖 AI Summary
This work identifies a pervasive "safety tax" in the safety alignment of Large Reasoning Models (LRMs): while safety fine-tuning restores a model's ability to refuse harmful requests, it substantially degrades reasoning performance. The authors systematically evaluate multiple LRMs on reasoning (GSM8K, MMLU) and safety (SafeBench) benchmarks, observing an average 5.2–13.7% drop in reasoning accuracy after alignment. This constitutes the first empirical demonstration of an intrinsic trade-off between safety capability and reasoning capability in LRMs. To enable fine-grained analysis of refusal behavior, the work introduces DirectRefusal, a rigorously curated safety evaluation dataset; experiments show it improves both the controllability and precision of safety fine-tuning. Together, the findings provide empirical grounding and practical tools for reconciling safety and reasoning performance in LRMs.
📝 Abstract
Safety alignment is an important procedure before the official deployment of a Large Language Model (LLM). While safety alignment has been extensively studied for LLMs, there is still a large research gap for Large Reasoning Models (LRMs), which are equipped with improved reasoning capability. In this paper, we systematically examine a simplified pipeline for producing safety-aligned LRMs. From our evaluation of various LRMs, we deliver two main findings: i) safety alignment can be performed on an LRM to restore its safety capability; ii) safety alignment degrades the reasoning capability of the LRM. These findings show that there exists a trade-off between reasoning and safety capability in the sequential LRM production pipeline. The discovered trade-off, which we name the Safety Tax, should shed light on future safety research on LRMs. As a by-product, we curate a dataset called DirectRefusal, which may serve as an alternative dataset for safety alignment. Our source code is available at https://github.com/git-disl/Safety-Tax.