🤖 AI Summary
This work identifies a pervasive "safety tax" in the safety alignment of Large Reasoning Models (LRMs): while safety fine-tuning restores a model's ability to refuse harmful requests, it substantially degrades reasoning performance. The authors systematically evaluate multiple LRMs on reasoning (GSM8K, MMLU) and safety (SafeBench) benchmarks, observing an average 5.2–13.7% drop in reasoning accuracy after alignment. This constitutes the first empirical demonstration of an intrinsic trade-off between safety capability and reasoning capability in LRMs. To enable fine-grained analysis of refusal behavior, the work introduces DirectRefusal, a rigorously curated safety evaluation dataset; experiments show it improves both the controllability and precision of safety fine-tuning. Together, the findings provide empirical grounding and practical tools for reconciling safety and reasoning performance in LRMs.
📝 Abstract
Safety alignment is an important procedure before the official deployment of a Large Language Model (LLM). While safety alignment has been extensively studied for LLMs, there is still a large research gap for Large Reasoning Models (LRMs), which are equipped with improved reasoning capability. In this paper, we systematically examine a simplified pipeline for producing safety-aligned LRMs. From our evaluation of various LRMs, we deliver two main findings: i) safety alignment can be performed on an LRM to restore its safety capability; ii) safety alignment degrades the reasoning capability of the LRM. These findings show that there exists a trade-off between reasoning and safety capability in the sequential LRM production pipeline. The discovered trade-off, which we name the Safety Tax, should shed light on future safety research on LRMs. As a by-product, we curate a dataset called DirectRefusal, which may serve as an alternative dataset for safety alignment. Our source code is available at https://github.com/git-disl/Safety-Tax.