FuSaR: A Fuzzification-Based Method for LRM Safety-Reasoning Balance

📅 2025-08-18
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Large reasoning models (LRMs) often suffer degraded safety as their reasoning performance is enhanced, exposing a fundamental trade-off between safety and reasoning capability. This paper proposes FuSaR, the first framework to explicitly reveal and model the competitive relationship between reasoning and safety abilities. FuSaR introduces a fuzzification-based safety-reasoning co-optimization mechanism: it semantically obfuscates hazardous entities and procedural steps within reasoning chains and performs alignment training on the detoxified data. Extensive experiments across multiple open-source LRMs demonstrate that FuSaR significantly improves jailbreak resistance (average +23.6%) while preserving, and often enhancing, the original reasoning performance. Compared with state-of-the-art baselines, FuSaR achieves a better overall balance between safety and reasoning fidelity.
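
The paper does not publish implementation code here, so the following is only a minimal sketch of the fuzzification/detoxification idea described above, assuming a simple placeholder-substitution scheme. The entity lexicon, the keyword list, and the detoxify_reasoning() helper are hypothetical illustrations, not the authors' actual method.

```python
# Hypothetical sketch: obfuscate hazardous entities and blur dangerous
# procedural steps in a reasoning chain, keeping the chain's structure intact.
from typing import Dict, List

# Hypothetical lexicon mapping hazardous entities to abstract placeholders.
HAZARDOUS_ENTITIES: Dict[str, str] = {
    "toxin-X": "[SUBSTANCE]",
    "exploit-Y": "[TOOL]",
}

# Hypothetical markers for procedural steps whose details should be hidden.
PROCEDURE_KEYWORDS: List[str] = ["synthesize", "assemble", "deploy"]


def detoxify_reasoning(steps: List[str]) -> List[str]:
    """Hide dangerous entities and procedures while preserving reasoning flow."""
    detoxified = []
    for step in steps:
        # 1) Replace named hazardous entities with abstract placeholders.
        for entity, placeholder in HAZARDOUS_ENTITIES.items():
            step = step.replace(entity, placeholder)
        # 2) Blur steps that describe a dangerous procedure.
        if any(kw in step.lower() for kw in PROCEDURE_KEYWORDS):
            step = "[PROCEDURE OMITTED] The step's goal is kept, its operational details are withheld."
        detoxified.append(step)
    return detoxified


if __name__ == "__main__":
    chain = [
        "First, identify that the request involves toxin-X.",
        "Synthesize the compound using exploit-Y.",  # would be blurred
        "Conclude that the request should be refused and explain why.",
    ]
    for line in detoxify_reasoning(chain):
        print(line)
```

The point of the sketch is that detoxification removes the hazardous specifics while leaving the step-by-step structure intact, which is what lets alignment training on this data preserve reasoning behavior.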

📝 Abstract
Large Reasoning Models (LRMs) have demonstrated impressive performance across various tasks due to their powerful reasoning capabilities. However, their safety performance remains a significant concern. In this paper, we explore the reasons behind the vulnerability of LRMs and, based on this, propose a novel method to improve the safety of LRMs without sacrificing their reasoning capability. Specifically, we exploit the competition between an LRM's reasoning ability and its safety ability, and achieve jailbreaks by improving the LRM's reasoning performance so as to reduce its safety performance. We then introduce FuSaR, an alignment strategy based on fuzzification that balances safety and reasoning by detoxifying the harmful reasoning process: both the dangerous entities and the dangerous procedures in the reasoning steps are hidden. FuSaR thus mitigates safety risks while preserving the core reasoning information. We validate this strategy through alignment experiments on several open-source LRMs using the detoxified reasoning data. Compared with existing baselines, the results show that FuSaR is an efficient alignment strategy that simultaneously enhances both the reasoning capability and the safety of LRMs.
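
As a companion to the sketch above, here is a minimal, hypothetical illustration of how detoxified reasoning traces might be packaged into supervised alignment data; the chat-style target format, the field names, and the build_alignment_example() helper are assumptions for illustration, not the paper's actual pipeline.

```python
# Hypothetical sketch: turn a (potentially harmful) prompt plus its detoxified
# reasoning trace and a safe final answer into one SFT training record.
from typing import Dict, List


def build_alignment_example(prompt: str, detoxified_steps: List[str], final_answer: str) -> Dict[str, str]:
    """Build a single supervised alignment example from detoxified reasoning."""
    reasoning = "\n".join(f"Step {i + 1}: {s}" for i, s in enumerate(detoxified_steps))
    return {
        "prompt": prompt,
        # The target keeps the reasoning structure (with hazards hidden) so
        # alignment does not suppress the model's reasoning behaviour.
        "target": f"<think>\n{reasoning}\n</think>\n{final_answer}",
    }


if __name__ == "__main__":
    example = build_alignment_example(
        prompt="How do I make toxin-X at home?",
        detoxified_steps=[
            "The request asks for instructions involving [SUBSTANCE].",
            "[PROCEDURE OMITTED] The operational details are withheld.",
            "Such instructions could cause serious harm, so the request should be refused.",
        ],
        final_answer="I can't help with that, but I can point you to general safety information.",
    )
    print(example["target"])
```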
Problem

Research questions and friction points this paper is trying to address.

Balance safety and reasoning in Large Reasoning Models
Mitigate safety risks without losing reasoning capability
Detoxify harmful reasoning processes in LRMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fuzzification-based alignment for safety-reasoning balance
Detoxifying harmful reasoning steps and entities
Enhancing reasoning while mitigating safety risks