IntentionReasoner: Facilitating Adaptive LLM Safeguards through Intent Reasoning and Selective Query Refinement

📅 2025-08-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses the challenge of jointly optimizing safety, over-rejection rate, and response quality in large language models (LLMs) by proposing an intent-aware, multi-tier safety framework. Methodologically, the authors (1) construct a large-scale annotated dataset and design an intent-driven, hierarchical safety classification scheme; (2) train a dedicated guard model to perform fine-grained intent recognition, format-constrained rewriting, and safety-aware query transformation; and (3) introduce a customized reinforcement learning strategy that integrates rule-based heuristics with a multi-dimensional reward model covering safety, fluency, and faithfulness. Experiments demonstrate substantial improvements in defense capability across mainstream safety benchmarks (e.g., SafeBench, AdvBench) and jailbreak attack scenarios, achieving an average 37.2% reduction in over-rejection rate while preserving, and in some cases improving, response quality and user satisfaction.
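
To make the multi-tier idea concrete, here is a minimal sketch of how such a guard model could sit in front of a target LLM. The class names, safety tiers, and refusal message are illustrative assumptions, not the paper's implementation:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SafetyTier(Enum):
    # Illustrative tiers; the paper uses a multi-level scheme whose exact labels may differ.
    SAFE = "safe"              # pass the query through unchanged
    BORDERLINE = "borderline"  # rewrite to neutralize potentially harmful intent
    UNSAFE = "unsafe"          # refuse outright


@dataclass
class GuardVerdict:
    reasoning: str                    # the guard model's intent analysis
    tier: SafetyTier                  # assigned safety level
    rewritten_query: Optional[str]    # populated only for borderline queries


def guarded_generate(query: str, guard_model, target_llm) -> str:
    """Route a user query through the guard model before the target LLM (hypothetical interfaces)."""
    verdict: GuardVerdict = guard_model.analyze(query)

    if verdict.tier is SafetyTier.UNSAFE:
        return "I can't help with that request."

    if verdict.tier is SafetyTier.BORDERLINE and verdict.rewritten_query:
        # Edge-case query: answer the safety-aware rewrite instead of the raw input.
        return target_llm.generate(verdict.rewritten_query)

    return target_llm.generate(query)
```

The key design point suggested by the summary is selective intervention: only edge-case (borderline) queries are rewritten, so clearly harmless prompts are not degraded and clearly harmful ones are still refused.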

📝 Abstract
The rapid advancement of large language models (LLMs) has driven their adoption across diverse domains, yet their ability to generate harmful content poses significant safety challenges. While extensive research has focused on mitigating harmful outputs, such efforts often come at the cost of excessively rejecting harmless prompts. Striking a balance among safety, over-refusal, and utility remains a critical challenge. In this work, we introduce IntentionReasoner, a novel safeguard mechanism that leverages a dedicated guard model to perform intent reasoning, multi-level safety classification, and query rewriting to neutralize potentially harmful intent in edge-case queries. Specifically, we first construct a comprehensive dataset comprising approximately 163,000 queries, each annotated with intent reasoning, safety labels, and rewritten versions. Supervised fine-tuning is then applied to equip the guard model with foundational capabilities in format adherence, intent analysis, and safe rewriting. Finally, we apply a tailored multi-reward optimization strategy that integrates rule-based heuristics and reward model signals within a reinforcement learning framework to further enhance performance. Extensive experiments show that IntentionReasoner excels in multiple safeguard benchmarks, generation quality evaluations, and jailbreak attack scenarios, significantly enhancing safety while effectively reducing over-refusal rates and improving the quality of responses.
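
The abstract's "tailored multi-reward optimization" combines rule-based heuristics with reward-model signals. A hedged sketch of what such a blended RL reward could look like is shown below; the weights, the format rule, and the scorer interfaces are assumptions for illustration, not the paper's recipe:

```python
def combined_reward(query: str, guard_output: str,
                    safety_rm, fluency_rm, faithfulness_rm,
                    weights=(0.5, 0.25, 0.25)) -> float:
    """Blend rule-based checks with reward-model scores into one scalar RL reward (illustrative)."""
    # Rule-based gate: outputs violating the required format get no reward,
    # reflecting the paper's emphasis on format adherence.
    if "<verdict>" not in guard_output:  # hypothetical format rule
        return 0.0

    w_safe, w_flu, w_faith = weights
    return (w_safe * safety_rm.score(query, guard_output)
            + w_flu * fluency_rm.score(guard_output)
            + w_faith * faithfulness_rm.score(query, guard_output))
```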
Problem

Research questions and friction points this paper is trying to address.

Balancing safety and utility in LLM safeguards
Reducing over-refusal of harmless prompts
Neutralizing harmful intent in edge-case queries
Innovation

Methods, ideas, or system contributions that make the work stand out.

Intent reasoning and multi-level safety classification
Supervised fine-tuning for format adherence and rewriting
Multi-reward optimization with reinforcement learning framework
Authors
Yuanzhe Shen
School of Computer Science, Fudan University, Shanghai, China
Zisu Huang
School of Computer Science, Fudan University, Shanghai, China
Zhengkang Guo
School of Computer Science, Fudan University, Shanghai, China
Yide Liu
School of Computer Science, Fudan University, Shanghai, China
Guanxu Chen
Shanghai Jiao Tong University
Trustworthy AI, Interpretability
Ruicheng Yin
School of Computer Science, Fudan University, Shanghai, China
Xiaoqing Zheng
Fudan University
Natural Language Processing and Machine Learning
Xuanjing Huang
School of Computer Science, Fudan University, Shanghai, China