IntentionReasoner: Facilitating Adaptive LLM Safeguards through Intent Reasoning and Selective Query Refinement

📅 2025-08-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper addresses the challenge of jointly optimizing safety, over-rejection rate, and response quality in large language models (LLMs) by proposing an intent-aware, multi-tier safety framework. Methodologically, the authors (1) construct a large-scale annotated dataset and design an intent-driven, hierarchical safety classification scheme; (2) train a dedicated guard model to perform fine-grained intent recognition, format-constrained rewriting, and safety-aware query transformation; and (3) introduce a customized reinforcement learning strategy that integrates rule-based heuristics with a multi-dimensional reward model covering safety, fluency, and faithfulness. Experiments demonstrate substantial improvements in defense capability across mainstream safety benchmarks (e.g., SafeBench, AdvBench) and jailbreak attack scenarios, achieving an average 37.2% reduction in over-rejection rate while preserving, and in some cases improving, response quality and user satisfaction.
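
To make the multi-tier idea concrete, here is a minimal sketch of how such a guard model could sit in front of a target LLM. The class names, safety tiers, and refusal message are illustrative assumptions, not the paper's implementation:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SafetyTier(Enum):
    # Illustrative tiers; the paper uses a multi-level scheme whose exact labels may differ.
    SAFE = "safe"              # pass the query through unchanged
    BORDERLINE = "borderline"  # rewrite to neutralize potentially harmful intent
    UNSAFE = "unsafe"          # refuse outright


@dataclass
class GuardVerdict:
    reasoning: str                    # the guard model's intent analysis
    tier: SafetyTier                  # assigned safety level
    rewritten_query: Optional[str]    # populated only for borderline queries


def guarded_generate(query: str, guard_model, target_llm) -> str:
    """Route a user query through the guard model before the target LLM (hypothetical interfaces)."""
    verdict: GuardVerdict = guard_model.analyze(query)

    if verdict.tier is SafetyTier.UNSAFE:
        return "I can't help with that request."

    if verdict.tier is SafetyTier.BORDERLINE and verdict.rewritten_query:
        # Edge-case query: answer the safety-aware rewrite instead of the raw input.
        return target_llm.generate(verdict.rewritten_query)

    return target_llm.generate(query)
```

The key design point suggested by the summary is selective intervention: only edge-case (borderline) queries are rewritten, so clearly harmless prompts are not degraded and clearly harmful ones are still refused.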

📝 Abstract
The rapid advancement of large language models (LLMs) has driven their adoption across diverse domains, yet their ability to generate harmful content poses significant safety challenges. While extensive research has focused on mitigating harmful outputs, such efforts often come at the cost of excessively rejecting harmless prompts. Striking a balance among safety, over-refusal, and utility remains a critical challenge. In this work, we introduce IntentionReasoner, a novel safeguard mechanism that leverages a dedicated guard model to perform intent reasoning, multi-level safety classification, and query rewriting to neutralize potentially harmful intent in edge-case queries. Specifically, we first construct a comprehensive dataset comprising approximately 163,000 queries, each annotated with intent reasoning, safety labels, and rewritten versions. Supervised fine-tuning is then applied to equip the guard model with foundational capabilities in format adherence, intent analysis, and safe rewriting. Finally, we apply a tailored multi-reward optimization strategy that integrates rule-based heuristics and reward model signals within a reinforcement learning framework to further enhance performance. Extensive experiments show that IntentionReasoner excels in multiple safeguard benchmarks, generation quality evaluations, and jailbreak attack scenarios, significantly enhancing safety while effectively reducing over-refusal rates and improving the quality of responses.
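
The abstract's "tailored multi-reward optimization" combines rule-based heuristics with reward-model signals. A hedged sketch of what such a blended RL reward could look like is shown below; the weights, the format rule, and the scorer interfaces are assumptions for illustration, not the paper's recipe:

```python
def combined_reward(query: str, guard_output: str,
                    safety_rm, fluency_rm, faithfulness_rm,
                    weights=(0.5, 0.25, 0.25)) -> float:
    """Blend rule-based checks with reward-model scores into one scalar RL reward (illustrative)."""
    # Rule-based gate: outputs violating the required format get no reward,
    # reflecting the paper's emphasis on format adherence.
    if "<verdict>" not in guard_output:  # hypothetical format rule
        return 0.0

    w_safe, w_flu, w_faith = weights
    return (w_safe * safety_rm.score(query, guard_output)
            + w_flu * fluency_rm.score(guard_output)
            + w_faith * faithfulness_rm.score(query, guard_output))
```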
Problem

Research questions and friction points this paper is trying to address.

Balancing safety and utility in LLM safeguards
Reducing over-refusal of harmless prompts
Neutralizing harmful intent in edge-case queries
Innovation

Methods, ideas, or system contributions that make the work stand out.

Intent reasoning and multi-level safety classification
Supervised fine-tuning for format adherence and rewriting
Multi-reward optimization with reinforcement learning framework
Authors
Yuanzhe Shen
School of Computer Science, Fudan University, Shanghai, China
Zisu Huang
School of Computer Science, Fudan University, Shanghai, China
Zhengkang Guo
School of Computer Science, Fudan University, Shanghai, China
Yide Liu
School of Computer Science, Fudan University, Shanghai, China
Guanxu Chen
Shanghai Jiao Tong University
Trustworthy AI, Interpretability
Ruicheng Yin
School of Computer Science, Fudan University, Shanghai, China
Xiaoqing Zheng
Fudan University
Natural Language Processing and Machine Learning
Xuanjing Huang
School of Computer Science, Fudan University, Shanghai, China