GRAIT: Gradient-Driven Refusal-Aware Instruction Tuning for Effective Hallucination Mitigation

📅 2025-02-09

📈 Citations: 0

✨ Influential: 0

career value

174K/year

🤖 AI Summary

Large language models (LLMs) face a dual challenge: generating hallucinated outputs on unfamiliar queries while excessive refusal undermines response utility. To address this, we propose a gradient-driven refusal-aware instruction tuning framework. Unlike conventional approaches relying on output probabilities or confidence scores, our method is the first to model refusal behavior from the perspective of parameter gradients. It introduces gradient-guided sample selection, a refusal-response identification loss, and an adaptive weighting strategy for fine-tuning—enabling dynamic trade-off between refusal accuracy and response usefulness. Evaluated on open-ended generation and multiple-choice QA tasks, our approach significantly outperforms existing refusal-aware methods: hallucination rates decrease by up to 37%, and the rate of useful responses increases by 12%, thereby jointly enhancing reliability and practicality.

Technology Category

Application Category

📝 Abstract

Refusal-Aware Instruction Tuning (RAIT) aims to enhance Large Language Models (LLMs) by improving their ability to refuse responses to questions beyond their knowledge, thereby reducing hallucinations and improving reliability. Effective RAIT must address two key challenges: firstly, effectively reject unknown questions to minimize hallucinations; secondly, avoid over-refusal to ensure questions that can be correctly answered are not rejected, thereby maintain the helpfulness of LLM outputs. In this paper, we address the two challenges by deriving insightful observations from the gradient-based perspective, and proposing the Gradient-driven Refusal Aware Instruction Tuning Framework GRAIT: (1) employs gradient-driven sample selection to effectively minimize hallucinations and (2) introduces an adaptive weighting mechanism during fine-tuning to reduce the risk of over-refusal, achieving the balance between accurate refusals and maintaining useful responses. Experimental evaluations on open-ended and multiple-choice question answering tasks demonstrate that GRAIT significantly outperforms existing RAIT methods in the overall performance. The source code and data will be available at https://github.com/opendatalab/GRAIT .

Problem

Research questions and friction points this paper is trying to address.

Enhance LLMs' ability to refuse unknown questions.

Minimize hallucinations while maintaining response usefulness.

Balance accurate refusals with avoiding over-refusal.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Gradient-driven sample selection

Adaptive weighting mechanism

Balance accurate refusals

🔎 Similar Papers

Prescribing the Right Remedy: Mitigating Hallucinations in Large Vision-Language Models via Targeted Instruction Tuning