Learning to Clarify by Reinforcement Learning Through Reward-Weighted Fine-Tuning

📅 2025-06-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address QA agents' insufficient understanding of user intent under ambiguous queries, this paper proposes Reward-Weighted Supervised Fine-Tuning (RW-SFT), an end-to-end method for training large language models to autonomously generate high-quality clarifying questions. RW-SFT leverages synthetically constructed dialogue data and adopts an offline reinforcement learning paradigm that incorporates reward signals directly into the supervised objective, thereby circumventing key limitations of standard Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO), such as indirect reward optimization and strong hyperparameter sensitivity. Experiments demonstrate that RW-SFT significantly outperforms SFT and DPO baselines in both clarifying-question quality and downstream answer accuracy, jointly improving reward metrics, linguistic fluency, and semantic relevance. This work establishes a low-overhead, robust paradigm for intent clarification in QA agents.
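The core idea of reward-weighting can be sketched in a few lines. The following is a hypothetical minimal illustration, not the paper's exact objective: the function name `rw_sft_loss`, the input format, and the reward normalization are all assumptions. It shows how a scalar reward per simulated dialogue scales that dialogue's supervised loss, so higher-reward clarifying behavior is reinforced without a separate RL training loop.

```python
def rw_sft_loss(dialogues):
    """Reward-weighted SFT loss over simulated dialogues (illustrative sketch).

    Each dialogue is (token_log_probs, reward): the model's log-probabilities
    for the target tokens and a scalar reward for the whole conversation.
    Standard SFT is recovered as the special case where every reward is 1.
    """
    total, weight_sum = 0.0, 0.0
    for token_log_probs, reward in dialogues:
        nll = -sum(token_log_probs) / len(token_log_probs)  # per-token NLL
        total += reward * nll        # reward scales this example's loss
        weight_sum += reward
    return total / weight_sum        # normalize so the scale matches plain SFT

# Example: the higher-reward dialogue dominates the objective.
batch = [
    ([-0.1, -0.2, -0.3], 1.0),   # good clarifying dialogue, reward 1.0
    ([-1.0, -1.5], 0.2),         # poor dialogue, strongly down-weighted
]
loss = rw_sft_loss(batch)
```

Because the objective is still a (weighted) supervised loss, it can be optimized with any standard fine-tuning pipeline, which is what makes the offline-RL view practical for large language models.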

📝 Abstract
Question answering (QA) agents automatically answer questions posed in natural language. In this work, we learn to ask clarifying questions in QA agents. The key idea in our method is to simulate conversations that contain clarifying questions and learn from them using reinforcement learning (RL). To make RL practical, we propose and analyze offline RL objectives that can be viewed as reward-weighted supervised fine-tuning (SFT) and easily optimized in large language models. Our work stands in stark contrast to recently proposed methods, based on SFT and direct preference optimization, which have additional hyper-parameters and do not directly optimize rewards. We compare to these methods empirically and report gains in both optimized rewards and language quality.
Problem

Research questions and friction points this paper is trying to address.

Enhancing QA agents with clarifying questions via RL
Optimizing offline RL for reward-weighted fine-tuning
Improving reward and language quality over SFT methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement learning for QA clarification
Offline RL with reward-weighted fine-tuning
Optimizes rewards without extra hyper-parameters