Learning to Clarify by Reinforcement Learning Through Reward-Weighted Fine-Tuning

📅 2025-06-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address QA agents' insufficient understanding of user intent under ambiguous queries, this paper proposes Reward-Weighted Supervised Fine-Tuning (RW-SFT), an end-to-end method for training large language models to autonomously generate high-quality clarifying questions. RW-SFT leverages synthetically constructed dialogue data and adopts an offline reinforcement learning paradigm that incorporates reward signals directly into the supervised objective, thereby circumventing key limitations of standard Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO), such as indirect reward optimization and strong hyperparameter sensitivity. Experiments demonstrate that RW-SFT significantly outperforms SFT and DPO baselines in both clarifying-question quality and downstream answer accuracy, jointly improving reward metrics, linguistic fluency, and semantic relevance. This work establishes a low-overhead, robust paradigm for intent clarification in QA agents.
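The core idea of reward-weighting can be sketched in a few lines. The following is a hypothetical minimal illustration, not the paper's exact objective: the function name `rw_sft_loss`, the input format, and the reward normalization are all assumptions. It shows how a scalar reward per simulated dialogue scales that dialogue's supervised loss, so higher-reward clarifying behavior is reinforced without a separate RL training loop.

```python
def rw_sft_loss(dialogues):
    """Reward-weighted SFT loss over simulated dialogues (illustrative sketch).

    Each dialogue is (token_log_probs, reward): the model's log-probabilities
    for the target tokens and a scalar reward for the whole conversation.
    Standard SFT is recovered as the special case where every reward is 1.
    """
    total, weight_sum = 0.0, 0.0
    for token_log_probs, reward in dialogues:
        nll = -sum(token_log_probs) / len(token_log_probs)  # per-token NLL
        total += reward * nll        # reward scales this example's loss
        weight_sum += reward
    return total / weight_sum        # normalize so the scale matches plain SFT

# Example: the higher-reward dialogue dominates the objective.
batch = [
    ([-0.1, -0.2, -0.3], 1.0),   # good clarifying dialogue, reward 1.0
    ([-1.0, -1.5], 0.2),         # poor dialogue, strongly down-weighted
]
loss = rw_sft_loss(batch)
```

Because the objective is still a (weighted) supervised loss, it can be optimized with any standard fine-tuning pipeline, which is what makes the offline-RL view practical for large language models.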

📝 Abstract
Question answering (QA) agents automatically answer questions posed in natural language. In this work, we learn to ask clarifying questions in QA agents. The key idea in our method is to simulate conversations that contain clarifying questions and learn from them using reinforcement learning (RL). To make RL practical, we propose and analyze offline RL objectives that can be viewed as reward-weighted supervised fine-tuning (SFT) and easily optimized in large language models. Our work stands in stark contrast to recently proposed methods, based on SFT and direct preference optimization, which have additional hyper-parameters and do not directly optimize rewards. We compare to these methods empirically and report gains in both optimized rewards and language quality.
Problem

Research questions and friction points this paper is trying to address.

Enhancing QA agents with clarifying questions via RL
Optimizing offline RL for reward-weighted fine-tuning
Improving reward and language quality over SFT methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement learning for QA clarification
Offline RL with reward-weighted fine-tuning
Optimizes rewards without extra hyper-parameters