🤖 AI Summary
How can large language models (LLMs) be transformed into proactive, goal-directed agents—rather than passive responders—in high-stakes dialogue settings? Existing approaches are either limited to single-turn optimization or rely on fragile, costly user simulators, resulting in a substantial "reality gap." This paper proposes Learn-to-Ask: a framework that converts long-horizon decision-making into a supervised learning task by retroactively inferring per-turn rewards from the observed future of offline expert dialogue logs. It introduces structured action outputs (comprising *action* and *state_assessment*), LLM-based reward modeling, and a denoising calibration mechanism, eliminating dependence on user simulators. Evaluated on a real-world medical dataset, a 32B-parameter model outperforms human experts and has been deployed at scale in a production AI service, demonstrating practical, real-world impact.
📝 Abstract
Large Language Models (LLMs) excel as passive responders, but teaching them to be proactive, goal-oriented partners—a critical capability in high-stakes domains—remains a major challenge. Current paradigms either myopically optimize single-turn attributes or rely on brittle, high-cost user simulators, creating a persistent "reality gap". To bridge this gap, we introduce `Learn-to-Ask`, a general, simulator-free framework for learning and deploying proactive dialogue agents *directly from offline expert data*, bypassing the need to model complex user dynamics. Our key insight is to reframe the offline policy learning problem by leveraging the **observed future** of each expert trajectory. This allows us to infer a dense, turn-by-turn reward signal grounded in the expert's revealed strategy, decomposing the intractable long-horizon problem into a series of supervised learning tasks, and to train a policy that outputs a structured `(action, state_assessment)` tuple, governing both **what to ask** and, crucially, **when to stop**. To ensure reward fidelity, our Automated Grader Calibration pipeline systematically purges noise from the LLM-based reward model with minimal human supervision. Empirically, we demonstrate the efficacy of `Learn-to-Ask` on a real-world medical dataset, using LLMs of varying sizes up to 32B. Our approach culminates in the successful deployment of LLMs into a live, large-scale online AI service. In rigorous in-house evaluations, our deployed model achieved performance superior even to that of human experts, proving our framework's ability to translate offline data into tangible, real-world impact. We hope this work provides a practical and economically viable blueprint for transforming passive LLMs into proactive, goal-oriented LLM applications.
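To make the core idea concrete, the sketch below illustrates (under assumptions) how a single offline expert trajectory could be decomposed into per-turn supervised examples, with each reward inferred retroactively from the trajectory's observed future. The data classes, the `revealed_later` signal, and the 0/1 reward rule are illustrative simplifications, not the paper's exact implementation (which uses an LLM-based, calibrated grader).

```python
# Hypothetical sketch: turn one long-horizon expert dialogue into a series of
# supervised (state, action, reward) examples. All names and the reward rule
# here are assumptions for illustration only.

from dataclasses import dataclass

@dataclass
class Turn:
    question: str          # the expert's utterance at this turn
    asks_about: str        # attribute the question targets
    revealed_later: bool   # did the future of the log surface this attribute?

@dataclass
class Example:
    history_len: int       # crude proxy for the dialogue state
    action: str            # "ask:<attribute>" or "stop"
    state_assessment: str  # "continue" or "sufficient"
    reward: float          # dense, turn-level signal

def trajectory_to_examples(turns: list[Turn]) -> list[Example]:
    """Decompose one expert trajectory into per-turn supervised examples."""
    examples = []
    for i, t in enumerate(turns):
        last = (i == len(turns) - 1)
        examples.append(Example(
            history_len=i,
            action="stop" if last else f"ask:{t.asks_about}",
            state_assessment="sufficient" if last else "continue",
            # Reward grounded in the expert's revealed strategy: a question
            # is scored 1.0 if its target information surfaces later in the
            # log, 0.0 otherwise; stopping where the expert stopped scores 1.0.
            reward=1.0 if (last or t.revealed_later) else 0.0,
        ))
    return examples

# Usage: a three-turn trajectory where the second question proved redundant.
traj = [
    Turn("Any fever?", "fever", True),
    Turn("Any rash?", "rash", False),
    Turn("Thanks, I have enough information.", "", False),
]
examples = trajectory_to_examples(traj)
```

The point of the decomposition is that each `Example` can be trained on independently with standard supervised losses, which is what sidesteps both long-horizon credit assignment and the need for a user simulator.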