🤖 AI Summary
How can large language models (LLMs) be transformed into proactive, goal-directed agents—rather than passive responders—in high-stakes dialogue settings? Existing approaches are either limited to single-turn optimization or rely on fragile, costly user simulators, resulting in a substantial "reality gap." This paper proposes Learn-to-Ask: a framework that converts long-horizon decision-making into a supervised learning task by retroactively inferring per-turn rewards from the observed future of offline expert dialogue logs. It introduces structured action outputs (comprising *action* and *state_assessment*), LLM-based reward modeling, and a denoising calibration mechanism, eliminating dependence on user simulators. Evaluated on a real-world medical dataset, a 32B-parameter model outperforms human experts and has been deployed at scale in a production AI service, demonstrating practical, real-world impact.
📝 Abstract
Large Language Models (LLMs) excel as passive responders, but teaching them to be proactive, goal-oriented partners—a critical capability in high-stakes domains—remains a major challenge. Current paradigms either myopically optimize single-turn attributes or rely on brittle, high-cost user simulators, creating a persistent "reality gap". To bridge this gap, we introduce `Learn-to-Ask`, a general, simulator-free framework for learning and deploying proactive dialogue agents *directly from offline expert data*, bypassing the need to model complex user dynamics. Our key insight is to reframe the offline policy learning problem by leveraging the **observed future** of each expert trajectory. This allows us to infer a dense, turn-by-turn reward signal grounded in the expert's revealed strategy, decomposing the intractable long-horizon problem into a series of supervised learning tasks, and to train a policy that outputs a structured `(action, state_assessment)` tuple, governing both **what to ask** and, crucially, **when to stop**. To ensure reward fidelity, our Automated Grader Calibration pipeline systematically purges noise from the LLM-based reward model with minimal human supervision. Empirically, we demonstrate the efficacy of `Learn-to-Ask` on a real-world medical dataset, using LLMs of varying sizes up to 32B. Our approach culminates in the successful deployment of LLMs into a live, large-scale online AI service. In rigorous in-house evaluations, our deployed model achieved performance superior even to that of human experts, proving our framework's ability to translate offline data into tangible, real-world impact. We hope this work provides a practical and economically viable blueprint for transforming passive LLMs into proactive, goal-oriented LLM applications.
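To make the core idea concrete, the sketch below illustrates (under assumptions) how a single offline expert trajectory could be decomposed into per-turn supervised examples, with each reward inferred retroactively from the trajectory's observed future. The data classes, the `revealed_later` signal, and the 0/1 reward rule are illustrative simplifications, not the paper's exact implementation (which uses an LLM-based, calibrated grader).

```python
# Hypothetical sketch: turn one long-horizon expert dialogue into a series of
# supervised (state, action, reward) examples. All names and the reward rule
# here are assumptions for illustration only.

from dataclasses import dataclass

@dataclass
class Turn:
    question: str          # the expert's utterance at this turn
    asks_about: str        # attribute the question targets
    revealed_later: bool   # did the future of the log surface this attribute?

@dataclass
class Example:
    history_len: int       # crude proxy for the dialogue state
    action: str            # "ask:<attribute>" or "stop"
    state_assessment: str  # "continue" or "sufficient"
    reward: float          # dense, turn-level signal

def trajectory_to_examples(turns: list[Turn]) -> list[Example]:
    """Decompose one expert trajectory into per-turn supervised examples."""
    examples = []
    for i, t in enumerate(turns):
        last = (i == len(turns) - 1)
        examples.append(Example(
            history_len=i,
            action="stop" if last else f"ask:{t.asks_about}",
            state_assessment="sufficient" if last else "continue",
            # Reward grounded in the expert's revealed strategy: a question
            # is scored 1.0 if its target information surfaces later in the
            # log, 0.0 otherwise; stopping where the expert stopped scores 1.0.
            reward=1.0 if (last or t.revealed_later) else 0.0,
        ))
    return examples

# Usage: a three-turn trajectory where the second question proved redundant.
traj = [
    Turn("Any fever?", "fever", True),
    Turn("Any rash?", "rash", False),
    Turn("Thanks, I have enough information.", "", False),
]
examples = trajectory_to_examples(traj)
```

The point of the decomposition is that each `Example` can be trained on independently with standard supervised losses, which is what sidesteps both long-horizon credit assignment and the need for a user simulator.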