Reinforcing Real-world Service Agents: Balancing Utility and Cost in Task-oriented Dialogue

📅 2026-02-26
🤖 AI Summary
This work addresses the challenge of balancing user satisfaction and service cost in task-oriented dialogue systems, a trade-off inadequately captured by existing approaches. To this end, we propose InteractCS-RL, a framework that formulates dialogue management as a multi-granularity reinforcement learning process, incorporating a user-centric interaction mechanism and a cost-aware multi-turn policy optimization (CMPO) strategy. Our approach integrates a high-fidelity user-role-driven simulation environment, hybrid advantage estimation, credit assignment over generation steps, and a PID-Lagrangian cost controller, enabling the first practical exploration of the Pareto frontier in real-world deployment scenarios. Experimental results demonstrate that InteractCS-RL significantly outperforms baseline methods across three key metrics: user reward, cost control, and task success rate.

📝 Abstract
The rapid evolution of Large Language Models (LLMs) has accelerated the transition from conversational chatbots to general agents. However, effectively balancing empathetic communication with budget-aware decision-making remains an open challenge. Since existing methods fail to capture these complex strategic trade-offs, we propose InteractCS-RL, a framework that reframes task-oriented dialogue as a multi-granularity reinforcement learning process. Specifically, we first establish a User-centric Interaction Framework to provide a high-fidelity training gym, enabling agents to dynamically explore diverse strategies with persona-driven users. Then, we introduce Cost-aware Multi-turn Policy Optimization (CMPO) with a hybrid advantage estimation strategy. By integrating generative process credits and employing a PID-Lagrangian cost controller, CMPO effectively guides the policy to explore the Pareto boundary between user reward and global cost constraints. Extensive experiments on customized real business scenarios demonstrate that InteractCS-RL significantly outperforms the baselines across three evaluation dimensions. Further evaluation on tool-agent-user interaction benchmarks verifies the robustness of InteractCS-RL across diverse domains.
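
The abstract names a PID-Lagrangian cost controller but gives no implementation details. As a rough illustration of the general idea only, the sketch below shows a generic PID-style update of a Lagrange multiplier that penalizes cost when the average episode cost exceeds a budget. Everything in it is an assumption made for illustration, not the authors' method: the class name `PIDLagrangian`, the gains `kp`, `ki`, `kd`, the `cost_limit` parameter, and the `penalized_reward` helper are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PIDLagrangian:
    """Generic PID-controlled Lagrange multiplier (illustrative, hypothetical names)."""

    cost_limit: float                  # assumed global cost budget per episode
    kp: float = 0.1                    # proportional gain (assumed value)
    ki: float = 0.01                   # integral gain (assumed value)
    kd: float = 0.0                    # derivative gain (assumed value)
    integral: float = 0.0              # accumulated constraint violation, kept >= 0
    prev_cost: Optional[float] = None  # cost observed at the previous update

    def update(self, mean_cost: float) -> float:
        """Return the new multiplier given the latest mean episode cost."""
        error = mean_cost - self.cost_limit
        # Integral term accumulates violations and is clipped at zero.
        self.integral = max(0.0, self.integral + error)
        # Derivative term reacts only to rising cost.
        if self.prev_cost is None:
            derivative = 0.0
        else:
            derivative = max(0.0, mean_cost - self.prev_cost)
        self.prev_cost = mean_cost
        # The multiplier is projected onto the non-negative reals.
        return max(0.0, self.kp * error + self.ki * self.integral + self.kd * derivative)


def penalized_reward(reward: float, cost: float, lam: float) -> float:
    """Lagrangian-relaxed objective for one episode: reward minus weighted cost."""
    return reward - lam * cost


controller = PIDLagrangian(cost_limit=1.0, kp=0.5, ki=0.1)
lam_over_budget = controller.update(2.0)   # cost above the limit raises the multiplier
lam_under_budget = controller.update(0.0)  # cost below the limit lets it fall back
```

In a sketch like this, the multiplier rises while the policy overspends and decays once cost falls back under the budget, so sweeping `cost_limit` is one plausible way to trace different points on a reward-cost trade-off curve.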
Problem

Research questions and friction points this paper is trying to address.

task-oriented dialogue
utility-cost trade-off
budget-aware decision-making
empathetic communication
service agents
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement Learning
Cost-aware Policy Optimization
Task-oriented Dialogue
Multi-granularity Interaction
Pareto Optimization
Ning Gao
Meituan, Beijing, China
Wei Zhang
Meituan, Beijing, China
Yuqin Dai
Tsinghua University
LLM, AI4Science, Avatar, Generative Model
Ling Shi
Meituan, Beijing, China
Ziyin Wang
Meituan, Beijing, China
Yujie Wang
Meituan, Beijing, China
Wei He
Meituan, Beijing, China
Jinpeng Wang
Meituan, Beijing, China
Chaozheng Wang
The Chinese University of Hong Kong
software engineering, artificial intelligence