Harmonizing Dense and Sparse Signals in Multi-turn RL: Dual-Horizon Credit Assignment for Industrial Sales Agents

📅 2026-03-02
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of jointly optimizing long-term business objectives (e.g., conversion rate) and immediate linguistic constraints (e.g., fluency, compliance) in industrial sales dialogues, where single-reward formulations often lead to training instability or reward hacking. To this end, the authors propose DuCA, a dual-timescale credit assignment framework that decouples turn-level and conversation-level reward signals. DuCA incorporates a Horizon-Independent Advantage Normalization (HIAN) mechanism, which normalizes the advantage functions of the two temporal scales independently prior to gradient updates, thereby balancing dense and sparse rewards. Experimental results show that, compared with a GRPO baseline, DuCA improves conversion rate by 6.82%, reduces inter-utterance repetition by 82.28%, and lowers identity disclosure rate by 27.35%, enhancing both policy performance and linguistic naturalness.

πŸ“ Abstract
Optimizing large language models for industrial sales requires balancing long-term commercial objectives (e.g., conversion rate) with immediate linguistic constraints such as fluency and compliance. Conventional reinforcement learning often merges these heterogeneous goals into a single reward, causing high-magnitude session-level rewards to overwhelm subtler turn-level signals and leading to unstable training or reward hacking. To address this issue, we propose Dual-Horizon Credit Assignment (DuCA), a framework that disentangles optimization across time scales. Its core, Horizon-Independent Advantage Normalization (HIAN), separately normalizes advantages from turn-level and session-level rewards before fusion, ensuring balanced gradient contributions from both immediate and long-term objectives to the policy update. Extensive experiments with a high-fidelity user simulator show that DuCA outperforms the state-of-the-art GRPO baseline, achieving a 6.82% relative improvement in conversion rate, reducing inter-sentence repetition by 82.28%, and lowering identity detection rate by 27.35%, demonstrating substantial gains in an industrial sales scenario while effectively balancing the dual demands of strategic performance and naturalistic language generation.
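The abstract describes HIAN as normalizing turn-level and session-level advantages separately before fusing them. A minimal sketch of that idea is below; the function name, the tensor shapes, and the additive fusion are assumptions for illustration, since the page does not give the paper's exact formulation:

```python
import numpy as np

def hian_advantages(turn_rewards, session_rewards, eps=1e-8):
    """Sketch of Horizon-Independent Advantage Normalization (HIAN).

    turn_rewards:    dense per-turn rewards for a batch of rollouts, shape (B, T)
    session_rewards: sparse conversation-level reward per rollout, shape (B,)
    Returns fused per-turn advantages, shape (B, T).

    Hypothetical interface: the paper may use baselines, discounting,
    or a different fusion rule.
    """
    # Normalize the dense turn-level stream to zero mean / unit scale.
    a_turn = (turn_rewards - turn_rewards.mean()) / (turn_rewards.std() + eps)

    # Normalize the sparse session-level stream independently,
    # then broadcast it to every turn of its conversation.
    a_sess = (session_rewards - session_rewards.mean()) / (session_rewards.std() + eps)
    a_sess = np.broadcast_to(a_sess[:, None], turn_rewards.shape)

    # Because both streams are now unit-scale, neither horizon's reward
    # magnitude dominates the policy-gradient update.
    return a_turn + a_sess
```

The key point the abstract makes is the ordering: normalize each horizon *before* fusion, so a large session-level reward (e.g., a conversion bonus) cannot drown out the subtler turn-level fluency and compliance signals.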
Problem

Research questions and friction points this paper is trying to address.

multi-turn reinforcement learning
credit assignment
dense and sparse rewards
industrial sales agents
heterogeneous objectives
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-Horizon Credit Assignment
Horizon-Independent Advantage Normalization
multi-turn reinforcement learning
reward disentanglement
industrial sales agents
Haojin Yang
Hasso Plattner Institute
Efficient Deep Learning · Edge AI · Sustainable AI
Ai Jian
Beijing University of Posts and Telecommunications
Xinyue Huang
Sun Yat-sen University
Yiwei Wang
University of California at Merced
Natural Language Processing · Vision Language Models
Weipeng Zhang
Meituan
Ke Zeng
Meituan
Xunliang Cai
Meituan
Jingqing Ruan
Meituan