Listening to the Echo: User-Reaction Aware Policy Optimization via Scalar-Verbal Hybrid Reinforcement Learning

📅 2026-03-16
🤖 AI Summary
Existing emotionally supportive dialogue systems rely on sparse expert-provided scalar rewards, which hinder interpretability of failure cases and adaptation to users' dynamic emotional states, often leading to deviations from the goal of fostering positive emotional shifts. To address this, this work proposes the RAPO framework, which conceptualizes dialogue as a user-response-driven process. RAPO leverages Hindsight Dialogue Selection and Generative Hindsight Feedback to construct dense natural language evaluations, and integrates both scalar and verbal feedback into a Scalar-Verbal Hybrid Policy Optimization mechanism. Notably, it introduces the first contrastive-signal-based joint optimization of language and scalar rewards. Experiments on the ESC and Sotopia datasets demonstrate that RAPO significantly outperforms existing reinforcement learning baselines, effectively enhancing the system's capacity to guide conversations toward positive emotional transitions.

πŸ“ Abstract
While current emotional support dialogue systems typically rely on expert-defined scalar rewards for alignment, these signals suffer from severe information sparsity. They cannot explain why a response failed or how to adapt to dynamic user states, often diverging from the actual goal of facilitating positive emotional shifts. In practice, the most direct and reliable learning signal emerges from the user's continuous reactions during ongoing interaction. We therefore propose Reaction Aware Policy Optimization (RAPO), a framework that optimizes over interaction consequences rather than rubric scores. RAPO treats dialogue as a reaction-driven process and utilizes simulated user responses to generate dense natural-language feedback through three core components: Hindsight Dialogue Selection, which isolates pivotal turns that meaningfully alter user emotional trajectories; Generative Hindsight Feedback, which transforms user reactions into contrastive ranking signals and natural-language critiques; and Scalar-Verbal Hybrid Policy Optimization, which couples scalar reward optimization for global alignment with verbal feedback distillation for fine-grained semantic refinement. Extensive experiments on ESC and Sotopia demonstrate that RAPO significantly outperforms strong reinforcement learning baselines in driving positive interaction outcomes.
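The abstract does not give formulas, so the following is only a minimal sketch of two of the ideas it names: picking out pivotal turns and combining a scalar term with a verbal-feedback term. The `Turn` fields, the `min_shift` threshold, the `verbal_weight` coefficient, and the weighted-sum form of the objective are all illustrative assumptions, not the authors' definitions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    response: str          # system response at this turn
    emotion_before: float  # hypothetical user emotion score before the response
    emotion_after: float   # hypothetical score after the simulated user reacts


def select_pivotal_turns(turns: List[Turn], min_shift: float = 0.5) -> List[int]:
    """Return indices of turns whose emotional shift is at least min_shift in magnitude.

    Assumption: "pivotal" is approximated by a simple threshold on the score change.
    """
    return [
        i for i, t in enumerate(turns)
        if abs(t.emotion_after - t.emotion_before) >= min_shift
    ]


def hybrid_loss(scalar_loss: float, verbal_loss: float, verbal_weight: float = 0.5) -> float:
    """Weighted sum of a scalar-reward loss and a verbal-feedback distillation loss.

    Assumption: the two signals are coupled by plain addition with a fixed weight.
    """
    return scalar_loss + verbal_weight * verbal_loss


if __name__ == "__main__":
    dialogue = [
        Turn("I hear you.", 0.0, 0.25),
        Turn("Just get over it.", 0.25, -1.0),
        Turn("That sounds really hard.", -1.0, 0.0),
    ]
    print(select_pivotal_turns(dialogue))  # turns with a shift of at least 0.5
    print(hybrid_loss(1.0, 2.0))
```

In this toy setup, only the second and third turns cross the threshold, so only they would be passed on for critique generation; a real system would presumably obtain the emotion scores from the simulated user rather than from hand-set numbers.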
Problem

Research questions and friction points this paper is trying to address.

emotional support dialogue
scalar rewards
user reactions
information sparsity
emotional alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reaction-Aware Policy Optimization
Scalar-Verbal Hybrid Reinforcement Learning
Generative Hindsight Feedback
Hindsight Dialogue Selection
Emotional Support Dialogue Systems
Jing Ye
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, CAS, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
Xinpei Zhao
Independent Researcher
Lu Xiang
Institute of Automation, Chinese Academy of Sciences
Dialogue Systems, NLP
Yaping Zhang
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, CAS, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
Chengqing Zong
State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, CAS, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China