🤖 AI Summary
This work addresses the challenge of modeling long-term dialogue value in open-domain conversational systems, which is hindered by reliance on static user data and the short-sighted bias inherent in conventional reinforcement learning. To overcome this, the authors propose a two-agent adversarial framework: a user agent dynamically simulates user stylistic preferences and proactive termination behavior to drive the dialogue agent's exploration of user interests. They further introduce an Adaptive Tree-based Group Relative Policy Optimization (AT-GRPO) algorithm that structures dialogue trajectories as trees and incorporates a stage-aware reward aggregation mechanism. This approach preserves the ability to model long-horizon rewards while reducing computational complexity from exponential to polynomial. Experimental results demonstrate that the proposed method significantly outperforms existing baselines in terms of performance, sample efficiency, and robustness.
📝 Abstract
Open-ended dialogue agents aim to deliver engaging, personalized interactions by adapting to users' traits, but existing methods face critical limitations: over-reliance on pre-collected user data, and short-horizon biases in reinforcement learning (RL) that neglect long-term dialogue value. To address these limitations, we propose a novel long-horizon RL framework integrating online personalization with Adaptive Tree-based Group Relative Policy Optimization (AT-GRPO). Adopting a two-agent game paradigm, a user agent constructs dynamic environments via style mimicry (learning user-specific conversational traits) and active termination (predicting turn-level termination probabilities as immediate rewards), forming an iterative cycle that drives the dialogue agent to deepen interest exploration. AT-GRPO reinterprets dialogue trajectories as trees and introduces adaptive observation ranges. Unlike full tree expansion, which incurs exponential overhead, it limits each node to aggregating rewards from a stage-aware range: larger ranges support early-stage topic exploration, while smaller ranges facilitate late-stage dialogue maintenance. This design reduces rollout budgets from exponential to polynomial in the dialogue length, while preserving long-term reward capture. Extensive experiments show our framework's superior performance, sample efficiency, and robustness.
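The stage-aware aggregation idea can be made concrete with a minimal sketch. This is not the authors' implementation: the linear window schedule, the window bounds `w_max`/`w_min`, and the GRPO-style group normalization are all illustrative assumptions.

```python
# Illustrative sketch of stage-aware reward aggregation (assumptions:
# linear window schedule, w_max/w_min bounds, GRPO-style normalization).
from statistics import mean, pstdev

def stage_window(t, T, w_max=4, w_min=1):
    """Observation range shrinks from w_max (early turns, exploration)
    to w_min (late turns, maintenance) as the dialogue progresses."""
    frac = t / max(T - 1, 1)
    return round(w_max + frac * (w_min - w_max))

def aggregate(turn_rewards):
    """Each turn's return sums immediate rewards within its stage-aware
    window, rather than expanding the full subtree (exponential cost)."""
    T = len(turn_rewards)
    return [sum(turn_rewards[t:t + stage_window(t, T)]) for t in range(T)]

def grpo_advantages(group_returns):
    """Group-relative advantage: (return - group mean) / group std."""
    mu, sigma = mean(group_returns), pstdev(group_returns) or 1.0
    return [(r - mu) / sigma for r in group_returns]

# Per-turn immediate rewards, e.g. continuation probabilities
rewards = [0.2, 0.5, 0.1, 0.4, 0.3, 0.6]
returns = aggregate(rewards)          # early turns see a wider horizon
advantages = grpo_advantages(returns)  # normalized within the group
```

Because the window never exceeds a fixed bound, the number of reward terms per trajectory grows linearly with dialogue length, consistent with the polynomial rollout budget claimed above.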