Dialogue Model Optimization via Agent Game and Adaptive Tree-based GRPO

πŸ“… 2026-02-09
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses the challenge of modeling long-term dialogue value in open-domain conversational systems, which is hindered by reliance on static user data and the short-sighted bias inherent in conventional reinforcement learning. To overcome this, the authors propose a two-agent adversarial framework: a user agent dynamically simulates user stylistic preferences and proactive termination behavior to drive the dialogue agent's exploration of user interests. They further introduce an Adaptive Tree-based Group Relative Policy Optimization (AT-GRPO) algorithm that structures dialogue trajectories as trees and incorporates a stage-aware reward aggregation mechanism. This approach preserves the ability to model long-horizon rewards while reducing computational complexity from exponential to polynomial. Experimental results demonstrate that the proposed method significantly outperforms existing baselines in terms of performance, sample efficiency, and robustness.

πŸ“ Abstract
Open-ended dialogue agents aim to deliver engaging, personalized interactions by adapting to users' traits, but existing methods face critical limitations: over-reliance on pre-collected user data, and short-horizon biases in reinforcement learning (RL) that neglect long-term dialogue value. To address these, we propose a novel long-horizon RL framework integrating online personalization with Adaptive Tree-based Group Relative Policy Optimization (AT-GRPO). Adopting a two-agent game paradigm, a user agent constructs dynamic environments via style mimicry (learning user-specific conversational traits) and active termination (predicting turn-level termination probabilities as immediate rewards), forming an iterative cycle that drives the dialogue agent to deepen interest exploration. AT-GRPO reinterprets dialogue trajectories as trees and introduces adaptive observation ranges. Unlike full tree expansion that incurs exponential overhead, it limits each node to aggregate rewards from a stage-aware range: larger ranges support early-stage topic exploration, while smaller ranges facilitate late-stage dialogue maintenance. This design reduces rollout budgets from exponential to polynomial in the dialogue length, while preserving long-term reward capture. Extensive experiments show our framework's superior performance, sample efficiency, and robustness.
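The abstract's stage-aware aggregation can be sketched as follows. This is a minimal illustrative reading of the description only, not the paper's code: the `TurnNode` structure, the shrinking `observation_range` schedule, and the discounted averaging over siblings are all assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class TurnNode:
    """One dialogue turn in the trajectory tree (hypothetical structure)."""
    reward: float                          # immediate reward, e.g. a non-termination signal
    children: list = field(default_factory=list)

def observation_range(depth: int, max_depth: int, max_range: int = 4) -> int:
    """Stage-aware range: larger look-ahead early in the dialogue
    (topic exploration), smaller late (dialogue maintenance)."""
    remaining = max_depth - depth
    return max(1, min(max_range, remaining))

def aggregate(node: TurnNode, depth: int, max_depth: int, gamma: float = 0.9) -> float:
    """Discounted reward sum restricted to the node's stage-aware range.
    Only k levels below the node are observed, so the rollout budget grows
    polynomially in dialogue length instead of expanding the full tree."""
    k = observation_range(depth, max_depth)
    total, frontier, discount = node.reward, node.children, gamma
    for _ in range(k):
        if not frontier:
            break
        # Average over siblings so branching does not inflate the estimate.
        total += discount * sum(c.reward for c in frontier) / len(frontier)
        frontier = [g for c in frontier for g in c.children]
        discount *= gamma
    return total

# Toy three-turn dialogue tree.
root = TurnNode(1.0, [TurnNode(0.5, [TurnNode(0.25)]), TurnNode(0.7)])
result = aggregate(root, depth=0, max_depth=3)
```

Under this reading, an early-stage node averages rewards several turns ahead, while a node near the dialogue's end only sees its immediate continuation, which matches the exploration-then-maintenance schedule the abstract describes.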
Problem

Research questions and friction points this paper is trying to address.

open-ended dialogue
personalization
long-horizon reinforcement learning
short-horizon bias
user data dependency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Agent Game
Adaptive Tree-based GRPO
Long-horizon Reinforcement Learning
Online Personalization
Dialogue Trajectory Tree
πŸ”Ž Similar Papers
No similar papers found.
Kun Peng
Institute of Information Engineering, Chinese Academy of Sciences
Conghui Tan
Tencent
Yu Liu
Institute of Information Engineering, Chinese Academy of Sciences
Guohua Tang
Tencent
Zhongqian Sun
Tencent
Wei Yang
Tencent
Zining Zhu
Stevens Institute of Technology
Natural Language Processing, Explainable AI
Lei Jiang
Technical Institute of Physics and Chemistry, Chinese Academy of Sciences
bio-inspired interfacial materials with superwettability
Yanbing Liu
Institute of Information Engineering, Chinese Academy of Sciences
Hao Peng
Institute of Information Engineering, Chinese Academy of Sciences