🤖 AI Summary
This work addresses the challenge of modeling long-term dialogue value in open-domain conversational systems, which is hindered by reliance on static user data and the short-sighted bias inherent in conventional reinforcement learning. To overcome this, the authors propose a two-agent adversarial framework: a user agent dynamically simulates user stylistic preferences and proactive termination behavior to drive the dialogue agent's exploration of user interests. They further introduce an Adaptive Tree-based Group Relative Policy Optimization (AT-GRPO) algorithm that structures dialogue trajectories as trees and incorporates a stage-aware reward aggregation mechanism. This approach preserves the ability to model long-horizon rewards while reducing computational complexity from exponential to polynomial. Experimental results demonstrate that the proposed method significantly outperforms existing baselines in terms of performance, sample efficiency, and robustness.
📝 Abstract
Open-ended dialogue agents aim to deliver engaging, personalized interactions by adapting to users' traits, but existing methods face critical limitations: over-reliance on pre-collected user data, and short-horizon biases in reinforcement learning (RL) that neglect long-term dialogue value. To address these limitations, we propose a novel long-horizon RL framework integrating online personalization with Adaptive Tree-based Group Relative Policy Optimization (AT-GRPO). Adopting a two-agent game paradigm, a user agent constructs dynamic environments via style mimicry (learning user-specific conversational traits) and active termination (predicting turn-level termination probabilities as immediate rewards), forming an iterative cycle that drives the dialogue agent to deepen interest exploration. AT-GRPO reinterprets dialogue trajectories as trees and introduces adaptive observation ranges. Unlike full tree expansion, which incurs exponential overhead, it limits each node to aggregating rewards from a stage-aware range: larger ranges support early-stage topic exploration, while smaller ranges facilitate late-stage dialogue maintenance. This design reduces rollout budgets from exponential to polynomial in the dialogue length, while preserving long-term reward capture. Extensive experiments show our framework's superior performance, sample efficiency, and robustness.
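The stage-aware aggregation idea can be made concrete with a minimal sketch. This is not the authors' implementation: the linear window schedule, the window bounds `w_max`/`w_min`, and the GRPO-style group normalization are all illustrative assumptions.

```python
# Illustrative sketch of stage-aware reward aggregation (assumptions:
# linear window schedule, w_max/w_min bounds, GRPO-style normalization).
from statistics import mean, pstdev

def stage_window(t, T, w_max=4, w_min=1):
    """Observation range shrinks from w_max (early turns, exploration)
    to w_min (late turns, maintenance) as the dialogue progresses."""
    frac = t / max(T - 1, 1)
    return round(w_max + frac * (w_min - w_max))

def aggregate(turn_rewards):
    """Each turn's return sums immediate rewards within its stage-aware
    window, rather than expanding the full subtree (exponential cost)."""
    T = len(turn_rewards)
    return [sum(turn_rewards[t:t + stage_window(t, T)]) for t in range(T)]

def grpo_advantages(group_returns):
    """Group-relative advantage: (return - group mean) / group std."""
    mu, sigma = mean(group_returns), pstdev(group_returns) or 1.0
    return [(r - mu) / sigma for r in group_returns]

# Per-turn immediate rewards, e.g. continuation probabilities
rewards = [0.2, 0.5, 0.1, 0.4, 0.3, 0.6]
returns = aggregate(rewards)          # early turns see a wider horizon
advantages = grpo_advantages(returns)  # normalized within the group
```

Because the window never exceeds a fixed bound, the number of reward terms per trajectory grows linearly with dialogue length, consistent with the polynomial rollout budget claimed above.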