🤖 AI Summary
This paper addresses the limited adaptability of user-centric intelligent agents in multi-turn dynamic interactions. The authors propose UserRL, a unified reinforcement learning framework that standardizes training and evaluation via Gym-compatible environments paired with simulated users (e.g., Qwen3, GPT-4o). UserRL introduces a systematic reward-shaping study, empirically validates the critical role of supervised fine-tuning (SFT) as cold-start initialization for subsequent RL optimization, and employs the GRPO algorithm while varying turn-level reward assignment and trajectory-level score calculation. Experiments demonstrate that UserRL significantly improves multi-turn interaction efficiency and response quality, enables stable training across model scales, and achieves a favorable cost-performance trade-off with open-source user simulators. The code and datasets are publicly released.
📝 Abstract
Reinforcement learning (RL) has shown promise in training agentic models that move beyond static benchmarks to engage in dynamic, multi-turn interactions. Yet, the ultimate value of such agents lies in their ability to assist users, a setting where the diversity and dynamics of user interaction pose challenges. In this work, we propose UserRL, a unified framework for training and evaluating user-centric abilities through standardized gym environments paired with simulated users. We systematically vary turn-level reward assignment and trajectory-level score calculation to analyze how different formulations affect learning under the GRPO algorithm. Our experiments across Qwen3 models reveal three key findings: (i) SFT cold start is critical for unlocking initial interaction ability and enabling sustained RL improvements; (ii) deliberate trajectory scoring yields more efficient and effective multi-turn interactions; and (iii) while stronger simulated users (e.g., GPT-4o) facilitate training, open-source simulators (e.g., Qwen3-32B) remain a cost-effective and transferable option. Together, these results highlight that careful design of reward shaping and user simulation is as crucial as model scale, and establish UserRL as a practical pathway for developing robust user-centric agentic models. All code and data are publicly available for future research.
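As a rough sketch of the design space the abstract describes, turn-level rewards from a simulated user can be aggregated into a trajectory-level score and then normalized within a group of rollouts, as in GRPO. The helper names and the discounted-sum aggregation below are illustrative assumptions, not the paper's exact formulation:

```python
from statistics import mean, pstdev

def trajectory_score(turn_rewards, gamma=1.0):
    """Aggregate per-turn rewards into one trajectory-level score.

    gamma < 1 discounts later turns, favoring efficient interactions;
    this is just one of several possible aggregation schemes.
    """
    return sum(r * gamma**t for t, r in enumerate(turn_rewards))

def grpo_advantages(scores, eps=1e-8):
    """Group-relative advantages in the style of GRPO: each trajectory's
    score is normalized against the group of rollouts for the same task."""
    mu = mean(scores)
    sigma = pstdev(scores)
    return [(s - mu) / (sigma + eps) for s in scores]

# Hypothetical group of 4 rollouts with per-turn rewards from a simulated user.
group = [
    [0.0, 0.5, 1.0],   # slow but ultimately successful interaction
    [1.0, 1.0],        # efficient interaction
    [0.0, 0.0, 0.0],   # failed interaction
    [0.5, 0.5, 0.5],   # mediocre interaction
]
scores = [trajectory_score(turns, gamma=0.9) for turns in group]
advs = grpo_advantages(scores)
```

With discounting, the efficient rollout receives the highest advantage and the failed one the lowest, which is the kind of behavior the trajectory-scoring choices in the paper are designed to control.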