From Correctness to Preference: A Framework for Personalized Agentic Reinforcement Learning

2026-05-22

AI Summary
This work addresses the challenge of enabling real-world intelligent agents to adapt their behavior to heterogeneous user preferences. Conventional reinforcement learning supports this poorly: generic rewards do not capture personalized preferences, observed behavior is entangled with conformity effects, and flat memories cannot support personalized skill retrieval. To overcome these limitations, the authors propose a unified personalized agentic reinforcement learning framework. Central to this framework is the PARPO algorithm, which decouples generic task-quality rewards from user-specific preference rewards and uses user-specific anchors to stabilize learning under heterogeneous reward scales. The authors also introduce a two-stage preference-disentangled reward model and a Preference-Aligned Skill Evolution Graph Memory (PSGM), closing the loop between preference identification, policy optimization, and skill retrieval. On the ETAPP, ETAPP-Hard, and SJAgent benchmarks, the framework consistently outperforms strong memory-augmented and reinforcement learning baselines.
Abstract
Agentic reinforcement learning (Agentic RL) has achieved strong progress in tasks with clear success signals. However, many real-world agent applications require user-conditioned behavior: the same query may call for different planning strategies and tool-use decisions across users. This setting raises key challenges: generic rewards cannot capture heterogeneous user preferences, observed behaviors are entangled with conformity effects, and flat memories cannot support personalized skill retrieval. To this end, we propose a unified personalized Agentic RL framework that embeds personalization into training-time optimization. At its core is Personalized Anchor Reward-Decoupled Policy Optimization (PARPO), which decouples generic task-quality rewards from personalized preference rewards and uses user-specific anchors to stabilize learning under heterogeneous reward scales. We further introduce a two-stage preference-disentangled reward model and Preference-Aligned Skill Evolution Graph Memory (PSGM) for personalized supervision and preference-aligned skill retrieval. Together, they form a closed loop of preference identification, policy optimization, and structured skill accumulation. Experiments on ETAPP, ETAPP-Hard, and SJAgent show that our framework consistently outperforms strong memory and RL baselines. Code and data are included in the supplementary materials.
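The abstract does not give PARPO's update rule, so the following is only an illustrative sketch of the general idea of reward decoupling with a per-user anchor, not the authors' algorithm. It assumes a group of rollouts for one query and one user, standardizes the generic task reward within the group, and centers the preference reward on a user-specific anchor. The function names, the `pref_weight` mixing coefficient, and the choice of anchor (for example a running mean and spread of that user's past preference rewards) are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def standardize(values, eps=1e-8):
    """Center and scale rewards within one rollout group."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def decoupled_advantages(task_rewards, pref_rewards, user_anchor,
                         user_scale, pref_weight=0.5, eps=1e-8):
    """Combine two separately normalized reward streams (hypothetical helper).

    task_rewards: generic task-quality rewards, one per rollout.
    pref_rewards: personalized preference rewards, one per rollout.
    user_anchor:  assumed per-user baseline for preference rewards.
    user_scale:   assumed per-user spread of preference rewards.
    """
    # Generic stream: normalized against the other rollouts in the group.
    task_adv = standardize(task_rewards, eps)
    # Personalized stream: normalized against this user's own anchor, so
    # users whose rewards live on different scales become comparable.
    pref_adv = [(r - user_anchor) / (user_scale + eps) for r in pref_rewards]
    return [t + pref_weight * p for t, p in zip(task_adv, pref_adv)]


# Two hypothetical users whose preference rewards differ in scale by 10x.
adv_a = decoupled_advantages([1.0, 0.0], [0.8, 0.2], user_anchor=0.5, user_scale=0.3)
adv_b = decoupled_advantages([1.0, 0.0], [8.0, 2.0], user_anchor=5.0, user_scale=3.0)
```

In this toy setup both users end up with nearly identical advantages despite the tenfold difference in raw preference reward scale, which is the kind of stabilization the abstract attributes to user-specific anchors. How the paper actually defines and updates its anchors would need to be checked against the full text.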
Problem

Research questions and friction points this paper is trying to address.

personalized reinforcement learning
user preferences
agentic AI
preference modeling
heterogeneous rewards
Innovation

Methods, ideas, or system contributions that make the work stand out.

Personalized Reinforcement Learning
Reward Decoupling
Preference Disentanglement
Skill Evolution Graph
Agentic AI