From Correctness to Preference: A Framework for Personalized Agentic Reinforcement Learning

📅 2026-05-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of enabling real-world agents to adapt their behavior to heterogeneous user preferences. Conventional agentic reinforcement learning supports this poorly for three reasons: generic rewards cannot capture user-specific preferences, observed behaviors are entangled with conformity effects, and flat memories cannot support personalized skill retrieval. To overcome these limitations, the authors propose a unified personalized agentic reinforcement learning framework. Central to this framework is the PARPO algorithm, which decouples generic task-quality rewards from user-specific preference rewards and uses user-specific anchors to stabilize learning under heterogeneous reward scales. The authors also introduce a two-stage preference-disentangled reward model and a Preference-Aligned Skill Evolution Graph Memory (PSGM), which together close the loop between preference identification, policy optimization, and skill retrieval. On the ETAPP, ETAPP-Hard, and SJAgent benchmarks, the framework is reported to consistently outperform strong memory-augmented and reinforcement learning baselines.

📝 Abstract

Agentic reinforcement learning (Agentic RL) has achieved strong progress in tasks with clear success signals. However, many real-world agent applications require user-conditioned behavior: the same query may call for different planning strategies and tool-use decisions across users. This setting raises key challenges: generic rewards cannot capture heterogeneous user preferences, observed behaviors are entangled with conformity effects, and flat memories cannot support personalized skill retrieval. To this end, we propose a unified personalized Agentic RL framework that embeds personalization into training-time optimization. At its core is Personalized Anchor Reward-Decoupled Policy Optimization (PARPO), which decouples generic task-quality rewards from personalized preference rewards and uses user-specific anchors to stabilize learning under heterogeneous reward scales. We further introduce a two-stage preference-disentangled reward model and Preference-Aligned Skill Evolution Graph Memory (PSGM) for personalized supervision and preference-aligned skill retrieval. Together, they form a closed loop of preference identification, policy optimization, and structured skill accumulation. Experiments on ETAPP, ETAPP-Hard, and SJAgent show that our framework consistently outperforms strong memory and RL baselines. Code and data are included in the supplementary materials.
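
The abstract does not give PARPO's equations, so the following is only an illustrative sketch of what "reward decoupling with user-specific anchors" could look like, not the authors' implementation. It assumes a group of rollouts for one query and one user, normalizes the task reward within the group, and normalizes the preference reward against a per-user running baseline. The names `UserAnchor`, `group_normalize`, `decoupled_advantages`, and the `pref_weight` and `momentum` parameters are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class UserAnchor:
    """Hypothetical per-user running statistics of preference rewards."""
    mean: float = 0.0
    var: float = 1.0
    momentum: float = 0.9  # assumed smoothing factor, not from the paper

    def update(self, rewards):
        # Exponential moving average of this user's reward mean and variance.
        batch_mean = mean(rewards)
        batch_var = pstdev(rewards) ** 2
        self.mean = self.momentum * self.mean + (1 - self.momentum) * batch_mean
        self.var = self.momentum * self.var + (1 - self.momentum) * batch_var


def group_normalize(values, eps=1e-6):
    """Standardize values within one rollout group."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def decoupled_advantages(task_rewards, pref_rewards, anchor,
                         pref_weight=0.5, eps=1e-6):
    """Combine separately normalized task and preference signals.

    Assumption: the task term uses within-group statistics, while the
    preference term is centered and scaled by the user's anchor so that
    users with different reward scales contribute comparable signals.
    """
    task_adv = group_normalize(task_rewards)
    scale = anchor.var ** 0.5 + eps
    pref_adv = [(r - anchor.mean) / scale for r in pref_rewards]
    return [t + pref_weight * p for t, p in zip(task_adv, pref_adv)]


# Four rollouts for one user: two succeed on the task, two match the preference.
anchor = UserAnchor(mean=0.5, var=0.09)
adv = decoupled_advantages([1.0, 0.0, 1.0, 0.0], [0.2, 0.2, 0.8, 0.8], anchor)
anchor.update([0.2, 0.2, 0.8, 0.8])
```

In this toy setup the rollout that both succeeds and matches the user's preference receives the largest advantage, and a successful but off-preference rollout still ranks above a failed one because the task term carries the larger weight. How the actual method weights or combines the two signals is not stated in the abstract.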

Problem

Research questions and friction points this paper is trying to address.

personalized reinforcement learning

user preferences

agentic AI

preference modeling

heterogeneous rewards

Innovation

Methods, ideas, or system contributions that make the work stand out.

Personalized Reinforcement Learning

Reward Decoupling

Preference Disentanglement