DARLR: Dual-Agent Offline Reinforcement Learning for Recommender Systems with Dynamic Reward

📅 2025-05-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
In offline recommendation, inaccurate reward functions often cause policy degradation due to error propagation from static reward tables. Method: This paper proposes a dual-agent dynamic reward shaping framework comprising a selector–recommender cooperative mechanism: the selector dynamically identifies reference users based on user similarity and diversity, while the recommender performs policy optimization integrated with a world model; additionally, a statistics-driven uncertainty penalty mechanism is designed to adaptively calibrate reward estimation bias. The approach unifies offline reinforcement learning, dynamic reward modeling, and uncertainty-aware learning to mitigate reward mis-specification. Contribution/Results: Evaluated on four benchmark datasets, the method significantly outperforms state-of-the-art baselines, demonstrating that dynamic reward shaping coupled with uncertainty calibration substantially enhances policy robustness and recommendation performance.

📝 Abstract
Model-based offline reinforcement learning (RL) has emerged as a promising approach for recommender systems, enabling effective policy learning by interacting with frozen world models. However, the reward functions in these world models, trained on sparse offline logs, often suffer from inaccuracies. Specifically, existing methods face two major limitations in addressing this challenge: (1) deterministic use of reward functions as static look-up tables, which propagates inaccuracies during policy learning, and (2) static uncertainty designs that fail to effectively capture decision risks and mitigate the impact of these inaccuracies. In this work, a dual-agent framework, DARLR, is proposed to dynamically update world models to enhance recommendation policies. To achieve this, a *selector* is introduced to identify reference users by balancing similarity and diversity so that the *recommender* can aggregate information from these users and iteratively refine reward estimations for dynamic reward shaping. Further, the statistical features of the selected users guide the dynamic adaptation of an uncertainty penalty to better align with evolving recommendation requirements. Extensive experiments on four benchmark datasets demonstrate the superior performance of DARLR, validating its effectiveness. The code is available at https://github.com/ArronDZhang/DARLR.
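To make the two ideas in the abstract concrete, here is a minimal sketch of (1) reference-user selection that trades off similarity to the target user against diversity among the chosen references, and (2) a statistics-driven uncertainty penalty on the aggregated reward. This is an illustrative approximation, not the paper's method: DARLR's selector is a learned agent, whereas this sketch uses a greedy MMR-style rule over user embeddings, and the penalty form (mean minus scaled standard deviation) is an assumption.

```python
import numpy as np

def select_reference_users(target, users, k=3, lam=0.5):
    """Greedily pick k reference users, balancing similarity to the
    target (weight lam) against redundancy with already-chosen users
    (weight 1 - lam). Illustrative stand-in for DARLR's learned selector."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    chosen, candidates = [], list(range(len(users)))
    while candidates and len(chosen) < k:
        best, best_score = None, -np.inf
        for i in candidates:
            sim = cos(target, users[i])
            red = max((cos(users[i], users[j]) for j in chosen), default=0.0)
            score = lam * sim - (1 - lam) * red
            if score > best_score:
                best, best_score = i, score
        chosen.append(best)
        candidates.remove(best)
    return chosen

def shaped_reward(ref_rewards, beta=1.0):
    """Aggregate reference rewards and subtract a penalty proportional
    to their spread -- an assumed form of a statistics-driven
    uncertainty penalty (higher disagreement -> lower shaped reward)."""
    r = np.asarray(ref_rewards, dtype=float)
    return float(r.mean() - beta * r.std())

rng = np.random.default_rng(0)
users = rng.normal(size=(10, 4))          # toy user embeddings
refs = select_reference_users(users[0], users[1:], k=3)
print(shaped_reward([0.8, 1.0, 0.9]))     # penalized reward estimate
```

The intended behavior: when the selected reference users agree on the reward, the penalty vanishes and the recommender trusts the estimate; when they disagree, the estimate is discounted, which is the risk-aware effect the paper attributes to its adaptive penalty.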
Problem

Research questions and friction points this paper is trying to address.

Inaccurate reward functions in offline RL for recommender systems
Static uncertainty designs fail to capture decision risks
Dynamic reward shaping needed for better recommendation policies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-agent framework dynamically updates world models
Selector balances similarity and diversity for reference users
Dynamic uncertainty penalty adapts to recommendation requirements
Authors
Yi Zhang
The University of Queensland, CSIRO DATA61, Brisbane, Australia
Ruihong Qiu
ARC DECRA Fellow, Lecturer (Assistant Professor), The University of Queensland
Xuwei Xu
The University of Queensland, Brisbane, Australia
Jiajun Liu
CSIRO DATA61, The University of Queensland, Brisbane, Australia
Sen Wang
The University of Queensland, Brisbane, Australia