🤖 AI Summary
Addressing the challenge of balancing world-model bias mitigation with policy diversity in offline reinforcement-learning-based recommendation, this paper proposes R3S, a reward redistribution framework with a decaying penalty. R3S integrates model uncertainty estimation to jointly model diversity at both the local (state-transition) and global (user interaction sequence) levels, employing multi-scale regularization for reward redistribution. This design explicitly alleviates inherent world-model bias while enhancing recommendation diversity. Experiments show that R3S significantly improves world-model prediction accuracy, reducing average prediction error by 18.7%, and better captures heterogeneous user preferences. It achieves state-of-the-art performance across multiple offline recommendation benchmarks, empirically validating the synergy between bias correction and diversity enhancement.
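As a rough sketch of the reallocation idea described above, the snippet below shapes a world model's reward estimate by subtracting an ensemble-disagreement penalty. The function name `redistribute_reward`, the weight `lam`, and the use of the ensemble standard deviation as the uncertainty proxy are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def redistribute_reward(reward_preds, lam=1.0):
    """Uncertainty-penalized reward, in the spirit of R3S's reallocation.

    reward_preds: (n_models,) array of reward predictions from an
    ensemble world model for one (state, action) pair. The ensemble
    mean is the point estimate; the standard deviation serves as an
    uncertainty proxy that down-weights unreliable predictions.
    `lam` is an illustrative penalty weight, not a value from the paper.
    """
    mean = reward_preds.mean()
    uncertainty = reward_preds.std()
    return mean - lam * uncertainty

# Example: three ensemble heads disagree on the reward for one action,
# so the shaped reward falls below the raw mean of ~0.733.
preds = np.array([0.8, 0.5, 0.9])
print(redistribute_reward(preds))
```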
📝 Abstract
Offline reinforcement learning (RL) has emerged as a prevalent and effective methodology for real-world recommender systems, enabling policies to be learned from historical data while capturing user preferences. Reward shaping in offline RL faces significant challenges: past efforts have incorporated uncertainty-aware priors to improve world models or to penalize underexplored state-action pairs. Despite these efforts, a critical gap remains in simultaneously balancing the intrinsic biases of world models against the diversity of recommended policies. To address this limitation, we present an offline RL framework termed Reallocated Reward for Recommender Systems (R3S). By integrating the model's inherent uncertainty, we counter intrinsic fluctuations in reward predictions and promote diversity in decision-making toward a more interactive paradigm, incorporating extra penalizers with decay that deter actions leading to diminished state variety at both local and global scales. Experimental results demonstrate that R3S improves the accuracy of world models and efficiently harmonizes users' heterogeneous preferences.
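To make the decaying penalizer concrete, here is a minimal sketch of a diversity penalty computed at both the local (state-transition) and global (interaction-sequence) scales, assuming cosine similarity as the state-variety proxy and an exponential decay schedule; `gamma`, `beta`, and the exact combination rule are hypothetical choices, not the paper's formulation.

```python
import numpy as np

def diversity_penalty(states, new_state, step, gamma=0.99, beta=0.5):
    """Decaying penalty for actions that shrink state variety.

    states:    (t, d) array, the user's interaction sequence so far
    new_state: (d,) array, the state reached by the candidate action
    step:      interaction step, used to decay the penalty over time
    """
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

    local = cos(states[-1], new_state)                       # state-transition scale
    global_ = np.mean([cos(s, new_state) for s in states])   # whole-sequence scale
    decay = gamma ** step                                    # penalty fades with decay
    return decay * (beta * local + (1.0 - beta) * global_)

# Usage: subtract the penalty from the world model's reward estimate,
# so actions that collapse state variety are deterred early on.
rng = np.random.default_rng(0)
seq = rng.normal(size=(5, 8))    # five past states of dimension 8
cand = rng.normal(size=8)        # candidate next state
print(diversity_penalty(seq, cand, step=5))
```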