🤖 AI Summary
Existing fair recommendation methods often struggle to balance accuracy and fairness because they mistakenly treat observed interactions, which are contaminated by popularity and exposure biases, as genuine user preferences. This work frames that limitation as a state estimation failure and proposes a Denoising State Representation Module (DSRM), based on diffusion models, to recover users' true latent states. To further disentangle long-term fairness from short-term utility, the approach integrates hierarchical reinforcement learning (HRL). Evaluated in the high-fidelity simulation environments KuaiRec and KuaiRand, the method effectively disrupts the "rich-get-richer" feedback loop and achieves a superior Pareto frontier between recommendation utility and exposure fairness.
📝 Abstract
Interactive recommender systems (IRS) are increasingly optimized with Reinforcement Learning (RL) to capture the sequential nature of user-system dynamics. However, existing fairness-aware methods often suffer from a fundamental oversight: they assume the observed user state is a faithful representation of true preferences. In reality, implicit feedback is contaminated by popularity-driven noise and exposure bias, creating a distorted state that misleads the RL agent. We argue that the persistent conflict between accuracy and fairness is not merely a reward-shaping issue, but a state estimation failure. In this work, we propose **DSRM-HRL**, a framework that reformulates fairness-aware recommendation as a latent state purification problem followed by decoupled hierarchical decision-making. We introduce a Denoising State Representation Module (DSRM) based on diffusion models to recover the low-entropy latent preference manifold from high-entropy, noisy interaction histories. Built upon this purified state, a Hierarchical Reinforcement Learning (HRL) agent is employed to decouple conflicting objectives: a high-level policy regulates long-term fairness trajectories, while a low-level policy optimizes short-term engagement under these dynamic constraints. Extensive experiments on high-fidelity simulators (KuaiRec, KuaiRand) demonstrate that DSRM-HRL effectively breaks the "rich-get-richer" feedback loop, achieving a superior Pareto frontier between recommendation utility and exposure equity.
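The abstract's pipeline, denoise the observed state, then let a high-level policy constrain a low-level engagement policy, can be sketched in miniature. The sketch below is purely illustrative and assumes nothing from the paper beyond this decomposition: `denoise_state` is a toy stand-in for the learned reverse-diffusion DSRM (here a simple iterative shrinkage), and the two policies, the group-level `exposure` history, and the budget mechanics are all hypothetical names and simplifications, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise_state(noisy_state, n_steps=10, lr=0.3):
    """Toy stand-in for the DSRM: iteratively shrink a high-entropy
    observation toward a low-entropy latent estimate. (The real module
    would run a learned reverse diffusion process instead.)"""
    s = np.asarray(noisy_state, dtype=float).copy()
    for _ in range(n_steps):
        # each step pulls the state toward its smoothed version
        s = s - lr * (s - s.mean())
    return s

def high_level_policy(group_exposure):
    """High-level policy: derive a per-group exposure budget from the
    long-term fairness trajectory (here: inverse of cumulative share,
    so under-exposed groups receive a larger budget)."""
    share = group_exposure / group_exposure.sum()
    return 1.0 - share

def low_level_policy(user_state, item_embs, item_groups, budget):
    """Low-level policy: score items by short-term relevance, modulated
    by the high-level budget of each item's group, and pick the best."""
    relevance = item_embs @ user_state
    return int(np.argmax(relevance * budget[item_groups]))

# --- toy rollout ---
noisy = rng.normal(size=8) + 3.0 * rng.normal(size=8)  # biased observation
state = denoise_state(noisy)

items = rng.normal(size=(20, 8))       # 20 candidate item embeddings
groups = rng.integers(0, 2, size=20)   # popular (0) vs. niche (1) items
exposure = np.array([90.0, 10.0])      # popular group dominates history

budget = high_level_policy(exposure)   # niche group gets the larger budget
choice = low_level_policy(state, items, groups, budget)
```

The point of the decomposition is visible even in this toy: the low-level scorer never sees the fairness objective directly, it only sees the budget the high-level policy hands down, which is how the framework decouples long-term exposure equity from per-step engagement.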