🤖 AI Summary
In high-stakes offline reinforcement learning (RL) for online gaming recommendation, counterfactual action estimation fails because state spaces overlap only sparsely across policies and experiment-path selection biases the logged data. To address this, we propose FAST-Q, a novel offline RL framework tailored to online game recommendation. Our contributions are threefold: (1) gradient reversal learning to construct adversarially balanced state representations, mitigating distributional shift; (2) a Q-value decomposition-based multi-objective optimization mechanism that jointly improves the interpretability and conservatism of recommendations; and (3) integration of adversarial regularization with improved conservative Q-learning, enabling offline counterfactual exploration in parallel with static-data exploitation. Deployed on a real-world, volatile gaming platform, our method achieves a 0.15% increase in player returns, a 2% uplift in user lifetime value (LTV), gains of 0.4% in recommendation-driven engagement and 2% in platform session duration, and a 10% reduction in recommendation cost.
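A gradient reversal layer is typically implemented as an identity map in the forward pass whose backward pass negates (and scales) the incoming gradient, so the upstream state encoder is trained adversarially against a policy discriminator and learns policy-balanced representations. The sketch below is a minimal NumPy illustration of that generic mechanism under stated assumptions; the discriminator, the scale `lam`, and the manual backward pass are illustrative, not FAST-Q's actual module.

```python
import numpy as np

class GradientReversal:
    """Identity in the forward pass; scales gradients by -lam in the backward pass."""
    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return x

    def backward(self, grad_output):
        return -self.lam * grad_output

# Tiny demo: state representations -> GRL -> linear "policy discriminator".
rng = np.random.default_rng(0)
z = rng.normal(size=(4, 3))      # batch of 4 state representations
w = rng.normal(size=(3,))        # discriminator weights (hypothetical)
grl = GradientReversal(lam=0.5)

h = grl.forward(z)               # identity: h == z
logits = h @ w                   # discriminator scores per state
# Gradient of sum(logits) w.r.t. h is w, broadcast over the batch.
grad_h = np.tile(w, (4, 1))
grad_z = grl.backward(grad_h)    # sign-flipped gradient reaching the encoder

assert np.allclose(grad_z, -0.5 * grad_h)
```

Because the encoder receives the negated discriminator gradient, it is pushed to make representations from which the behavior policy cannot be identified, which is the balancing effect the summary describes.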
📝 Abstract
Recent advances in state-of-the-art (SOTA) offline reinforcement learning (RL) have primarily focused on function approximation errors, which cause overestimation of Q-values for out-of-distribution actions, a challenge that static datasets exacerbate. However, high-stakes applications such as recommendation systems in online gaming introduce further complexities: player psychology (intent) is driven by gameplay experience, and the platform is inherently volatile. These factors create highly sparse, partially overlapping state spaces across policies, further skewed by the experiment-path selection logic, which biases state spaces toward specific policies. Because they generalize poorly across unobserved states, current SOTA methods clip known counterfactual actions as out-of-distribution when learning from such offline data, further aggravating the conservatism of Q-learning and necessitating more online exploration. FAST-Q introduces a novel approach that (1) leverages Gradient Reversal Learning to construct balanced state representations, regularizing the policy-specific bias between the player's state and action and thereby enabling counterfactual estimation; (2) supports offline counterfactual exploration in parallel with static-data exploitation; and (3) proposes a Q-value decomposition strategy for multi-objective optimization, facilitating explainable recommendations over short- and long-term objectives. On our volatile gaming platform, FAST-Q outperforms prior SOTA approaches, delivering at least a 0.15% increase in player returns, a 2% improvement in lifetime value (LTV), a 0.4% enhancement in recommendation-driven engagement, a 2% improvement in the player's platform dwell time, and a 10% reduction in recommendation costs.
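The abstract's Q-value decomposition can be pictured as maintaining one Q-head per objective and scoring actions with a weighted sum, so each recommendation can be explained by the per-objective contributions. The snippet below is a hypothetical sketch of that idea; the objectives, values, and weights are invented for illustration and are not the paper's actual decomposition.

```python
import numpy as np

# Hypothetical per-objective Q-heads for three candidate actions in one state:
# short-term engagement, long-term LTV, and recommendation cost (penalized).
q_engagement = np.array([0.8, 0.2, 0.5])
q_ltv        = np.array([0.4, 0.9, 0.1])
q_cost       = np.array([0.6, 0.1, 0.3])

# Illustrative trade-off weights; cost enters with a negative sign.
weights = {"engagement": 1.0, "ltv": 1.0, "cost": -0.5}
q_total = (weights["engagement"] * q_engagement
           + weights["ltv"] * q_ltv
           + weights["cost"] * q_cost)

best_action = int(np.argmax(q_total))
# Decomposed heads make the choice inspectable: each objective's contribution
# to q_total can be reported alongside the recommended action.
print(q_total, best_action)  # → [0.9  1.05 0.45] 1
```

Here action 1 wins on long-term value despite weaker short-term engagement, which is the kind of short- vs long-term trade-off the abstract says the decomposition makes explainable.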