🤖 AI Summary
In multi-objective recommendation, conflicting objectives and heterogeneous user preferences cause the Pareto frontier to vary across contexts, so static weighted approaches cannot capture this contextual heterogeneity. To address this, the paper proposes Deep Pareto Reinforcement Learning (DeepPRL), which tightly integrates Pareto optimization with deep reinforcement learning. DeepPRL jointly models context-aware, heterogeneous inter-objective relationships to learn Pareto-optimal policies over both short- and long-term horizons, enabling dynamic objective trade-offs, personalized preference modeling, and online policy adaptation. In offline experiments on multiple benchmarks, it significantly outperforms state-of-the-art methods. Furthermore, in an online controlled experiment on Alibaba's video streaming platform, it simultaneously improved three conflicting metrics—click-through rate, watch time, and paid conversion rate—demonstrating its effectiveness and practical business value.
📝 Abstract
Optimizing multiple objectives simultaneously is an important task for recommendation platforms to improve their performance. However, this task is particularly challenging because the relationships between different objectives are heterogeneous across consumers and fluctuate dynamically with context. Especially when objectives conflict with each other, the recommendation results form a Pareto frontier, where improving any one objective comes at the cost of degrading another. Existing multi-objective recommender systems do not systematically consider such dynamic relationships; instead, they balance these objectives in a static and uniform manner, resulting in suboptimal multi-objective recommendation performance. In this paper, we propose a Deep Pareto Reinforcement Learning (DeepPRL) approach, in which we (1) comprehensively model the complex relationships between multiple objectives in recommendation; (2) effectively capture personalized and contextual consumer preferences for each objective to provide better recommendations; and (3) optimize both the short-term and the long-term performance of multi-objective recommendation. As a result, our method achieves significant Pareto dominance over state-of-the-art baselines in offline experiments. Furthermore, we conducted a controlled experiment on the video streaming platform of Alibaba, where our method significantly improved three conflicting business objectives simultaneously over the latest production system, demonstrating its tangible economic impact in practice.
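To make the Pareto-frontier notion concrete, here is a minimal illustrative sketch (not the paper's algorithm): a candidate recommendation is Pareto-dominated if another candidate is at least as good on every objective and strictly better on at least one. The `(CTR, watch time, conversion)` score tuples below are hypothetical.

```python
# Illustrative sketch of Pareto dominance among recommendation candidates,
# each scored on several objectives that are all to be maximized.
from typing import List, Tuple

def dominates(a: Tuple[float, ...], b: Tuple[float, ...]) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_frontier(points: List[Tuple[float, ...]]) -> List[Tuple[float, ...]]:
    """Keep only the non-dominated points: the Pareto frontier."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

# Hypothetical (CTR, watch-time, conversion) scores for four candidates:
candidates = [(0.9, 0.2, 0.1), (0.5, 0.5, 0.5), (0.4, 0.4, 0.4), (0.1, 0.9, 0.3)]
print(pareto_frontier(candidates))
# → [(0.9, 0.2, 0.1), (0.5, 0.5, 0.5), (0.1, 0.9, 0.3)]
# (0.4, 0.4, 0.4) drops out: it is dominated by (0.5, 0.5, 0.5).
```

On the frontier itself, no candidate can be improved on one objective without losing on another, which is why a single static weighting cannot represent every desirable trade-off.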