🤖 AI Summary
This study addresses sequential decision optimization in notification push systems, aiming to balance message utility and user fatigue. We propose a multi-objective offline reinforcement learning framework based on the Decision Transformer architecture. First, we employ quantile regression to model return-to-go, enhancing robustness in long-term reward estimation. Second, we design a non-episodic multi-reward mechanism that explicitly decouples utility and fatigue signals. Third, we develop a ring-buffer-based sequence processing system enabling near-real-time inference and interpretable analysis. By reformulating policy learning as conditional supervised learning, our approach achieves efficient multi-objective optimization in high-dimensional recommendation settings. Online A/B tests conducted in LinkedIn’s production environment demonstrate that our method improves session count by 0.72% over a baseline multi-objective Conservative Q-Learning (CQL) approach, while significantly enhancing notification relevance, user engagement, and long-term activity—without exacerbating user fatigue.
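The quantile-regression treatment of return-to-go can be illustrated with the standard pinball loss. The summary does not specify the exact formulation used in the paper, so the sketch below is a generic, minimal version: it estimates a single quantile of observed returns by subgradient descent on the pinball loss, which is what makes the conditioning target robust to heavy-tailed return distributions. All names (`pinball_loss`, `fit_quantile`) and the synthetic exponential returns are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def pinball_loss(errors, tau):
    # Pinball (quantile) loss over errors = y_true - y_pred.
    # Penalizes under-prediction by tau and over-prediction by (1 - tau),
    # so its minimizer is the tau-quantile rather than the mean.
    return np.mean(np.maximum(tau * errors, (tau - 1) * errors))

def fit_quantile(returns, tau, lr=1.0, steps=3000):
    # Estimate the tau-quantile of observed returns by subgradient
    # descent on the pinball loss (scalar parameter q).
    q = 0.0
    for _ in range(steps):
        # Subgradient: -tau where return > q, (1 - tau) otherwise.
        grad = np.mean(np.where(returns > q, -tau, 1 - tau))
        q -= lr * grad
    return q

# Heavy-tailed synthetic "returns" stand in for long-horizon rewards.
rng = np.random.default_rng(0)
returns = rng.exponential(scale=10.0, size=5000)
q90 = fit_quantile(returns, tau=0.9)  # robust return-to-go target
```

Conditioning on an upper quantile of return-to-go (rather than the mean) gives the Decision Transformer an optimistic but attainable target, which is one plausible motivation for a quantile formulation.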
📝 Abstract
Notifications are an important communication channel for delivering timely and relevant information. Optimizing their delivery involves addressing complex sequential decision-making challenges under constraints such as message utility and user fatigue. Offline reinforcement learning (RL) methods, such as Conservative Q-Learning (CQL), have been applied to this problem but face practical challenges at scale, including instability, sensitivity to distribution shifts, limited reproducibility, and difficulties with explainability in high-dimensional recommendation settings. We present a Decision Transformer (DT) based framework that reframes policy learning as return-conditioned supervised learning, improving robustness, scalability, and modeling flexibility. Our contributions include a real-world comparison with CQL, a multi-reward design suitable for non-episodic tasks, a quantile regression approach to return-to-go conditioning, and a production-ready system with circular-buffer-based sequence processing for near-real-time inference. Extensive offline and online experiments in a deployed notification system show that our approach improves notification utility and overall session activity while minimizing user fatigue. Compared to a multi-objective CQL-based agent, the DT-based approach achieved a +0.72% increase in sessions for notification decision-making at LinkedIn by making notification recommendations more relevant.
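The circular-buffer sequence processing mentioned above can be sketched with a fixed-capacity buffer that keeps only the most recent trajectory steps per user, evicting the oldest step once the transformer's context window is full. The abstract does not describe the production system's internals, so this is a minimal sketch under assumed names (`UserSequenceBuffer`, a `(return_to_go, state, action)` step layout) using Python's `collections.deque`.

```python
from collections import deque

class UserSequenceBuffer:
    """Fixed-capacity ring buffer of recent (return_to_go, state, action) steps.

    deque(maxlen=...) gives O(1) appends with automatic eviction of the
    oldest step, so a per-user buffer always holds at most one context
    window of trajectory for near-real-time DT inference.
    """

    def __init__(self, context_len=20):
        self.steps = deque(maxlen=context_len)

    def append(self, rtg, state, action):
        # Record one decision step; the oldest step is dropped if full.
        self.steps.append((rtg, state, action))

    def context(self):
        # Trajectory oldest-first, ready to tokenize for the transformer.
        return list(self.steps)

# Usage: stream decision steps in; the buffer retains only the window.
buf = UserSequenceBuffer(context_len=20)
buf.append(3.2, state={"badge_count": 4}, action="send")
```

Bounding per-user state to one context window keeps memory constant and makes the retained sequence directly inspectable, which is consistent with the interpretability and near-real-time claims above.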