Multi-task Offline Reinforcement Learning for Online Advertising in Recommender Systems

📅 2025-06-29
🤖 AI Summary
Offline reinforcement learning (RL) for ad recommendation under sparse rewards suffers from overestimation bias, distributional shift, and inadequate modeling of budget constraints. Method: We propose a causal Markov decision process framework tailored for advertising decisions, featuring causal state encoding and conditional sequence modeling. We design a multi-task offline RL architecture incorporating causal attention mechanisms to jointly optimize channel recommendation and dynamic budget allocation, while enabling policy decoupling. Contribution/Results: Our method significantly outperforms state-of-the-art baselines in both offline evaluation and large-scale online A/B tests. It effectively mitigates overestimation and distributional shift, yielding consistent improvements in core metrics—including CTR, CVR, and ROI—as well as overall system revenue.

📝 Abstract
Online advertising in recommendation platforms has gained significant attention, with a predominant focus on channel recommendation and budget allocation strategies. However, current offline reinforcement learning (RL) methods face substantial challenges when applied to sparse advertising scenarios, primarily due to severe overestimation, distributional shifts, and overlooking budget constraints. To address these issues, we propose MTORL, a novel multi-task offline RL model that targets two key objectives. First, we establish a Markov Decision Process (MDP) framework specific to the nuances of advertising. Then, we develop a causal state encoder to capture dynamic user interests and temporal dependencies, facilitating offline RL through conditional sequence modeling. Causal attention mechanisms are introduced to enhance user sequence representations by identifying correlations among causal states. We employ multi-task learning to decode actions and rewards, simultaneously addressing channel recommendation and budget allocation. Notably, our framework includes an automated system for integrating these tasks into online advertising. Extensive experiments on offline and online environments demonstrate MTORL's superiority over state-of-the-art methods.
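The abstract's pipeline (causal state encoding with masked attention over a user interaction sequence, then multi-task decoding of a channel action and a budget signal) can be illustrated with a minimal numpy sketch. This is not the paper's implementation: the shapes, random weights, and the two linear heads (`W_channel`, `w_budget`) are illustrative assumptions; the only property it demonstrates is the causal mask, under which position t attends only to positions ≤ t.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(X, Wq, Wk, Wv):
    """Masked self-attention: position t attends only to positions <= t."""
    T, d = X.shape
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # future positions
    scores[mask] = -np.inf                            # masked out pre-softmax
    A = softmax(scores, axis=-1)
    return A @ V, A

# Toy user interaction sequence: T steps of d-dim state embeddings.
T, d, n_channels = 6, 8, 4
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))
H, A = causal_attention(X, Wq, Wk, Wv)

# Two hypothetical task heads decoded from the shared representation,
# mirroring the channel-recommendation / budget-allocation split.
W_channel = rng.normal(size=(d, n_channels))
w_budget = rng.normal(size=(d,))
channel_logits = H @ W_channel                        # per-step channel scores
budget_fraction = 1 / (1 + np.exp(-(H @ w_budget)))   # per-step budget share in (0, 1)
```

Because the mask zeroes attention to future steps, each row of `A` is a distribution over past and current positions only, which is what makes the encoder usable for sequential (offline RL-style) decision making.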
Problem

Research questions and friction points this paper is trying to address.

Addresses overestimation and distributional shift in offline RL for ads
Develops causal state encoder to model dynamic user interests
Integrates multi-task learning for channel and budget decisions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-task offline RL for ad recommendations
Causal state encoder for dynamic user interests
Automated system integrating tasks online
Langming Liu
PhD, City University of Hong Kong
Recommendation · Large Language Models · Federated Learning
Wanyu Wang
Southern University of Science and Technology, City University of Hong Kong
Chi Zhang
Harbin Engineering University
Bo Li
City University of Hong Kong
Hongzhi Yin
Professor and ARC Future Fellow, University of Queensland
Recommender System · Graph Learning · Spatial-temporal Prediction · Edge Intelligence · LLM
Xuetao Wei
Associate Professor, Southern University of Science and Technology
AI Ethics · AI Safety
Wenbo Su
Taobao & Tmall Group of Alibaba
Bo Zheng
Taobao & Tmall Group of Alibaba
Xiangyu Zhao
City University of Hong Kong