EBaReT: Expert-guided Bag Reward Transformer for Auto Bidding

📅 2025-07-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
In automated bidding, logged data contain many suboptimal bids, and low click-through and conversion rates make rewards sparse and uncertain; together these degrade data quality and hinder the convergence of conventional reinforcement learning (RL). To address this, we propose EBaReT, a framework that models bidding as a sequential decision-making task. It combines training guided by expert trajectories, a Positive-Unlabeled (PU) learning discriminator that identifies expert-level state transitions, an expert-guided inference mechanism, and a bag-based cumulative reward function. EBaReT thus brings together generative RL, PU learning, a Transformer architecture, and bag-level reward shaping, with the aim of making decisions more robust and training more stable. Extensive experiments show that EBaReT outperforms state-of-the-art bidding methods, supporting its effectiveness under low-quality data and sparse feedback.

📝 Abstract
Reinforcement learning has been widely applied in automated bidding. Traditional approaches model bidding as a Markov Decision Process (MDP). Recently, some studies have explored generative reinforcement learning methods to address long-term dependency issues in bidding environments. Although effective, these methods typically rely on supervised learning, which is vulnerable to low data quality: the logged data contain a large number of sub-optimal bids, and rewards occur with low probability because click and conversion rates are low. Unfortunately, few studies have addressed these challenges. In this paper, we formalize automated bidding as a sequential decision-making problem and propose a novel Expert-guided Bag Reward Transformer (EBaReT) to address data quality and reward uncertainty. Specifically, to tackle data quality issues, we generate a set of expert trajectories to serve as supplementary training data and employ a Positive-Unlabeled (PU) learning-based discriminator to identify expert transitions. To ensure that decisions also reach expert level, we further design a novel expert-guided inference strategy. Moreover, to mitigate reward uncertainty, we treat the transitions within a certain period as a "bag" and carefully design a reward function that leads to a smoother acquisition of rewards. Extensive experiments demonstrate that our model achieves superior performance compared to state-of-the-art bidding methods.
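To make the "bag" idea concrete, here is a minimal sketch in Python. The abstract does not give the actual reward function, so this example makes its own assumptions: bags are fixed-length, non-overlapping windows, and every step in a bag receives the bag's mean reward. The function name `bag_rewards` and the parameter `bag_size` are hypothetical and are not taken from the paper.

```python
from typing import List


def bag_rewards(step_rewards: List[float], bag_size: int) -> List[float]:
    """Replace each step's reward with the mean reward of its bag.

    Assumption (not from the paper): bags are consecutive, non-overlapping
    windows of `bag_size` steps; the last bag may be shorter.
    """
    if bag_size <= 0:
        raise ValueError("bag_size must be positive")
    smoothed: List[float] = []
    for start in range(0, len(step_rewards), bag_size):
        bag = step_rewards[start:start + bag_size]
        bag_mean = sum(bag) / len(bag)
        # Every transition in the bag shares the same smoothed reward.
        smoothed.extend([bag_mean] * len(bag))
    return smoothed


# Sparse per-step rewards: most steps see no conversion at all.
sparse = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 2.0]
print(bag_rewards(sparse, bag_size=4))
# [0.25, 0.25, 0.25, 0.25, 0.5, 0.5, 0.5, 0.5]
```

Averaging within a bag keeps the total reward of the trajectory unchanged while giving every step a non-zero learning signal, which is one plausible reading of "a smoother acquisition of rewards".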
Problem

Research questions and friction points this paper is trying to address.

Addresses low data quality in automated bidding
Mitigates uncertainty in reward acquisition
Handles long-term dependencies in bidding decisions
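One common way generative, sequence-modelling RL methods capture long-term dependencies is to condition each action on the return still to be collected from that step onward. The sketch below computes these returns-to-go for one trajectory. It assumes undiscounted rewards and a Decision-Transformer-style input; the page only says that bidding is treated as a sequential decision-making problem, so this conditioning scheme and the name `returns_to_go` are illustrative assumptions rather than the paper's confirmed design.

```python
from typing import List


def returns_to_go(rewards: List[float]) -> List[float]:
    """Return, for each step t, the sum of rewards from t to the end.

    Assumption: no discounting, one finite trajectory (e.g. one budget period).
    """
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each entry accumulates all later rewards.
    for t in range(len(rewards) - 1, -1, -1):
        running += rewards[t]
        rtg[t] = running
    return rtg


print(returns_to_go([0.0, 0.0, 1.0, 0.0, 2.0]))
# [3.0, 3.0, 3.0, 2.0, 2.0]
```

A sequence model fed (return-to-go, state, action) tokens can attend over the whole history, so early bids are linked to rewards that arrive much later in the period.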
Innovation

Methods, ideas, or system contributions that make the work stand out.

Expert-guided trajectories supplement training data
PU learning discriminator identifies expert transitions
Bag-based reward function smooths reward acquisition
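The PU discriminator can be illustrated with a small loss computation. In a PU setting, expert transitions are the labelled positives and logged transitions are unlabelled, since they mix expert-level and sub-optimal bids. The sketch below uses the non-negative PU risk estimator with a sigmoid surrogate loss, a widely used choice for PU learning. Which estimator EBaReT actually uses is not stated on this page, so the estimator, the class prior `prior`, and the function names are all assumptions.

```python
import math
from typing import Sequence


def sigmoid_loss(score: float, label: int) -> float:
    """Sigmoid surrogate loss; small when the sign of `score` matches `label` (+1 or -1)."""
    return 1.0 / (1.0 + math.exp(label * score))


def nn_pu_risk(pos_scores: Sequence[float],
               unl_scores: Sequence[float],
               prior: float) -> float:
    """Non-negative PU risk for discriminator scores.

    pos_scores: scores on expert (positive) transitions.
    unl_scores: scores on logged (unlabelled) transitions.
    prior: assumed fraction of expert-level transitions in the unlabelled data.
    """
    def mean(values):
        return sum(values) / len(values)

    pos_as_pos = mean([sigmoid_loss(s, +1) for s in pos_scores])
    pos_as_neg = mean([sigmoid_loss(s, -1) for s in pos_scores])
    unl_as_neg = mean([sigmoid_loss(s, -1) for s in unl_scores])
    # Estimated risk on the hidden negatives; clamped at zero so the
    # estimate cannot go negative when the model overfits.
    neg_risk = unl_as_neg - prior * pos_as_neg
    return prior * pos_as_pos + max(0.0, neg_risk)


# Hypothetical scores: experts score high, most logged transitions score low.
print(nn_pu_risk(pos_scores=[3.0, 4.0], unl_scores=[-3.0, -4.0, 3.5], prior=0.3))
```

Minimising this risk trains the discriminator to score expert transitions high without ever needing transitions labelled as definitely non-expert.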
Authors
Kaiyuan Li, Beijing University of Posts and Telecommunications (Sequential Recommendation, Large Recommendation Model, Computational Advertising)
Pengyu Wang, Kuaishou Technology, Beijing, China
Yunshan Peng, Kuaishou Technology, Beijing, China
Pengjia Yuan, Kuaishou Technology, Beijing, China
Yanxiang Zeng, Kuaishou Technology, Beijing, China
Rui Xiang, Kuaishou Technology, Beijing, China
Yanhua Cheng, Kuaishou (Computer Vision, Machine Learning, Recommendation)
Xialong Liu, Kuaishou Technology (Machine Learning, Recommendation)
Peng Jiang, Kuaishou Technology, Beijing, China