Q-Regularized Generative Auto-Bidding: From Suboptimal Trajectories to Optimal Policies

📅 2026-01-06

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work proposes QGA, which the authors present as the first approach to integrate Q-value regularization with generative auto-bidding. It targets the limitations of existing auto-bidding methods, which often rely on complex architectures and extensive hyperparameter tuning and are hindered by suboptimal historical trajectories. Built on the Decision Transformer framework, QGA adds Q-value regularization with double Q-learning, so that policy imitation and action-value maximization are optimized jointly. It also adds a Q-value guided dual-exploration mechanism that conditions the model on multiple return-to-go targets and locally perturbed actions. This design mitigates the adverse influence of suboptimal data and allows safe exploration beyond the data distribution. Experiments on public benchmarks and simulation environments show superior or highly competitive results, and large-scale A/B tests show a 3.27% increase in advertising GMV and a 2.49% improvement in ROI.
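The joint objective described in the summary can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `q_regularized_loss`, the weight `alpha`, the mean-squared imitation term, and the use of the minimum of two critics as the double-Q estimate are all assumptions made for illustration.

```python
from typing import Callable, Sequence

# Hypothetical critic signature: (state, action) -> estimated action value.
QFn = Callable[[Sequence[float], float], float]


def q_regularized_loss(
    states: Sequence[Sequence[float]],
    dataset_actions: Sequence[float],
    policy_actions: Sequence[float],
    q1: QFn,
    q2: QFn,
    alpha: float = 0.5,  # assumed trade-off weight, not taken from the paper
) -> float:
    """Imitation error minus a weighted, conservative Q-value estimate."""
    n = len(states)
    # Imitation term: stay close to the logged bidding actions.
    imitation = sum(
        (p - a) ** 2 for p, a in zip(policy_actions, dataset_actions)
    ) / n
    # Q term: take the smaller of two critics to curb overestimation
    # (one common reading of "double Q-learning"; an assumption here).
    q_term = sum(
        min(q1(s, p), q2(s, p)) for s, p in zip(states, policy_actions)
    ) / n
    # Minimizing this loss imitates the data while pushing toward high Q.
    return imitation - alpha * q_term
```

With a small `alpha` the loss behaves like plain behavior cloning; a larger `alpha` lets the critics pull the policy away from low-value logged actions.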

📝 Abstract

With the rapid development of e-commerce, auto-bidding has become a key asset in optimizing advertising performance under diverse advertiser environments. Current approaches focus on reinforcement learning (RL) and generative models. These efforts imitate offline historical behaviors using complex structures that require expensive hyperparameter tuning, and suboptimal trajectories in the data further exacerbate the difficulty of policy learning. To address these challenges, we propose QGA, a novel Q-value regularized Generative Auto-bidding method. In QGA, we plug Q-value regularization with a double Q-learning strategy into the Decision Transformer (DT) backbone. This design enables joint optimization of policy imitation and action-value maximization, allowing the learned bidding policy to both leverage experience from the dataset and alleviate the adverse impact of the suboptimal trajectories. Furthermore, to safely explore the policy space beyond the data distribution, we propose a Q-value guided dual-exploration mechanism, in which the DT model is conditioned on multiple return-to-go targets and locally perturbed actions. The entire exploration process is dynamically guided by the aforementioned Q-value module, which provides principled evaluation for each candidate action. Experiments on public benchmarks and simulation environments demonstrate that QGA consistently achieves superior or highly competitive results compared to existing alternatives. Notably, in large-scale real-world A/B testing, QGA achieves a 3.27% increase in Ad GMV and a 2.49% improvement in Ad ROI.
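The dual-exploration idea in the abstract (several return-to-go targets, each followed by local action perturbation, with a Q-value module scoring every candidate) might look roughly like the sketch below. Everything specific is hypothetical: the name `q_guided_dual_exploration`, the `policy(state, rtg)` callable standing in for the DT model, the multiplicative perturbation scheme, and the min-of-two-critics score.

```python
from typing import Callable, Sequence, Tuple

State = Sequence[float]


def q_guided_dual_exploration(
    state: State,
    policy: Callable[[State, float], float],  # stand-in for the DT model
    q1: Callable[[State, float], float],      # hypothetical critic 1
    q2: Callable[[State, float], float],      # hypothetical critic 2
    rtg_targets: Sequence[float],
    perturbations: Sequence[float] = (-0.1, 0.0, 0.1),  # assumed scale
) -> Tuple[float, float]:
    """Return the candidate action with the highest conservative Q-value."""
    best_action, best_score = 0.0, float("-inf")
    for rtg in rtg_targets:
        # Exploration axis 1: condition the policy on a different target.
        base_action = policy(state, rtg)
        for delta in perturbations:
            # Exploration axis 2: perturb the proposed action locally.
            candidate = base_action * (1.0 + delta)
            # Score with the smaller critic to stay conservative.
            score = min(q1(state, candidate), q2(state, candidate))
            if score > best_score:
                best_action, best_score = candidate, score
    return best_action, best_score
```

Keeping a zero perturbation in the candidate set means the unperturbed policy output is always considered, so the Q-value module can only replace it with a candidate it scores higher.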

Problem

Research questions and friction points this paper is trying to address.

auto-bidding

suboptimal trajectories

policy learning

reinforcement learning

generative models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Q-value regularization

Decision Transformer

auto-bidding