🤖 AI Summary
This work addresses the challenge of maximizing an advertiser's cumulative value under strict budget constraints in low-data online advertising settings. To this end, the authors propose DARA, a two-stage framework: the first stage leverages the in-context learning capability of large language models (LLMs) to generate an initial campaign plan, while the second stage refines this plan through feedback-driven reasoning for precise numerical optimization. The approach combines the few-shot generalization strength of LLMs with reinforcement-learning fine-tuning, introducing GRPO-Adaptive, a post-training strategy that dynamically updates the reference policy during training. By decoupling decision-making into distinct reasoning and optimization phases, DARA outperforms existing baselines on both real-world and synthetic datasets, consistently improving advertisers' cumulative value under stringent budget limits.
📝 Abstract
Optimizing an advertiser's cumulative value of winning impressions under budget constraints is a complex challenge in online advertising under the AI-Generated Bidding (AIGB) paradigm. Advertisers often have personalized objectives but limited historical interaction data, resulting in few-shot scenarios where traditional reinforcement learning (RL) methods struggle to perform effectively. Large Language Models (LLMs) offer a promising alternative for AIGB by leveraging their in-context learning capabilities to generalize from limited data; however, they lack the numerical precision required for fine-grained optimization. To address this limitation, we introduce GRPO-Adaptive, an efficient LLM post-training strategy that enhances both reasoning and numerical precision by dynamically updating the reference policy during training. Built upon this foundation, we further propose DARA, a novel dual-phase framework that decomposes the decision-making process into two stages: a few-shot reasoner that generates initial plans via in-context prompting, and a fine-grained optimizer that refines these plans through feedback-driven reasoning. This separation allows DARA to combine LLMs' in-context learning strengths with the precise adaptability required by AIGB tasks. Extensive experiments in both real-world and synthetic data environments demonstrate that our approach consistently outperforms existing baselines in cumulative advertiser value under budget constraints.
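The abstract's core mechanism, GRPO-Adaptive, builds on GRPO's group-relative reward normalization while dynamically updating the reference policy during training. A minimal, hypothetical sketch of these two ingredients is below; the group-normalization step follows standard GRPO, but the EMA-style reference update is only one plausible interpretation of "dynamically updating the reference policy," not the paper's actual rule, and all function names are illustrative:

```python
import math

def group_relative_advantages(rewards):
    """Standard GRPO-style step: normalize rewards within a group of
    sampled responses, so each advantage is relative to the group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) + 1e-8  # small epsilon avoids division by zero
    return [(r - mean) / std for r in rewards]

def ema_update(ref_params, policy_params, tau=0.05):
    """One possible 'adaptive reference' rule (an assumption, not the
    paper's): move the KL reference policy toward the current policy
    via an exponential moving average of parameters."""
    return [(1 - tau) * rp + tau * pp for rp, pp in zip(ref_params, policy_params)]

# Toy usage: four sampled plans with scalar rewards.
rewards = [1.0, 0.2, 0.5, 0.9]
advs = group_relative_advantages(rewards)
print([round(a, 3) for a in advs])  # zero-mean, unit-scale advantages

# Reference parameters drift slowly toward the current policy each step.
ref = ema_update([0.0, 0.0], [1.0, -1.0])
print(ref)
```

Keeping the reference close to the current policy loosens the KL anchor as training progresses, which is one way a dynamically updated reference could permit finer-grained numerical adjustments than a frozen reference.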