Online Decision Making with Generative Action Sets

📅 2025-09-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the “exploitation–exploration–creation” trilemma that arises when generative AI dynamically expands the action space during online decision-making. Because action generation incurs an explicit cost, classical multi-armed bandit frameworks no longer apply. The authors propose a doubly-optimistic algorithm: it selects among existing actions via Lower Confidence Bounds (LCB) and guides the generation of novel actions via Upper Confidence Bounds (UCB). To their knowledge, this is the first algorithm to achieve sublinear regret in settings with an expandable action space. They establish a regret upper bound of $O\big(T^{d/(d+2)} d^{d/(d+2)} + d\sqrt{T \log T}\big)$, matching the current state of the art. Experiments on a medical question-answering dataset show that the method achieves a superior trade-off between generation quality and decision performance, significantly improving learning efficiency under dynamic action expansion.

📝 Abstract
With advances in generative AI, decision-making agents can now dynamically create new actions during online learning, but action generation typically incurs costs that must be balanced against potential benefits. We study an online learning problem where an agent can generate new actions at any time step by paying a one-time cost, with these actions becoming permanently available for future use. The challenge lies in learning the optimal sequence of two-fold decisions: which action to take and when to generate new ones, further complicated by the triangular tradeoffs among exploitation, exploration and $\textit{creation}$. To solve this problem, we propose a doubly-optimistic algorithm that employs Lower Confidence Bounds (LCB) for action selection and Upper Confidence Bounds (UCB) for action generation. Empirical evaluation on healthcare question-answering datasets demonstrates that our approach achieves favorable generation-quality tradeoffs compared to baseline strategies. From theoretical perspectives, we prove that our algorithm achieves the optimal regret of $O(T^{\frac{d}{d+2}}d^{\frac{d}{d+2}} + d\sqrt{T\log T})$, providing the first sublinear regret bound for online learning with expanding action spaces.
Problem

Research questions and friction points this paper is trying to address.

Balancing action generation costs with decision-making benefits
Optimizing sequential decisions for action selection and creation
Managing tradeoffs among exploitation, exploration and action creation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Doubly-optimistic algorithm using LCB and UCB
Balances exploitation, exploration and action creation
Achieves sublinear regret for expanding action spaces
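The doubly-optimistic scheme above can be illustrated with a toy bandit loop. This is a hypothetical sketch, not the paper's algorithm: the environment (Bernoulli losses with means drawn uniformly at generation time), the confidence-radius constants, and the amortized-cost generation rule are all illustrative assumptions. It shows the two uses of optimism: an LCB over existing actions' losses for selection, and an optimistic value for a not-yet-generated action, compared against the generation cost, to decide when to create.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 2000            # horizon (illustrative)
GEN_COST = 5.0      # one-time cost paid when a new action is generated

# Hypothetical environment: generating an action draws its true mean loss
# from U(0, 1); playing an action yields a Bernoulli loss with that mean.
means, counts, loss_sums = [], [], []

def generate_action():
    means.append(rng.uniform(0.0, 1.0))
    counts.append(0)
    loss_sums.append(0.0)

def play(i):
    loss = float(rng.random() < means[i])
    counts[i] += 1
    loss_sums[i] += loss
    return loss

generate_action()               # start with a single action
total_loss = 0.0

for t in range(1, T + 1):
    # Optimism for the *selection* step: lower confidence bound on each
    # existing action's mean loss (untried actions get -inf, i.e. try first).
    lcbs = [loss_sums[i] / counts[i] - np.sqrt(2.0 * np.log(t) / counts[i])
            if counts[i] > 0 else -np.inf
            for i in range(len(means))]
    best = int(np.argmin(lcbs))

    # Optimism for the *generation* step: a fresh action might have loss near
    # 0 (the lower support of the prior), so generate only when even the most
    # optimistic existing action looks worse than the generation cost
    # amortized over the remaining rounds (an assumed, illustrative rule).
    amortized = GEN_COST / max(T - t, 1)
    if min(lcbs) > amortized:
        generate_action()
        total_loss += GEN_COST
        best = len(means) - 1   # immediately try the newly generated action

    total_loss += play(best)
```

Under this rule the agent generates new actions early, when all well-estimated actions look poor relative to the amortized cost, and settles into pure selection as the horizon shrinks, which is the qualitative behavior the exploitation–exploration–creation trade-off calls for.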
Jianyu Xu
Carnegie Mellon University
Vidhi Jain
Carnegie Mellon University
Bryan Wilder
Assistant Professor of Machine Learning, Carnegie Mellon University
Artificial intelligence, optimization, machine learning, social networks
Aarti Singh
Carnegie Mellon University