Online Decision Making with Generative Action Sets

📅 2025-09-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the “exploitation–exploration–creation” trilemma that arises when generative AI dynamically expands the action space during online decision-making. Because action generation incurs an explicit cost, classical multi-armed bandit frameworks no longer apply. The authors propose a doubly-optimistic algorithm: it selects among existing actions via Lower Confidence Bounds (LCB) and guides the generation of novel actions via Upper Confidence Bounds (UCB). To their knowledge, this is the first algorithm to achieve sublinear regret in settings with an expandable action space. They establish a regret upper bound of $O\big(T^{d/(d+2)} d^{d/(d+2)} + d\sqrt{T \log T}\big)$, matching the current state of the art. Experiments on a medical question-answering dataset show that the method achieves a superior trade-off between generation quality and decision performance, significantly improving learning efficiency under dynamic action expansion.

📝 Abstract
With advances in generative AI, decision-making agents can now dynamically create new actions during online learning, but action generation typically incurs costs that must be balanced against potential benefits. We study an online learning problem where an agent can generate new actions at any time step by paying a one-time cost, with these actions becoming permanently available for future use. The challenge lies in learning the optimal sequence of two-fold decisions: which action to take and when to generate new ones, further complicated by the triangular tradeoffs among exploitation, exploration and $\textit{creation}$. To solve this problem, we propose a doubly-optimistic algorithm that employs Lower Confidence Bounds (LCB) for action selection and Upper Confidence Bounds (UCB) for action generation. Empirical evaluation on healthcare question-answering datasets demonstrates that our approach achieves favorable generation-quality tradeoffs compared to baseline strategies. From theoretical perspectives, we prove that our algorithm achieves the optimal regret of $O(T^{\frac{d}{d+2}}d^{\frac{d}{d+2}} + d\sqrt{T\log T})$, providing the first sublinear regret bound for online learning with expanding action spaces.
Problem

Research questions and friction points this paper is trying to address.

Balancing action generation costs with decision-making benefits
Optimizing sequential decisions for action selection and creation
Managing tradeoffs among exploitation, exploration and action creation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Doubly-optimistic algorithm using LCB and UCB
Balances exploitation, exploration and action creation
Achieves sublinear regret for expanding action spaces
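The doubly-optimistic scheme above can be illustrated with a toy bandit loop. This is a hypothetical sketch, not the paper's algorithm: the environment (Bernoulli losses with means drawn uniformly at generation time), the confidence-radius constants, and the amortized-cost generation rule are all illustrative assumptions. It shows the two uses of optimism: an LCB over existing actions' losses for selection, and an optimistic value for a not-yet-generated action, compared against the generation cost, to decide when to create.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 2000            # horizon (illustrative)
GEN_COST = 5.0      # one-time cost paid when a new action is generated

# Hypothetical environment: generating an action draws its true mean loss
# from U(0, 1); playing an action yields a Bernoulli loss with that mean.
means, counts, loss_sums = [], [], []

def generate_action():
    means.append(rng.uniform(0.0, 1.0))
    counts.append(0)
    loss_sums.append(0.0)

def play(i):
    loss = float(rng.random() < means[i])
    counts[i] += 1
    loss_sums[i] += loss
    return loss

generate_action()               # start with a single action
total_loss = 0.0

for t in range(1, T + 1):
    # Optimism for the *selection* step: lower confidence bound on each
    # existing action's mean loss (untried actions get -inf, i.e. try first).
    lcbs = [loss_sums[i] / counts[i] - np.sqrt(2.0 * np.log(t) / counts[i])
            if counts[i] > 0 else -np.inf
            for i in range(len(means))]
    best = int(np.argmin(lcbs))

    # Optimism for the *generation* step: a fresh action might have loss near
    # 0 (the lower support of the prior), so generate only when even the most
    # optimistic existing action looks worse than the generation cost
    # amortized over the remaining rounds (an assumed, illustrative rule).
    amortized = GEN_COST / max(T - t, 1)
    if min(lcbs) > amortized:
        generate_action()
        total_loss += GEN_COST
        best = len(means) - 1   # immediately try the newly generated action

    total_loss += play(best)
```

Under this rule the agent generates new actions early, when all well-estimated actions look poor relative to the amortized cost, and settles into pure selection as the horizon shrinks, which is the qualitative behavior the exploitation–exploration–creation trade-off calls for.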
Jianyu Xu
Carnegie Mellon University
Vidhi Jain
Carnegie Mellon University
Bryan Wilder
Assistant Professor of Machine Learning, Carnegie Mellon University
Artificial intelligence, optimization, machine learning, social networks
Aarti Singh
Carnegie Mellon University