Representative Action Selection for Large Action Space: From Bandits to MDPs

📅 2025-11-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of intractable action-space enumeration in reinforcement learning with large action spaces, this paper proposes an action-subset selection framework that accommodates environmental heterogeneity: it identifies a small, fixed subset of actions that, for every environment in the family, is guaranteed to contain a near-optimal action. Methodologically, the authors extend the meta-bandit paradigm to Markov decision processes (MDPs) by constructing a relaxed, non-centered sub-Gaussian process model, integrated with submodular optimization and statistical learning theory to enable efficient action pruning. Theoretically, they establish near-optimality guarantees for the selected subsets across diverse environments. Empirically, the approach achieves performance comparable to full-action-space learning while substantially reducing both sample and computational complexity. This work provides a theoretically rigorous and practically viable solution for high-dimensional action domains, including inventory management and recommender systems.

📝 Abstract
We study the problem of selecting a small, representative action subset from an extremely large action space shared across a family of reinforcement learning (RL) environments -- a fundamental challenge in applications like inventory management and recommendation systems, where direct learning over the entire space is intractable. Our goal is to identify a fixed subset of actions that, for every environment in the family, contains a near-optimal action, thereby enabling efficient learning without exhaustively evaluating all actions. This work extends our prior results for meta-bandits to the more general setting of Markov Decision Processes (MDPs). We prove that our existing algorithm achieves performance comparable to using the full action space. This theoretical guarantee is established under a relaxed, non-centered sub-Gaussian process model, which accommodates greater environmental heterogeneity. Consequently, our approach provides a computationally and sample-efficient solution for large-scale combinatorial decision-making under uncertainty.
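The abstract's "relaxed, non-centered sub-Gaussian process model" can be illustrated by the standard sub-Gaussian tail condition; the notation below is ours, not taken from the paper, and is meant only to indicate the kind of relaxation involved:

```latex
% A reward variable $X_a$ is $\sigma$-sub-Gaussian about its mean $\mu_a$ if
%   \mathbb{E}\!\left[\exp\bigl(\lambda (X_a - \mu_a)\bigr)\right]
%     \le \exp\!\left(\tfrac{\lambda^2 \sigma^2}{2}\right)
%   \quad \text{for all } \lambda \in \mathbb{R}.
% The non-centered relaxation drops the requirement that the process be
% centered at a common mean: each environment may have its own $\mu_a$,
% which is what accommodates greater environmental heterogeneity.
```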
Problem

Research questions and friction points this paper is trying to address.

Selecting a small representative action subset from large spaces
Enabling efficient learning without evaluating all actions
Extending meta-bandit results to Markov Decision Processes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Selects representative subset from large action space
Extends a meta-bandit algorithm to Markov Decision Processes
Uses relaxed sub-Gaussian model for environmental heterogeneity
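The combination of subset selection and submodular optimization described above can be sketched as a greedy coverage problem: pick `k` actions so that every environment retains at least one high-value action. This is a toy illustration with synthetic data, not the paper's algorithm; the coverage objective, the greedy rule, and all variable names are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: estimated value of each action in each environment.
# In the paper these estimates would come from learning; here they are random.
n_envs, n_actions, k = 5, 100, 8
values = rng.normal(size=(n_envs, n_actions))

def coverage(values, subset):
    """Monotone submodular objective: sum over environments of the
    best value attainable using only actions in `subset`."""
    return values[:, subset].max(axis=1).sum()

def greedy_subset(values, k):
    """Greedily grow a k-action subset, adding the action with the
    largest marginal coverage gain at each step."""
    subset = []
    for _ in range(k):
        remaining = [a for a in range(values.shape[1]) if a not in subset]
        best = max(remaining, key=lambda a: coverage(values, subset + [a]))
        subset.append(best)
    return subset

subset = greedy_subset(values, k)
# Per-environment gap between the overall best action and the best
# action available inside the selected subset (0 means no loss).
gaps = values.max(axis=1) - values[:, subset].max(axis=1)
```

For monotone submodular objectives like this coverage function, greedy selection carries the classical (1 - 1/e) approximation guarantee, which is why submodularity is a natural tool for representative-subset problems.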