🤖 AI Summary
This work addresses the challenge of scaling reinforcement learning to large discrete or hybrid action spaces, where traditional approaches suffer from the curse of dimensionality due to their reliance on grid structures or expensive nearest-neighbor searches. The authors propose Distance-Guided Reinforcement Learning (DGRL), which performs stochastic volumetric exploration in a semantic embedding space to achieve full coverage of a local trust region. DGRL combines Sampled Dynamic Neighborhoods (SDN) with Distance-Based Updates (DBU), transforming policy optimization into a stable regression task that decouples gradient variance from action space cardinality. This enables effective non-hierarchical modeling of hybrid continuous-discrete actions. Experiments demonstrate that DGRL outperforms state-of-the-art methods by up to 66% across regularly and irregularly structured environments, while significantly accelerating convergence and reducing computational cost.
📝 Abstract
Reinforcement Learning is increasingly applied to logistics, scheduling, and recommender systems, but standard algorithms struggle with the curse of dimensionality in the large discrete action spaces these domains induce. Existing algorithms typically rely on restrictive grid-based structures or computationally expensive nearest-neighbor searches, limiting their effectiveness in high-dimensional or irregularly structured domains. We propose Distance-Guided Reinforcement Learning (DGRL), combining Sampled Dynamic Neighborhoods (SDN) and Distance-Based Updates (DBU) to enable efficient RL in spaces with up to $10^{20}$ actions. Unlike prior methods, SDN leverages a semantic embedding space to perform stochastic volumetric exploration, provably providing full support over a local trust region. Complementing this, DBU transforms policy optimization into a stable regression task, decoupling gradient variance from action space cardinality and guaranteeing monotonic policy improvement. DGRL naturally generalizes to hybrid continuous-discrete action spaces without requiring hierarchical dependencies. We demonstrate performance improvements of up to 66% against state-of-the-art benchmarks across regularly and irregularly structured environments, while simultaneously improving convergence speed and reducing computational complexity.
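To make the two components concrete, here is a minimal, hypothetical sketch of the ideas the abstract describes: SDN as uniform sampling inside a ball (a local trust region) around the policy's proposed point in a semantic embedding space, and DBU as a regression step pulling that point toward the embedding of the best-evaluated action. All names (`sdn_sample`, `dbu_update`, the embedding table, the snapping step) are illustrative assumptions, not the paper's actual implementation; in particular, the naive nearest-embedding snap below stands in for whatever candidate-to-action mapping DGRL actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: each discrete action has a semantic embedding.
N_ACTIONS, DIM = 1000, 8
action_embeddings = rng.normal(size=(N_ACTIONS, DIM))

def sdn_sample(center, radius, k):
    """SDN sketch: draw k points uniformly inside a ball of the given
    radius around the policy's proposed embedding (the trust region),
    then snap each point to its closest action embedding. Returns the
    unique candidate action indices to evaluate."""
    directions = rng.normal(size=(k, DIM))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    radii = radius * rng.uniform(size=(k, 1)) ** (1.0 / DIM)  # uniform in ball
    points = center + radii * directions
    # Naive stand-in for the candidate-to-action mapping.
    dists = np.linalg.norm(action_embeddings[None] - points[:, None], axis=-1)
    return np.unique(dists.argmin(axis=1))

def dbu_update(center, best_embedding, lr=0.5):
    """DBU sketch: a regression step moving the policy's proposed
    embedding toward the embedding of the best-evaluated action,
    so the update magnitude depends on distance, not on the number
    of actions."""
    return center + lr * (best_embedding - center)

center = np.zeros(DIM)
candidates = sdn_sample(center, radius=2.0, k=32)
# Evaluate candidates with any scoring function (e.g. a learned Q-estimate);
# here we just pick the first candidate as a placeholder "best" action.
new_center = dbu_update(center, action_embeddings[candidates[0]])
```

Because the update is a regression toward a fixed target rather than a likelihood gradient over all actions, its variance does not grow with the size of the action set, which is the stability property the abstract attributes to DBU.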