🤖 AI Summary
Boltzmann exploration in reinforcement learning (RL) scales poorly to large action sets because it must compute a softmax value for every action. This work targets settings where actions are represented by hyperspherical embedding vectors. Method: We propose von Mises-Fisher exploration (vMF-exp), a scalable exploration method that samples a vector around the state embedding from a von Mises-Fisher (vMF) distribution and then explores that vector's nearest neighbors among the action embeddings, which can be retrieved with approximate nearest-neighbor search rather than exhaustive enumeration. Contribution/Results: We show that, under theoretical assumptions, vMF-exp asymptotically explores each action with the same probability as Boltzmann exploration, which makes it a scalable alternative for large action sets with hyperspherical embeddings. Experiments on simulated data, real-world public data, and a large-scale deployment in the recommender system of a global music streaming service empirically validate the method's key properties. Because vMF-exp avoids the O(|A|) per-action softmax computation, it scales to very large numbers of candidate actions.
📝 Abstract
This paper introduces von Mises-Fisher exploration (vMF-exp), a scalable method for exploring large action sets in reinforcement learning problems where hyperspherical embedding vectors represent these actions. vMF-exp involves initially sampling a state embedding representation using a von Mises-Fisher distribution, then exploring this representation's nearest neighbors, which scales to virtually unlimited numbers of candidate actions. We show that, under theoretical assumptions, vMF-exp asymptotically maintains the same probability of exploring each action as Boltzmann Exploration (B-exp), a popular alternative that, nonetheless, suffers from scalability issues as it requires computing softmax values for each action. Consequently, vMF-exp serves as a scalable alternative to B-exp for exploring large action sets with hyperspherical embeddings. Experiments on simulated data, real-world public data, and the successful large-scale deployment of vMF-exp on the recommender system of a global music streaming service empirically validate the key properties of the proposed method.
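The two exploration schemes described above can be contrasted with a minimal sketch. It is a toy illustration, not the authors' implementation, and it rests on several assumptions: embeddings live on the unit circle (in two dimensions the vMF distribution reduces to the von Mises distribution, which Python's standard library can sample), a brute-force search stands in for an approximate nearest-neighbor index, and the names `boltzmann_probs`, `vmf_exp_sample`, and the concentration parameter `kappa` are hypothetical.

```python
import math
import random


def unit(angle):
    """Unit vector on the circle (the 1-sphere) at the given angle."""
    return (math.cos(angle), math.sin(angle))


def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]


def boltzmann_probs(state_angle, action_angles, kappa):
    """B-exp: softmax over kappa * <state, action>; visits every action."""
    state = unit(state_angle)
    logits = [kappa * dot(state, unit(a)) for a in action_angles]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - top) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]


def vmf_exp_sample(state_angle, action_angles, kappa, rng):
    """vMF-exp: draw a direction around the state, return its nearest action."""
    sampled = unit(rng.vonmisesvariate(state_angle, kappa))
    # Brute-force search stands in for an approximate nearest-neighbor index.
    return max(range(len(action_angles)),
               key=lambda i: dot(sampled, unit(action_angles[i])))


if __name__ == "__main__":
    rng = random.Random(0)
    n_actions, n_draws, kappa, state_angle = 20, 20000, 2.0, 1.0
    action_angles = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_actions)]
    counts = [0] * n_actions
    for _ in range(n_draws):
        counts[vmf_exp_sample(state_angle, action_angles, kappa, rng)] += 1
    probs = boltzmann_probs(state_angle, action_angles, kappa)
    for i in range(n_actions):
        print(f"action {i:2d}  B-exp {probs[i]:.3f}  vMF-exp {counts[i] / n_draws:.3f}")
```

Running the script prints, for each action, its B-exp probability next to its empirical vMF-exp selection frequency. For a small action set like this one the two columns need not agree, since the equivalence stated in the abstract is asymptotic and holds under theoretical assumptions. The point of the sketch is the difference in work per draw: `boltzmann_probs` evaluates every action, while `vmf_exp_sample` needs only one random draw plus a nearest-neighbor lookup.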