🤖 AI Summary
This work addresses the challenges of inefficient exploration and training instability in maximum-entropy reinforcement learning for high-dimensional humanoid control, which stem from the curse of dimensionality. To overcome these issues, the authors propose a Dimension-wise Entropy Modulation (DEM) mechanism that dynamically allocates exploration resources across individual action dimensions. This approach is further integrated with a continuous distributional critic to mitigate value overestimation in high-dimensional action spaces. The resulting method significantly enhances both the efficiency and stability of stochastic policy learning. Empirical evaluations on the HumanoidBench benchmark demonstrate that the proposed approach outperforms existing deterministic methods, achieving performance improvements of 180% and 400% on the Basketball and Balance Hard tasks, respectively.
📝 Abstract
Scaling Maximum Entropy Reinforcement Learning (RL) to high-dimensional humanoid control remains a formidable challenge, as the "curse of dimensionality" induces severe exploration inefficiency and training instability in expansive action spaces. Consequently, recent high-throughput paradigms have largely converged on deterministic policy gradients combined with massive parallel simulation. We challenge this compromise with FastDSAC, a framework that effectively unlocks the potential of maximum entropy stochastic policies for complex continuous control. We introduce Dimension-wise Entropy Modulation (DEM) to dynamically redistribute the exploration budget and enforce diversity, alongside a continuous distributional critic tailored to ensure value fidelity and mitigate high-dimensional value overestimation. Extensive evaluations on HumanoidBench and other continuous control tasks demonstrate that rigorously designed stochastic policies can consistently match or outperform deterministic baselines, achieving notable gains of 180% and 400% on the challenging *Basketball* and *Balance Hard* tasks, respectively.
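The core idea behind Dimension-wise Entropy Modulation — allocating the exploration budget per action dimension rather than via a single global entropy coefficient — can be sketched as a dimension-wise variant of SAC's automatic temperature tuning. The function name, update rule, and numbers below are illustrative assumptions for intuition, not the paper's actual implementation.

```python
import numpy as np

def dem_temperature_step(log_alphas, entropy_per_dim, target_entropy_per_dim, lr=1e-3):
    """One gradient step on per-dimension log-temperatures (hypothetical sketch).

    Dimension-wise analogue of SAC's temperature objective
    J(alpha_i) = alpha_i * (H_i - H_target_i): alpha_i grows when dimension
    i's policy entropy H_i falls below its target, steering more exploration
    budget toward that dimension, and shrinks otherwise.
    """
    log_alphas = np.asarray(log_alphas, dtype=float)
    gap = np.asarray(entropy_per_dim, dtype=float) - np.asarray(target_entropy_per_dim, dtype=float)
    # dJ/d(log alpha_i) = alpha_i * (H_i - H_target_i); descend on it
    grad = np.exp(log_alphas) * gap
    return log_alphas - lr * grad
```

Under this rule, a dimension whose policy entropy has collapsed below target gets a larger temperature, and hence a larger entropy bonus in the actor loss, while already-diverse dimensions are annealed down — the dynamic redistribution the abstract describes, modulo whatever modulation scheme FastDSAC actually uses.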