🤖 AI Summary
This work addresses credit assignment and high gradient variance in reinforcement learning with hybrid discrete-continuous action spaces, where conventional score-function policy gradient methods degrade significantly, particularly when the continuous action space is high-dimensional. The paper proposes Hybrid Policy Optimization (HPO), which combines pathwise derivatives and score-function gradients into an unbiased mixed gradient estimator for use with differentiable simulators. Theoretical analysis shows that the cross term in the mixed gradient becomes negligible near a discrete best response, which justifies approximately decoupled updates of the continuous and discrete components and reduces variance near optimality. Empirically, HPO substantially outperforms Proximal Policy Optimization (PPO) on inventory control and switched linear-quadratic regulator tasks, with the performance gap widening as the dimension of the continuous action space grows.
📝 Abstract
We study reinforcement learning in hybrid discrete-continuous action spaces, such as settings where the discrete component selects a regime (or index) and the continuous component optimizes within it -- a structure common in robotics, control, and operations problems. Standard model-free policy gradient methods rely on score-function (SF) estimators and suffer from severe credit-assignment issues in high-dimensional settings, leading to poor gradient quality. On the other hand, differentiable simulation largely sidesteps these issues by backpropagating through a simulator, but the presence of discrete actions or non-smooth dynamics yields biased or uninformative gradients. To address this, we propose Hybrid Policy Optimization (HPO), which backpropagates through the simulator wherever smoothness permits, using a mixed gradient estimator that combines pathwise and SF gradients while maintaining unbiasedness. We also show how problems with action discontinuities can be reformulated in hybrid form, further broadening HPO's applicability. Empirically, HPO substantially outperforms PPO on inventory control and switched linear-quadratic regulator problems, with performance gaps increasing as the continuous action dimension grows. Finally, we characterize the structure of the mixed gradient, showing that its cross term -- which captures how continuous actions influence future discrete decisions -- becomes negligible near a discrete best response, thereby enabling approximate decentralized updates of the continuous and discrete components and reducing variance near optimality. All resources are available at github.com/MatiasAlvo/hybrid-rl.
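To make the idea of a mixed estimator concrete, here is a minimal one-step sketch. It is not the authors' implementation and is not taken from their repository. It assumes a hypothetical toy problem: a discrete regime `d` drawn from a softmax over `logits`, a continuous action `a = means[d] + SIGMA * eps` with Gaussian noise `eps`, and a smooth quadratic reward standing in for a differentiable simulator. All names (`TARGETS`, `SIGMA`, `mixed_gradient`, `exact_gradient`) are illustrative. The continuous parameters receive a pathwise gradient through the reward, and the discrete logits receive a score-function gradient. Because the toy is single-step, the cross term discussed in the abstract (continuous actions influencing future discrete decisions) does not arise here.

```python
import math
import random

TARGETS = [1.0, -2.0]  # hypothetical per-regime optimum of the continuous action
SIGMA = 0.5            # fixed noise scale of the continuous action (assumption)


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def reward(d, a):
    # Smooth in a; stands in for a differentiable simulator step.
    return -(a - TARGETS[d]) ** 2


def reward_grad_a(d, a):
    # Analytic d(reward)/da, i.e. what backpropagation would return.
    return -2.0 * (a - TARGETS[d])


def mixed_gradient(logits, means, n_samples, rng):
    """Monte Carlo estimate of (dJ/dlogits, dJ/dmeans) for J = E[reward]."""
    probs = softmax(logits)
    k = len(logits)
    g_logits = [0.0] * k
    g_means = [0.0] * k
    for _ in range(n_samples):
        d = rng.choices(range(k), weights=probs)[0]
        a = means[d] + SIGMA * rng.gauss(0.0, 1.0)  # reparameterized action
        r = reward(d, a)
        # Pathwise part: da/dmeans[d] = 1, so the gradient is d(reward)/da.
        g_means[d] += reward_grad_a(d, a)
        # Score-function part: d log pi(d) / d logits[j] = 1[j == d] - probs[j].
        for j in range(k):
            g_logits[j] += r * ((1.0 if j == d else 0.0) - probs[j])
    return [g / n_samples for g in g_logits], [g / n_samples for g in g_means]


def exact_gradient(logits, means):
    """Closed-form gradient of J for this toy, used only as a reference."""
    probs = softmax(logits)
    returns = [-((m - t) ** 2 + SIGMA ** 2) for m, t in zip(means, TARGETS)]
    j = sum(p * r for p, r in zip(probs, returns))
    g_logits = [p * (r - j) for p, r in zip(probs, returns)]
    g_means = [-2.0 * p * (m - t) for p, m, t in zip(probs, means, TARGETS)]
    return g_logits, g_means


if __name__ == "__main__":
    rng = random.Random(0)
    print("estimate:", mixed_gradient([0.2, -0.1], [0.0, 0.0], 100_000, rng))
    print("exact:   ", exact_gradient([0.2, -0.1], [0.0, 0.0]))
```

Comparing the two printed tuples gives a quick sanity check that the sample average approaches the closed-form gradient in this toy setting. In a multi-step problem the pathwise part would instead be obtained by backpropagating through the simulator, which this sketch does not attempt.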