🤖 AI Summary
This work addresses the performance degradation that programmatic reinforcement learning often suffers when continuously relaxed policies are discretized post hoc, a step that can collapse policy expressivity. To avoid relying on post-hoc discretization and subsequent fine-tuning, the authors propose DiPRL, a method that learns near-discrete programmatic policies directly during differentiable training. DiPRL combines continuous relaxation, differentiable program synthesis, and a novel entropy regularization mechanism over program architectures to guide the policy toward an interpretable discrete form. Experiments show that DiPRL produces concise, interpretable programmatic policies and achieves strong performance across a range of discrete and continuous reinforcement learning tasks.
📝 Abstract
Programmatic reinforcement learning (PRL) offers an interpretable alternative to deep reinforcement learning by representing policies as programs that humans can read and edit. While gradient-based methods have been developed to optimize continuous relaxations of programs, they suffer a significant performance drop when the continuous relaxations are converted back into discrete programs. Post-hoc discretization can discard optimized branches and parameters in a program, which collapses policy expressivity and lowers task performance, and in turn creates a need for additional fine-tuning. To overcome these limitations, we propose Differentiable Discrete Programmatic Reinforcement Learning (DiPRL), a method that learns programmatic policies that become nearly discrete during training, avoiding a separate post-hoc fine-tuning stage. We first analyze the inherent risk of a performance drop that post-hoc discretization introduces in gradient-based methods. Then, we introduce programmatic architecture entropy regularization, which enables smooth, differentiable training that encourages convergence toward a discrete program. DiPRL maintains the efficiency of gradient-based optimization while mitigating the risks of post-hoc discretization. Our experiments across multiple discrete and continuous RL tasks demonstrate that DiPRL can achieve strong performance via interpretable programmatic policies.
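To make the discretization gap and the role of an entropy penalty concrete, here is a minimal Python sketch. It is not the DiPRL implementation: the abstract does not specify the regularizer's form, so the sketch assumes a softmax over per-node architecture logits and a Shannon-entropy penalty added to the task loss. All names (`softmax`, `discretization_gap`, `regularized_loss`, `entropy_weight`) and the toy numbers are hypothetical.

```python
import math

# Illustrative sketch only. Assumption: a program node mixes its candidate
# branches with softmax weights over "architecture logits"; the abstract does
# not give DiPRL's exact formulation.

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def relaxed_output(probs, branch_values):
    """Output of the continuous relaxation: a weighted mix of all branches."""
    return sum(p * v for p, v in zip(probs, branch_values))

def discretized_output(probs, branch_values):
    """Post-hoc discretization: keep only the highest-weight branch."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return branch_values[best]

def discretization_gap(logits, branch_values):
    """Absolute change in the node's output caused by discretizing it."""
    probs = softmax(logits)
    return abs(relaxed_output(probs, branch_values)
               - discretized_output(probs, branch_values))

def regularized_loss(task_loss, logits, entropy_weight):
    """Task loss plus an assumed entropy penalty on the architecture weights."""
    return task_loss + entropy_weight * entropy(softmax(logits))

# Toy node with three candidate branches (values are made up).
branch_values = [1.0, -1.0, 0.5]
soft_logits = [0.2, 0.1, 0.0]   # nearly uniform weights: high entropy
sharp_logits = [8.0, 0.1, 0.0]  # nearly one-hot weights: low entropy

if __name__ == "__main__":
    for name, logits in [("soft", soft_logits), ("sharp", sharp_logits)]:
        print(name,
              "entropy:", entropy(softmax(logits)),
              "gap:", discretization_gap(logits, branch_values))
```

In this toy setting, the near-uniform weights change the node's output substantially when only the top branch is kept, while the near one-hot weights change it very little. A penalty of this kind lowers the loss as the weights sharpen, which is the intuition behind training toward a program that is already nearly discrete.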