🤖 AI Summary
This work studies action suppression in environments where action validity depends on the state. When policy gradient algorithms are trained without action masks, the probabilities of valid actions at unvisited states decay exponentially, because gradients that push down invalid actions at visited states propagate through shared parameters and the softmax. The study gives a theoretical characterization of this suppression mechanism and shows that entropy regularization trades off protection of valid actions against sample efficiency. To remove the need for oracle-provided action masks, the authors propose a learned feasibility classifier. Experiments on Craftax, Craftax-Classic, and MiniHack show that the method mitigates action suppression without access to ground-truth masks, which allows deployment without oracle masks.
📝 Abstract
In reinforcement learning environments with state-dependent action validity, action masking consistently outperforms penalty-based handling of invalid actions, yet existing theory only shows that masking preserves the policy gradient theorem. We identify a distinct failure mode of unmasked training: it systematically suppresses valid actions at states the agent has not yet visited. This occurs because gradients pushing down invalid actions at visited states propagate through shared network parameters to unvisited states where those actions are valid. We prove that for softmax policies with shared features, when an action is invalid at visited states but valid at an unvisited state $s^*$, the probability $\pi(a \mid s^*)$ is bounded by exponential decay due to parameter sharing and the zero-sum identity of softmax logits. This bound reveals that entropy regularization trades off between protecting valid actions and sample efficiency, a tradeoff that masking eliminates. We validate empirically that deep networks exhibit the feature alignment condition required for suppression, and experiments on Craftax, Craftax-Classic, and MiniHack confirm the predicted exponential suppression and demonstrate that feasibility classification enables deployment without oracle masks.
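The suppression mechanism described in the abstract can be illustrated with a toy linear softmax policy. The sketch below is not the paper's setup: the feature vectors, the penalty of -1, the learning rate, and the use of exact expected gradients instead of sampled ones are all illustrative assumptions. Action 0 is invalid at a visited state `s` and valid at an unvisited state `s*` whose features are positively aligned with those of `s`. Only `s` is ever trained on.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)          # stabilize; also handles -inf entries
    e = np.exp(z)
    return e / e.sum()

def train(masked, steps=200, lr=0.5):
    """Return pi(action 0 | s*) after each update on the visited state s."""
    n_actions = 3
    phi_s = np.array([1.0, 0.0])         # features of visited state s (assumed)
    phi_star = np.array([0.8, 0.6])      # features of unvisited s*, aligned with phi_s (assumed)
    reward = np.array([-1.0, 0.0, 0.0])  # action 0 is invalid at s and penalized (assumed)
    valid_at_s = np.array([False, True, True])

    W = np.zeros((n_actions, 2))         # shared parameters: logits(s) = W @ phi(s)
    history = []
    for _ in range(steps):
        logits = W @ phi_s
        if masked:
            # Masking: invalid actions get probability zero at s.
            logits = np.where(valid_at_s, logits, -np.inf)
        pi = softmax(logits)
        expected_reward = pi @ reward
        # Exact policy gradient w.r.t. the logits at s; entries sum to zero.
        grad_logits = pi * (reward - expected_reward)
        # The same W produces the logits at s*, so this update leaks there.
        W += lr * np.outer(grad_logits, phi_s)
        history.append(softmax(W @ phi_star)[0])
    return np.array(history)

unmasked = train(masked=False)
with_mask = train(masked=True)
print("unmasked pi(a0 | s*) at steps 1, 50, 200:", unmasked[[0, 49, 199]])
print("masked   pi(a0 | s*) at steps 1, 50, 200:", with_mask[[0, 49, 199]])
```

In the unmasked run, the penalty lowers the logit of action 0 at `s` and, because the logit gradients sum to zero, raises the others; the aligned features carry both effects to `s*`, so the probability of the valid action there falls at every step. In the masked run, the invalid action has zero probability at `s`, the gradient on its logit vanishes, and the probability at `s*` stays at its initial value.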