🤖 AI Summary
This work addresses a core challenge in deep reinforcement learning: agents often explore infeasible actions during training and execution. Existing approaches rely on handcrafted symbolic mappings and manually specified action masks to enforce domain constraints. To overcome this limitation, the authors propose the Neural-Symbolic Action Masking (NSAM) framework, which automatically learns a symbolic state model consistent with domain constraints under minimal supervision and uses it to generate action masks that exclude infeasible actions. NSAM uniquely enables end-to-end integration of symbolic reasoning and policy optimization by jointly learning both the symbolic model and the action mask in a self-supervised manner—without requiring manual specification of symbolic mappings—so that the two mutually reinforce each other during training. Experimental results demonstrate that NSAM significantly improves sample efficiency and drastically reduces constraint violations across multiple constrained environments.
📝 Abstract
Deep reinforcement learning (DRL) agents may explore infeasible actions during training and execution. Existing approaches assume a symbol grounding function that maps high-dimensional states to consistent symbolic representations, together with manually specified action masking techniques to constrain actions. In this paper, we propose Neuro-symbolic Action Masking (NSAM), a novel framework that automatically learns symbolic models of high-dimensional states, consistent with given domain constraints, in a minimally supervised manner during the DRL process. Based on the learned symbolic state model, NSAM learns action masks that rule out infeasible actions. NSAM enables end-to-end integration of symbolic reasoning and deep policy optimization, where improvements in symbolic grounding and policy learning mutually reinforce each other. We evaluate NSAM on multiple constrained domains, and experimental results demonstrate that NSAM significantly improves the sample efficiency of DRL agents while substantially reducing constraint violations.
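The paper does not spell out how a learned mask is applied inside the policy, but the standard action-masking mechanism it builds on is straightforward: set the logits of infeasible actions to negative infinity before the softmax, so they receive zero probability and zero gradient. A minimal sketch of that mechanism (the function name and toy numbers are illustrative, not from the paper):

```python
import numpy as np

def masked_action_distribution(logits, mask):
    """Turn raw policy logits into a distribution over feasible actions only.

    logits: per-action scores from the policy network
    mask:   boolean array, True where the action is feasible
            (in NSAM this mask would be produced by the learned symbolic model)
    """
    masked = np.where(mask, logits, -np.inf)  # infeasible actions -> -inf logit
    z = masked - masked.max()                 # subtract max for numerical stability
    exp = np.exp(z)                           # exp(-inf) == 0, so masked actions vanish
    return exp / exp.sum()

# Toy example: actions 1 and 3 are ruled out by the mask.
logits = np.array([2.0, 1.0, 0.5, 3.0])
mask = np.array([True, False, True, False])
probs = masked_action_distribution(logits, mask)
# probs assigns zero probability to actions 1 and 3,
# and renormalizes over the feasible actions 0 and 2.
```

Because masked actions get exactly zero probability, the agent never samples them, which is what drives both the reduction in constraint violations and the gain in sample efficiency: exploration is concentrated on the feasible action set.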