🤖 AI Summary
Deep reinforcement learning suffers from low sample efficiency and lacks the hierarchical planning that humans apply naturally. To address these challenges, this paper proposes a neuro-symbolic framework that jointly optimizes Dylan, a differentiable symbolic planner, with end-to-end deep RL. Dylan integrates first-order logic reasoning, differentiable program execution, and dynamic reward shaping, overcoming key bottlenecks of classical symbolic planners such as susceptibility to deadlocks and incompatibility with neural networks. The method combines PPO-based policy-gradient optimization with a hierarchical policy-composition mechanism, enabling unified training of high-level behavioral orchestration and low-level action control. Evaluated on multi-task navigation and manipulation benchmarks, the approach reduces training steps by up to 73%, significantly improves cross-task generalization, and supports zero-shot transfer to unseen combinations of sub-goals.
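The summary does not spell out how the planner's dynamic reward shaping works internally. A common way to realize "guiding agents through intermediate subtasks" is potential-based reward shaping, where the potential is the symbolic plan's progress. The sketch below illustrates that idea only; all names are hypothetical and not the paper's actual API.

```python
# Minimal sketch: potential-based reward shaping driven by a symbolic
# plan's subtask progress. Illustrative only, not Dylan's real interface.

GAMMA = 0.99  # discount factor, assumed to match the RL agent's

def subtask_progress(state, plan):
    """Potential Phi(s): fraction of the plan's subtasks satisfied in `state`."""
    satisfied = sum(1 for subtask in plan if subtask(state))
    return satisfied / len(plan)

def shaped_reward(env_reward, state, next_state, plan):
    """Add the potential-based bonus gamma*Phi(s') - Phi(s); this form
    is known to preserve the optimal policy (Ng et al., 1999)."""
    phi = subtask_progress(state, plan)
    phi_next = subtask_progress(next_state, plan)
    return env_reward + GAMMA * phi_next - phi

# Toy coffee-making example: subtasks as predicates over a state dict.
plan = [
    lambda s: s["have_beans"],
    lambda s: s["at_machine"],
    lambda s: s["machine_loaded"],
]
s0 = {"have_beans": False, "at_machine": False, "machine_loaded": False}
s1 = {"have_beans": True, "at_machine": False, "machine_loaded": False}
bonus = shaped_reward(0.0, s0, s1, plan)  # positive: a subtask was completed
```

Completing a subtask yields an immediate positive bonus even when the environment reward is sparse, which is one plausible mechanism behind the reported reduction in training steps.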
📝 Abstract
When tackling complex problems, humans naturally break them down into smaller, manageable subtasks and adjust their initial plans based on observations. For instance, if you want to make coffee at a friend's place, you might initially plan to grab coffee beans, go to the coffee machine, and pour them into the machine. Upon noticing that the machine is already full, you would skip the initial steps and proceed directly to brewing. In stark contrast, state-of-the-art reinforcement learners, such as Proximal Policy Optimization (PPO), lack such prior knowledge and therefore require significantly more training steps to exhibit comparable adaptive behavior. Thus, a central research question arises: *How can we enable reinforcement learning (RL) agents to have similar "human priors", allowing the agent to learn with fewer training interactions?* To address this challenge, we propose the differentiable symbolic planner (Dylan), a novel framework that integrates symbolic planning into reinforcement learning. Dylan serves as a reward model that dynamically shapes rewards by leveraging human priors, guiding agents through intermediate subtasks and thus enabling more efficient exploration. Beyond reward shaping, Dylan can work as a high-level planner that composes primitive policies to generate new behaviors while avoiding common symbolic-planner pitfalls such as infinite execution loops. Our experimental evaluations demonstrate that Dylan significantly improves RL agents' performance and facilitates generalization to unseen tasks.
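The abstract's high-level planning idea, composing primitive policies while skipping already-satisfied subgoals and avoiding infinite execution loops, can be sketched as a simple dispatch loop. This is an illustrative sketch under assumed interfaces, not Dylan's actual implementation.

```python
# Hypothetical sketch: a high-level planner that composes primitive
# policies, skips subgoals that already hold (as in the coffee example),
# and detects repeated states to avoid infinite execution loops.
# All names are illustrative, not Dylan's real interface.

def execute_plan(state, plan, max_steps=100):
    """plan: ordered list of (is_satisfied, primitive_policy) pairs."""
    seen = set()
    for _ in range(max_steps):
        if all(done(state) for done, _ in plan):
            return state  # every subgoal achieved
        key = frozenset(state.items())
        if key in seen:  # revisited state: a primitive made no progress
            raise RuntimeError("execution loop detected; replan needed")
        seen.add(key)
        # Dispatch the first unsatisfied subgoal to its primitive policy.
        _, policy = next((d, p) for d, p in plan if not d(state))
        state = policy(state)
    raise RuntimeError("step budget exhausted")

# Coffee example: the machine is already loaded, so the loading
# primitive is skipped and only brewing runs.
load = lambda s: {**s, "machine_loaded": True}
brew = lambda s: {**s, "coffee_brewed": True}
plan = [
    (lambda s: s["machine_loaded"], load),
    (lambda s: s["coffee_brewed"], brew),
]
final = execute_plan({"machine_loaded": True, "coffee_brewed": False}, plan)
```

The visited-state check is one straightforward guard against the infinite-loop pitfall the abstract attributes to classical symbolic planners; a primitive that fails to change the state triggers replanning instead of looping forever.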