Sample-Efficient Neurosymbolic Deep Reinforcement Learning

📅 2026-01-06
🏛️ arXiv.org
📈 Citations: 0
Influential citations: 0
📄 PDF
🤖 AI Summary
This work addresses the challenges of low sample efficiency and poor generalization in deep reinforcement learning when applied to sparse-reward and long-horizon tasks. The authors propose a neuro-symbolic approach that encodes partial policies learned in simple environments into transferable logical rules, which serve as prior knowledge. These rules dynamically guide exploration during training through online inference, combining action distribution biasing with Q-value rescaling to enable efficient policy optimization. Evaluated in both fully and partially observable complex GridWorld settings, the method significantly outperforms existing reward-machine baselines, achieving faster convergence, higher asymptotic performance, and enhanced model interpretability and trustworthiness.

📝 Abstract
Reinforcement Learning (RL) is a well-established framework for sequential decision-making in complex environments. However, state-of-the-art Deep RL (DRL) algorithms typically require large training datasets and often struggle to generalize beyond small-scale training scenarios, even within standard benchmarks. We propose a neuro-symbolic DRL approach that integrates background symbolic knowledge to improve sample efficiency and generalization to more challenging, unseen tasks. Partial policies defined for simple domain instances, where high performance is easily attained, are transferred as useful priors to accelerate learning in more complex settings and avoid tuning DRL parameters from scratch. To do so, partial policies are represented as logical rules, and online reasoning is performed to guide the training process through two mechanisms: (i) biasing the action distribution during exploration, and (ii) rescaling Q-values during exploitation. This neuro-symbolic integration enhances interpretability and trustworthiness while accelerating convergence, particularly in sparse-reward environments and tasks with long planning horizons. We empirically validate our methodology on challenging variants of gridworld environments, both in the fully observable and partially observable setting. We show improved performance over a state-of-the-art reward machine baseline.
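The two guidance mechanisms in the abstract, (i) biasing the action distribution during exploration and (ii) rescaling Q-values during exploitation, can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the rule format (a predicate over the state paired with preferred actions), the multiplicative bias, and the `strength` parameter are all assumptions for illustration.

```python
import numpy as np

def rule_bias(rules, state, n_actions, strength=2.0):
    """Multiplicative action bias derived from matching symbolic rules.

    `rules` is a hypothetical list of (condition, preferred_actions) pairs,
    where `condition` is a predicate over the symbolic state.
    """
    bias = np.ones(n_actions)
    for condition, preferred_actions in rules:
        if condition(state):
            for a in preferred_actions:
                bias[a] *= strength  # boost rule-preferred actions
    return bias

def guided_policy(q_values, rules, state, epsilon=0.1, rng=np.random):
    """Epsilon-greedy action selection guided by symbolic rules:
    (i) exploration samples from a rule-biased action distribution;
    (ii) exploitation acts greedily on rule-rescaled Q-values.
    """
    n_actions = len(q_values)
    bias = rule_bias(rules, state, n_actions)
    if rng.rand() < epsilon:
        probs = bias / bias.sum()              # (i) biased exploration
        return int(rng.choice(n_actions, p=probs))
    rescaled = np.asarray(q_values) * bias     # (ii) Q-value rescaling
    return int(np.argmax(rescaled))            # note: assumes non-negative Q
```

For example, a rule firing in a "near goal" symbolic state can flip the greedy choice toward the rule-preferred action even when its raw Q-value is slightly lower, which is one plausible way prior knowledge could accelerate convergence in sparse-reward settings.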
Problem

Research questions and friction points this paper is trying to address.

sample efficiency
generalization
deep reinforcement learning
sparse-reward environments
long planning horizons
Innovation

Methods, ideas, or system contributions that make the work stand out.

Neurosymbolic Reinforcement Learning
Sample Efficiency
Symbolic Knowledge Integration
Policy Transfer
Interpretable DRL