🤖 AI Summary
This work addresses the challenge of grounding language in complex perceptual (e.g., pixel-based) and action spaces. Specifically, it aims for efficient, generalizable mapping from language to embodied action without relying on handcrafted grounding modules or large-scale datasets pairing language with the environment. The authors propose Ground-Compose-Reinforce, a neurosymbolic reinforcement learning framework that grounds a formal language from data and tasks RL agents directly through that language, avoiding manually designed reward functions and symbol detectors. Because the language has compositional formal semantics, the framework supports data-efficient grounding and generalization to new language compositions. It is evaluated on an image-based gridworld and a MuJoCo robotics domain: with limited data, it reliably maps formal language instructions to behaviours, including compositions not seen during training, where purely end-to-end, data-driven baselines fail.
📝 Abstract
Grounding language in complex perception (e.g., pixels) and action is a key challenge when building situated agents that can interact with humans via language. In past work, this has often been solved by manually designing the language grounding or by curating massive datasets relating language to elements of the environment. We propose Ground-Compose-Reinforce, a neurosymbolic framework for grounding formal language from data and eliciting behaviours by directly tasking RL agents through this language. By virtue of data-driven learning, our framework avoids the manual design of domain-specific elements like reward functions or symbol detectors. By virtue of compositional formal language semantics, our framework achieves data-efficient grounding and generalization to arbitrary language compositions. Experiments on an image-based gridworld and a MuJoCo robotics domain show that our approach reliably maps formal language instructions to behaviours with limited data, while end-to-end, data-driven approaches fail.
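To make the idea of compositional semantics concrete, here is a minimal sketch of how groundings for a few atomic symbols could be reused to evaluate any formula built from them. It is an illustration under stated assumptions, not the paper's implementation: the Boolean formula language, the `holds` and `reward` helpers, the symbol names, and the feature-dictionary observations (standing in for pixels) are all hypothetical, and the hand-written lambdas stand in for groundings that would be learned from data.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Union

# Hypothetical observation type: a dict of features stands in for pixels.
Obs = Dict[str, float]
# A grounding decides whether one atomic symbol is true in an observation.
Grounding = Callable[[Obs], bool]


@dataclass(frozen=True)
class Atom:
    name: str


@dataclass(frozen=True)
class Not:
    arg: "Formula"


@dataclass(frozen=True)
class And:
    left: "Formula"
    right: "Formula"


@dataclass(frozen=True)
class Or:
    left: "Formula"
    right: "Formula"


Formula = Union[Atom, Not, And, Or]


def holds(formula: Formula, obs: Obs, groundings: Dict[str, Grounding]) -> bool:
    """Evaluate a formula recursively; only atoms consult a grounding."""
    if isinstance(formula, Atom):
        return groundings[formula.name](obs)
    if isinstance(formula, Not):
        return not holds(formula.arg, obs, groundings)
    if isinstance(formula, And):
        return holds(formula.left, obs, groundings) and holds(formula.right, obs, groundings)
    if isinstance(formula, Or):
        return holds(formula.left, obs, groundings) or holds(formula.right, obs, groundings)
    raise TypeError(f"unsupported formula node: {formula!r}")


def reward(formula: Formula, obs: Obs, groundings: Dict[str, Grounding]) -> float:
    """Assumed reward signal: 1.0 when the instruction is satisfied, else 0.0."""
    return 1.0 if holds(formula, obs, groundings) else 0.0


# Placeholder groundings; in a data-driven setting these would be learned.
groundings: Dict[str, Grounding] = {
    "at_red": lambda o: o["red_dist"] < 0.5,
    "holding_key": lambda o: o["key_held"] > 0.5,
}

# A composition that needs no grounding of its own.
task = And(Atom("at_red"), Not(Atom("holding_key")))
print(reward(task, {"red_dist": 0.2, "key_held": 0.0}, groundings))
```

In this sketch only the atomic symbols require grounding; every compound instruction inherits its meaning from the formula structure, which is one way to read the abstract's claim about generalization to arbitrary compositions.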