AI Summary
Existing vision-language-action (VLA) models struggle to learn reusable, structured primitives, often requiring large amounts of data and exhibiting limited exploration capabilities. This work proposes a novel framework that integrates a symbolic encoder with a symbolic solver, marking the first application of neuro-symbolic methods to VLA. The symbolic encoder extracts structured primitives from visual and linguistic inputs, while the symbolic solver efficiently generates action sequences. To further enhance exploration, the framework incorporates online reinforcement learning, expanding the agent's search space. This approach substantially improves data efficiency, one-shot learning performance, robustness to perturbations, and zero-shot generalization. Empirical evaluations on robotic manipulation tasks demonstrate clear superiority over current state-of-the-art methods.
Abstract
Vision-Language-Action (VLA) models are formulated to ground instructions in visual context and generate action sequences for robotic manipulation. Despite recent progress, VLA models still face challenges in learning related and reusable primitives, reducing reliance on large-scale data and complex architectures, and enabling exploration beyond demonstrations. To address these challenges, we propose a novel Neuro-Symbolic Vision-Language-Action (NS-VLA) framework trained via online reinforcement learning (RL). It introduces a symbolic encoder that embeds vision and language features and extracts structured primitives, utilizes a symbolic solver for data-efficient action sequencing, and leverages online RL to optimize generation through expansive exploration. Experiments on robotic manipulation benchmarks demonstrate that NS-VLA outperforms previous methods in both one-shot training and data-perturbed settings, while simultaneously exhibiting superior zero-shot generalizability, high data efficiency, and an expanded exploration space. Our code is available.
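The pipeline the abstract describes (symbolic encoder → symbolic solver → action sequence) can be sketched as below. This is a minimal illustrative sketch only: the paper's actual encoder, primitive representation, and solver are not specified here, so every function and the `(verb, object)` primitive format are assumptions, not the authors' API.

```python
# Hypothetical sketch of the NS-VLA pipeline stages described in the abstract.
# All names and the (verb, object) primitive format are illustrative assumptions.

def symbolic_encoder(instruction: str) -> list[tuple[str, str]]:
    """Extract structured primitives from a language instruction.

    Assumed toy form: consecutive (verb, object) token pairs. The real
    encoder would also ground primitives in visual features.
    """
    tokens = instruction.lower().split()
    return [(tokens[i], tokens[i + 1]) for i in range(0, len(tokens) - 1, 2)]

def symbolic_solver(primitives: list[tuple[str, str]]) -> list[str]:
    """Expand each primitive into a short, reusable action sequence.

    Assumed toy mapping: every primitive first reaches its object,
    then applies the verb to it.
    """
    plan = []
    for verb, obj in primitives:
        plan.append(f"reach({obj})")
        plan.append(f"{verb}({obj})")
    return plan

plan = symbolic_solver(symbolic_encoder("pick cube place tray"))
print(plan)  # ['reach(cube)', 'pick(cube)', 'reach(tray)', 'place(tray)']
```

In the full framework, the solver's output would then be refined with online RL, letting the agent explore action sequences beyond those seen in demonstrations.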