NS-VLA: Towards Neuro-Symbolic Vision-Language-Action Models

📅 2026-03-10
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing vision-language-action (VLA) models struggle to learn reusable, structured primitives, often requiring large amounts of data and exhibiting limited exploration capabilities. This work proposes a novel framework that integrates a symbolic encoder with a symbolic solver, marking the first application of neuro-symbolic methods to VLA. The symbolic encoder extracts structured primitives from visual and linguistic inputs, while the symbolic solver efficiently generates action sequences. To further enhance exploration, the framework incorporates online reinforcement learning, expanding the agent’s search space. This approach substantially improves data efficiency, one-shot learning performance, robustness to perturbations, and zero-shot generalization. Empirical evaluations on robotic manipulation tasks demonstrate clear superiority over current state-of-the-art methods.

πŸ“ Abstract
Vision-Language-Action (VLA) models are formulated to ground instructions in visual context and generate action sequences for robotic manipulation. Despite recent progress, VLA models still face challenges in learning related, reusable primitives, in reducing reliance on large-scale data and complex architectures, and in enabling exploration beyond demonstrations. To address these challenges, we propose a novel Neuro-Symbolic Vision-Language-Action (NS-VLA) framework trained via online reinforcement learning (RL). It introduces a symbolic encoder to embed vision and language features and extract structured primitives, utilizes a symbolic solver for data-efficient action sequencing, and leverages online RL to optimize generation via expansive exploration. Experiments on robotic manipulation benchmarks demonstrate that NS-VLA outperforms previous methods in both one-shot training and data-perturbed settings, while simultaneously exhibiting superior zero-shot generalizability, high data efficiency, and an expanded exploration space. Our code is available.
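The abstract describes a three-stage pipeline: a symbolic encoder grounds the instruction and scene into structured primitives, a symbolic solver sequences those primitives into an action plan, and online RL refines the solver's choices. The toy sketch below illustrates that control flow only; every class and function name (`SymbolicEncoder`, `SymbolicSolver`, `online_rl_step`) is a hypothetical stand-in, not the paper's actual API, and the grounding and update rules are deliberately simplistic.

```python
# Illustrative sketch of a neuro-symbolic VLA control loop, assuming a
# symbolic encoder -> symbolic solver -> online RL update pipeline.
# All names and logic here are hypothetical, not taken from the paper.
from dataclasses import dataclass


@dataclass(frozen=True)
class Primitive:
    name: str    # structured primitive, e.g. "pick" or "place"
    target: str  # object the primitive acts on


class SymbolicEncoder:
    """Maps a (language, scene) observation to structured primitives."""

    VERBS = {"pick", "place", "push"}

    def encode(self, instruction: str, visible_objects: list[str]) -> list[Primitive]:
        # Toy grounding: pair each verb in the instruction with each
        # visible object that the instruction mentions.
        verbs = [w for w in instruction.lower().split() if w in self.VERBS]
        return [Primitive(v, o) for v in verbs
                for o in visible_objects if o in instruction]


class SymbolicSolver:
    """Sequences primitives into a plan, guided by learned scores."""

    def __init__(self) -> None:
        self.scores: dict[Primitive, float] = {}

    def plan(self, primitives: list[Primitive]) -> list[Primitive]:
        # Prefer primitives with higher learned scores; ties keep encoder
        # order, so an untrained solver still yields a usable one-shot plan.
        return sorted(primitives, key=lambda p: -self.scores.get(p, 0.0))


def online_rl_step(solver: SymbolicSolver, plan: list[Primitive],
                   reward: float, lr: float = 0.5) -> None:
    """Toy online update: move each executed primitive's score toward the reward."""
    for p in plan:
        old = solver.scores.get(p, 0.0)
        solver.scores[p] = old + lr * (reward - old)


encoder, solver = SymbolicEncoder(), SymbolicSolver()
prims = encoder.encode("pick cube then place cube", ["cube", "tray"])
plan = solver.plan(prims)           # [pick(cube), place(cube)]
online_rl_step(solver, plan, reward=1.0)
```

The split mirrors the claimed benefits: the encoder and solver carry structure (reusable primitives, data-efficient sequencing), while the RL update is the only learned component that needs environment interaction.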
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action
reusable primitives
data efficiency
exploration beyond demonstrations
robotic manipulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Neuro-Symbolic
Vision-Language-Action
Symbolic Encoder
Online Reinforcement Learning
Data Efficiency
Ziyue Zhu
Beijing University of Posts and Telecommunications
Shangyang Wu
Beijing University of Posts and Telecommunications
Shuai Zhao
Beijing University of Posts and Telecommunications
Zhiqiu Zhao
Beijing University of Posts and Telecommunications
Shengjie Li
Beijing University of Posts and Telecommunications
Yi Wang
Nanyang Technological University
Fang Li
Nanyang Technological University
Haoran Luo
Nanyang Technological University
Knowledge Graph
Large Language Models
Graph Neural Networks