AI Summary
Existing vision-language-action (VLA) models struggle to learn reusable, structured primitives, often requiring large amounts of data and exhibiting limited exploration capabilities. This work proposes a novel framework that integrates a symbolic encoder with a symbolic solver, marking the first application of neuro-symbolic methods to VLA. The symbolic encoder extracts structured primitives from visual and linguistic inputs, while the symbolic solver efficiently generates action sequences. To further enhance exploration, the framework incorporates online reinforcement learning, expanding the agent's search space. This approach substantially improves data efficiency, one-shot learning performance, robustness to perturbations, and zero-shot generalization. Empirical evaluations on robotic manipulation tasks demonstrate clear superiority over current state-of-the-art methods.
Abstract
Vision-Language-Action (VLA) models are formulated to ground instructions in visual context and generate action sequences for robotic manipulation. Despite recent progress, VLA models still face challenges in learning related and reusable primitives, reducing reliance on large-scale data and complex architectures, and enabling exploration beyond demonstrations. To address these challenges, we propose a novel Neuro-Symbolic Vision-Language-Action (NS-VLA) framework trained via online reinforcement learning (RL). It introduces a symbolic encoder that embeds vision and language features and extracts structured primitives, utilizes a symbolic solver for data-efficient action sequencing, and leverages online RL to optimize generation through expansive exploration. Experiments on robotic manipulation benchmarks demonstrate that NS-VLA outperforms previous methods in both one-shot training and data-perturbed settings, while simultaneously exhibiting superior zero-shot generalizability, high data efficiency, and an expanded exploration space. Our code is available.
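The pipeline the abstract describes (symbolic encoder → symbolic solver → action sequence) can be sketched as below. This is a minimal illustrative sketch only: the paper's actual encoder, primitive representation, and solver are not specified here, so every function and the `(verb, object)` primitive format are assumptions, not the authors' API.

```python
# Hypothetical sketch of the NS-VLA pipeline stages described in the abstract.
# All names and the (verb, object) primitive format are illustrative assumptions.

def symbolic_encoder(instruction: str) -> list[tuple[str, str]]:
    """Extract structured primitives from a language instruction.

    Assumed toy form: consecutive (verb, object) token pairs. The real
    encoder would also ground primitives in visual features.
    """
    tokens = instruction.lower().split()
    return [(tokens[i], tokens[i + 1]) for i in range(0, len(tokens) - 1, 2)]

def symbolic_solver(primitives: list[tuple[str, str]]) -> list[str]:
    """Expand each primitive into a short, reusable action sequence.

    Assumed toy mapping: every primitive first reaches its object,
    then applies the verb to it.
    """
    plan = []
    for verb, obj in primitives:
        plan.append(f"reach({obj})")
        plan.append(f"{verb}({obj})")
    return plan

plan = symbolic_solver(symbolic_encoder("pick cube place tray"))
print(plan)  # ['reach(cube)', 'pick(cube)', 'reach(tray)', 'place(tray)']
```

In the full framework, the solver's output would then be refined with online RL, letting the agent explore action sequences beyond those seen in demonstrations.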