🤖 AI Summary
Current vision-language-action (VLA) models lack high-level symbolic planning capabilities, limiting their effectiveness on long-horizon tasks; conversely, symbolic action model learning (AML) approaches suffer from poor generalization and scalability. To bridge this gap, we propose GraSP-VLA, a novel framework that introduces Continuous Scene Graphs (CSGs) to automatically derive symbolic action representations from visual observations of human demonstrations. During inference, GraSP-VLA dynamically constructs a planning domain grounded in the observed scene, serving as a symbolic orchestrator for neural VLA policies and enabling tight neuro-symbolic integration. Crucially, the framework requires no handcrafted domain knowledge and supports end-to-end, observation-driven planning. Experiments show that GraSP-VLA effectively generates planning domains automatically from observations, and results on real robotic platforms demonstrate the potential of the CSG representation to orchestrate low-level VLA policies across longer action sequences in long-horizon tasks.
📝 Abstract
Deploying autonomous robots that can learn new skills from demonstrations is an important challenge in modern robotics. Existing solutions typically apply either end-to-end imitation learning with Vision-Language-Action (VLA) models or symbolic approaches based on Action Model Learning (AML). On the one hand, current VLA models lack high-level symbolic planning, which hinders their performance on long-horizon tasks. On the other hand, symbolic AML approaches lack generalization and scalability. In this paper, we present GraSP-VLA, a new neuro-symbolic framework that uses a Continuous Scene Graph representation to generate a symbolic representation of human demonstrations. This representation is used to generate new planning domains during inference and serves as an orchestrator for low-level VLA policies, scaling up the number of actions that can be executed in sequence. Our results show that GraSP-VLA effectively models symbolic representations for the task of automatic planning-domain generation from observations. In addition, results from real-world experiments show the potential of our Continuous Scene Graph representation to orchestrate low-level VLA policies in long-horizon tasks.
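To make the pipeline in the abstract concrete, here is a minimal, purely illustrative sketch of the idea behind a Continuous Scene Graph: track object relations frame by frame and turn relation deltas between consecutive frames into PDDL-style action effects. All class, predicate, and method names here are hypothetical assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch: a scene graph tracked over time, whose relation
# changes between frames are read off as symbolic action effects.
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relation:
    """One edge of the scene graph, e.g. (cup, on, table)."""
    subject: str
    predicate: str
    obj: str


@dataclass
class ContinuousSceneGraph:
    # One set of relations per observed frame of the demonstration.
    frames: list = field(default_factory=list)

    def observe(self, relations):
        """Record the scene graph extracted from one video frame."""
        self.frames.append(set(relations))

    def derive_actions(self):
        """Emit PDDL-like add/delete effects from deltas between consecutive frames."""
        actions = []
        for prev, curr in zip(self.frames, self.frames[1:]):
            added = curr - prev
            removed = prev - curr
            if added or removed:
                effects = [f"(:add {r.predicate} {r.subject} {r.obj})"
                           for r in sorted(added, key=str)]
                effects += [f"(:del {r.predicate} {r.subject} {r.obj})"
                            for r in sorted(removed, key=str)]
                actions.append(effects)
        return actions


# Toy demonstration: the robot picks a cup up off the table.
csg = ContinuousSceneGraph()
csg.observe([Relation("cup", "on", "table")])
csg.observe([Relation("cup", "in", "gripper")])
print(csg.derive_actions())
# → [['(:add in cup gripper)', '(:del on cup table)']]
```

In a full system, each derived effect list would be matched to a low-level VLA policy, with the generated planning domain deciding the order in which those policies fire.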