🤖 AI Summary
Current vision-language-action (VLA) models lack high-level symbolic planning capabilities, limiting their effectiveness on long-horizon tasks; conversely, symbolic action model learning (AML) approaches suffer from poor generalization and scalability. To bridge this gap, we propose GraSP-VLA, a novel framework that introduces Continuous Scene Graphs (CSGs) to automatically derive symbolic action representations from visual observations of human demonstrations. During inference, GraSP-VLA dynamically constructs a planning domain grounded in the observed scene, serving as a symbolic orchestrator for neural VLA policies and enabling tight neuro-symbolic integration. Crucially, the framework requires no handcrafted domain knowledge and supports end-to-end, observation-driven planning. Experiments show that GraSP-VLA effectively generates planning domains automatically from observations, and results on real robotic platforms demonstrate the potential of the CSG representation to orchestrate low-level VLA policies across longer action sequences in long-horizon tasks.
📝 Abstract
Deploying autonomous robots that can learn new skills from demonstrations is an important challenge in modern robotics. Existing solutions typically apply either end-to-end imitation learning with Vision-Language-Action (VLA) models or symbolic approaches based on Action Model Learning (AML). On the one hand, current VLA models lack high-level symbolic planning, which hinders their performance on long-horizon tasks. On the other hand, symbolic AML approaches lack generalization and scalability. In this paper, we present GraSP-VLA, a new neuro-symbolic framework that uses a Continuous Scene Graph representation to generate a symbolic representation of human demonstrations. This representation is used to generate new planning domains during inference and serves as an orchestrator for low-level VLA policies, scaling up the number of actions that can be executed in sequence. Our results show that GraSP-VLA effectively models symbolic representations for the task of automatic planning-domain generation from observations. In addition, results from real-world experiments show the potential of our Continuous Scene Graph representation to orchestrate low-level VLA policies in long-horizon tasks.
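To make the pipeline in the abstract concrete, here is a minimal, purely illustrative sketch of the idea behind a Continuous Scene Graph: track object relations frame by frame and turn relation deltas between consecutive frames into PDDL-style action effects. All class, predicate, and method names here are hypothetical assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch: a scene graph tracked over time, whose relation
# changes between frames are read off as symbolic action effects.
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relation:
    """One edge of the scene graph, e.g. (cup, on, table)."""
    subject: str
    predicate: str
    obj: str


@dataclass
class ContinuousSceneGraph:
    # One set of relations per observed frame of the demonstration.
    frames: list = field(default_factory=list)

    def observe(self, relations):
        """Record the scene graph extracted from one video frame."""
        self.frames.append(set(relations))

    def derive_actions(self):
        """Emit PDDL-like add/delete effects from deltas between consecutive frames."""
        actions = []
        for prev, curr in zip(self.frames, self.frames[1:]):
            added = curr - prev
            removed = prev - curr
            if added or removed:
                effects = [f"(:add {r.predicate} {r.subject} {r.obj})"
                           for r in sorted(added, key=str)]
                effects += [f"(:del {r.predicate} {r.subject} {r.obj})"
                            for r in sorted(removed, key=str)]
                actions.append(effects)
        return actions


# Toy demonstration: the robot picks a cup up off the table.
csg = ContinuousSceneGraph()
csg.observe([Relation("cup", "on", "table")])
csg.observe([Relation("cup", "in", "gripper")])
print(csg.derive_actions())
# → [['(:add in cup gripper)', '(:del on cup table)']]
```

In a full system, each derived effect list would be matched to a low-level VLA policy, with the generated planning domain deciding the order in which those policies fire.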