Build on Priors: Vision-Language-Guided Neuro-Symbolic Imitation Learning for Data-Efficient Real-World Robot Manipulation

📅 2026-04-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of enabling robots to learn long-horizon manipulation tasks from a small number of unlabeled demonstrations. It proposes the first end-to-end neuro-symbolic imitation learning framework that requires no manual domain engineering. The method leverages a vision-language model to automatically segment skills, infer high-level states, and construct state-transition graphs from 1–30 unlabeled demonstrations. These symbolic abstractions are then used to generate PDDL planning domains via Answer Set Programming, which in turn guide data-efficient policy learning. Task-specific observation- and action-space pruning, combined with cross-object data augmentation, further enhances generalization. Experiments on a real industrial forklift and a Kinova Gen3 robotic arm demonstrate that the system completes complex tasks from minimal demonstrations, exhibiting high data efficiency, strong generalization, and practical deployability.
📝 Abstract
Enabling robots to learn long-horizon manipulation tasks from a handful of demonstrations remains a central challenge in robotics. Existing neuro-symbolic approaches often rely on hand-crafted symbolic abstractions, semantically labeled trajectories, or large demonstration datasets, limiting their scalability and real-world applicability. We present a scalable neuro-symbolic framework that autonomously constructs symbolic planning domains and data-efficient control policies from as few as one to thirty unannotated skill demonstrations, without requiring manual domain engineering. Our method segments demonstrations into skills and employs a Vision-Language Model (VLM) to classify skills and identify equivalent high-level states, enabling automatic construction of a state-transition graph. This graph is processed by an Answer Set Programming solver to synthesize a PDDL planning domain, which an oracle function exploits to isolate the minimal, task-relevant, target-relative observation and action spaces for each skill policy. Policies are learned at the control-reference level rather than at the raw actuator-signal level, yielding a smoother and less noisy learning target. Known controllers can be leveraged for real-world data augmentation by projecting a single demonstration onto other objects in the scene, simultaneously enriching the graph-construction process and the dataset for imitation learning. We validate our framework primarily on a real industrial forklift across statistically rigorous manipulation trials, and demonstrate cross-platform generality on a Kinova Gen3 robotic arm across two standard benchmarks. Our results show that grounding control learning, VLM-driven abstraction, and automated planning synthesis in a unified pipeline constitutes a practical path toward scalable, data-efficient, expert-free, and interpretable neuro-symbolic robotics.
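The core pipeline the abstract describes — turning VLM-labeled skill segments into a state-transition graph and then into a PDDL planning domain — can be illustrated with a minimal sketch. This is not the paper's implementation (which uses an Answer Set Programming solver and VLM-inferred states); the predicate names, data format, and the simple intersection/difference rule for preconditions and effects below are all illustrative assumptions.

```python
from collections import defaultdict

def build_transition_graph(segments):
    """segments: list of (skill, pre_state, post_state) triples, where
    states are frozensets of symbolic predicates (hypothetical format;
    the paper infers states and skill labels with a VLM)."""
    graph = defaultdict(set)
    for skill, pre, post in segments:
        graph[skill].add((pre, post))
    return graph

def to_pddl_actions(graph):
    """Emit one PDDL action per skill: preconditions are predicates
    shared by every observed pre-state; effects are predicates present
    in a post-state but absent from its pre-state."""
    actions = []
    for skill, transitions in sorted(graph.items()):
        precond = frozenset.intersection(*(pre for pre, _ in transitions))
        effects = set()
        for pre, post in transitions:
            effects |= post - pre
        actions.append(
            f"(:action {skill}\n"
            f"  :precondition (and {' '.join(sorted(precond))})\n"
            f"  :effect (and {' '.join(sorted(effects))}))"
        )
    return "\n".join(actions)

# Toy two-skill demonstration with hypothetical predicates:
segments = [
    ("pick", frozenset({"(free gripper)", "(on-table box)"}),
             frozenset({"(holding box)"})),
    ("place", frozenset({"(holding box)"}),
              frozenset({"(free gripper)", "(on-pallet box)"})),
]
print(to_pddl_actions(build_transition_graph(segments)))
```

A real system would additionally have to merge equivalent states across demonstrations and resolve lifted (parameterized) predicates, which is where the paper's ASP-based synthesis does the heavy lifting.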
Problem

Research questions and friction points this paper is trying to address.

data-efficient robot learning
long-horizon manipulation
neuro-symbolic robotics
imitation learning
real-world robot manipulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

neuro-symbolic
vision-language model
data-efficient imitation learning
automated planning synthesis
robot manipulation
Pierrick Lorang
Department of Computer Science, Tufts University, Medford, MA, USA
Johannes Huemer
Complex Dynamical Systems, Austrian Institute of Technology GmbH (AIT), Vienna, Austria
Timothy Duggan
Tufts University
Robotics, VLA, Foundation Models
Kai Goebel
Fragum Global
Prognostics Health Management, Resilient Design, Decision-Making, Hybrid Systems, AI
Patrik Zips
Complex Dynamical Systems, Austrian Institute of Technology GmbH (AIT), Vienna, Austria
Matthias Scheutz
Karol Family Applied Technology Professor of Computer Science, Tufts University
Artificial intelligence, robotics, human-robot interaction, natural language understanding