🤖 AI Summary
This work addresses the challenge of enabling robots to learn long-horizon manipulation tasks from a small number of unlabeled demonstrations. It proposes an end-to-end neuro-symbolic imitation learning framework that requires no manual domain engineering. The method uses a vision-language model to automatically segment demonstrations into skills, infer high-level symbolic states, and construct a state-transition graph from as few as 1–30 unlabeled demonstrations. From these symbolic abstractions, an Answer Set Programming solver synthesizes a PDDL planning domain, which in turn guides data-efficient policy learning. Task-specific pruning of the observation and action spaces, combined with cross-object data augmentation, further improves generalization. Experiments on a real industrial forklift and a Kinova Gen3 robotic arm show that the system completes complex tasks from minimal demonstrations, with high data efficiency, strong generalization, and practical deployability.
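To make the symbolic stage of the summary concrete, here is a minimal sketch of how segmented, VLM-labeled demonstrations could be turned into a state-transition graph and then into PDDL-style operators. All names (`Segment`, `build_graph`, `to_pddl_actions`) and the intersection-based precondition/effect extraction are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: from skill segments (whose start/end states a VLM has
# already mapped to symbolic facts) build a transition graph, then read off
# one PDDL-style operator per skill label.
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    skill: str      # VLM-assigned skill label, e.g. "pick"
    pre: frozenset  # symbolic facts inferred to hold before the skill
    post: frozenset  # symbolic facts inferred to hold after the skill


def build_graph(segments):
    """Group observed (pre-state -> post-state) edges by skill label."""
    graph = defaultdict(list)
    for s in segments:
        graph[s.skill].append((s.pre, s.post))
    return graph


def to_pddl_actions(graph):
    """One operator per skill: preconditions are facts shared by every observed
    pre-state; add/delete effects are facts gained/lost in every transition."""
    actions = {}
    for skill, edges in graph.items():
        actions[skill] = {
            "pre": frozenset.intersection(*(p for p, _ in edges)),
            "add": frozenset.intersection(*(q - p for p, q in edges)),
            "del": frozenset.intersection(*(p - q for p, q in edges)),
        }
    return actions


# Toy two-skill demonstration (facts are plain strings for illustration).
demos = [
    Segment("pick", frozenset({"hand_empty", "on_table(a)"}),
            frozenset({"holding(a)"})),
    Segment("place", frozenset({"holding(a)"}),
            frozenset({"hand_empty", "on_table(a)"})),
]
actions = to_pddl_actions(build_graph(demos))
```

In the actual system this synthesis is delegated to an Answer Set Programming solver, which can enforce global consistency across all demonstrations rather than intersecting effects per skill as this toy version does.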
📝 Abstract
Enabling robots to learn long-horizon manipulation tasks from a handful of demonstrations remains a central challenge in robotics. Existing neuro-symbolic approaches often rely on hand-crafted symbolic abstractions, semantically labeled trajectories, or large demonstration datasets, limiting their scalability and real-world applicability. We present a scalable neuro-symbolic framework that autonomously constructs symbolic planning domains and data-efficient control policies from as few as one to thirty unannotated skill demonstrations, without requiring manual domain engineering. Our method segments demonstrations into skills and employs a Vision-Language Model (VLM) to classify skills and identify equivalent high-level states, enabling automatic construction of a state-transition graph. This graph is processed by an Answer Set Programming solver to synthesize a PDDL planning domain, which an oracle function exploits to isolate the minimal, task-relevant, target-relative observation and action spaces for each skill policy. Policies are learned at the control-reference level rather than at the raw actuator-signal level, yielding a smoother and less noisy learning target. Known controllers can be leveraged for real-world data augmentation by projecting a single demonstration onto other objects in the scene, simultaneously enriching the graph construction process and the imitation-learning dataset. We validate our framework primarily on a real industrial forklift across statistically rigorous manipulation trials, and demonstrate cross-platform generality on a Kinova Gen3 robotic arm across two standard benchmarks. Our results show that combining grounded control learning, VLM-driven abstraction, and automated planning synthesis in a unified pipeline constitutes a practical path toward scalable, data-efficient, expert-free, and interpretable neuro-symbolic robotics.
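The abstract's oracle-driven pruning can be illustrated with a small sketch: for a given skill, keep only the observation channels tied to that skill's target object and express them relative to the target, so a single demonstration transfers across objects. The function name, channel naming scheme, and translation-only relativization are assumptions for illustration, not the paper's interface.

```python
# Illustrative sketch (not the paper's code) of oracle-style observation
# pruning: keep only channels relevant to the skill's target object and
# re-express positions in target-relative coordinates (translation only,
# as a minimal assumption; a full pose transform would also handle rotation).
import numpy as np


def prune_and_relativize(obs, target, channels):
    """obs: dict of named observation vectors; returns only the requested
    channels, shifted into the target object's frame."""
    target_pos = obs[f"{target}/pos"]
    return {name: obs[name] - target_pos for name in channels}


# Scene with one relevant and one irrelevant object for a 'pick cube_a' skill.
obs = {
    "ee/pos": np.array([0.5, 0.2, 0.3]),
    "cube_a/pos": np.array([0.4, 0.1, 0.0]),
    "cube_b/pos": np.array([0.9, 0.9, 0.0]),  # pruned: irrelevant to the skill
}
pruned = prune_and_relativize(obs, "cube_a", ["ee/pos", "cube_a/pos"])
```

The same relativization is what makes the cross-object augmentation in the abstract plausible: a demonstration recorded against `cube_a` yields identical target-relative trajectories when projected onto `cube_b`.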