🤖 AI Summary
This work addresses long-horizon robotic manipulation tasks involving articulated objects, partial observability, and geometric constraints, with a focus on learning composable, generalizable high-level behavioral representations from language-annotated demonstrations. We propose a unified framework that integrates semantic understanding based on large language models, vision-language grounding, imitation learning, model-based planning, and neural policy control. Our approach is the first to automatically extract a structured action library, including visually grounded preconditions and effects, directly from multimodal demonstrations, without requiring manually defined symbolic states or prior annotations. The learned representations enable generalization across varying initial states, goals, and environmental perturbations. We validate the framework's effectiveness on diverse, complex object manipulation tasks in both simulation and real-robot settings.
📝 Abstract
We introduce Behavior from Language and Demonstration (BLADE), a framework for long-horizon robotic manipulation that integrates imitation learning and model-based planning. BLADE leverages language-annotated demonstrations, extracts abstract action knowledge from large language models (LLMs), and constructs a library of structured, high-level action representations. For each high-level action, these representations include preconditions and effects grounded in visual perception, along with a corresponding controller implemented as a neural network policy. BLADE recovers such structured representations automatically, without manually labeled states or symbolic definitions, and generalizes to novel situations, including novel initial states, external state perturbations, and novel goals. We validate the effectiveness of our approach both in simulation and on real robots, on tasks involving a diverse set of objects with articulated parts, partial observability, and geometric constraints.
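To make the structure of such an action library concrete, the following minimal Python sketch illustrates one way these pieces could fit together. It is not the authors' implementation: `GroundedPredicate`, `HighLevelAction`, `execute`, and the type aliases are hypothetical names, and plain callables stand in for the learned visual classifiers and neural policies described above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Placeholder types for illustration: an observation could bundle RGB-D images
# and proprioception; an action is a low-level robot command.
Observation = Dict[str, object]
Action = List[float]


@dataclass
class GroundedPredicate:
    """A symbolic predicate (e.g. is_open(drawer)) whose truth value is
    estimated from raw observations by a learned visual classifier."""
    name: str
    classifier: Callable[[Observation], float]  # probability the predicate holds

    def holds(self, obs: Observation, threshold: float = 0.5) -> bool:
        return self.classifier(obs) >= threshold


@dataclass
class HighLevelAction:
    """One entry in the action library: visually grounded preconditions and
    effects, plus a policy that produces low-level commands."""
    name: str
    preconditions: List[GroundedPredicate]
    effects: List[GroundedPredicate]
    policy: Callable[[Observation], Action]  # stands in for a neural controller

    def applicable(self, obs: Observation) -> bool:
        return all(p.holds(obs) for p in self.preconditions)

    def achieved(self, obs: Observation) -> bool:
        return all(e.holds(obs) for e in self.effects)


def execute(action: HighLevelAction,
            obs: Observation,
            step_env: Callable[[Action], Observation],
            max_steps: int = 100) -> Tuple[Observation, bool]:
    """Closed-loop execution of one high-level action: check the grounded
    preconditions, then run the policy until the grounded effects are observed
    or the step budget runs out, so a planner can detect failure and replan."""
    if not action.applicable(obs):
        return obs, False
    for _ in range(max_steps):
        obs = step_env(action.policy(obs))
        if action.achieved(obs):
            return obs, True
    return obs, False
```

In this reading, planning operates over the symbolic preconditions and effects, while the grounded classifiers tie those symbols to perception and the per-action policies handle low-level control.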