🤖 AI Summary
This work proposes the Language Movement Primitives (LMP) framework to bridge the semantic gap between natural language instructions and low-level robot motion control. By integrating vision-language models (VLMs) with Dynamic Movement Primitives (DMPs), LMP maps high-level task reasoning directly into the interpretable, few-parameter DMP space, enabling the generation of continuous, stable, and diverse manipulation trajectories. The approach supports zero-shot generalization, allowing robots to execute tabletop tasks directly from natural language without task-specific training. Evaluated on 20 real-world tasks, LMP achieves an 80% success rate, substantially outperforming the best baseline method (31%) and demonstrating strong effectiveness and generalization capability.
📝 Abstract
Enabling robots to perform novel manipulation tasks from natural language instructions remains a fundamental challenge in robotics, despite significant progress in generalized problem solving with foundation models. Large vision-language models (VLMs) can process high-dimensional input data for visual scene and language understanding, and can decompose tasks into a sequence of logical steps; however, they struggle to ground those steps in embodied robot motion. Robotics foundation models, on the other hand, output action commands, but require in-domain fine-tuning or experience before they can perform novel tasks successfully. At its core, the challenge is connecting abstract task reasoning with low-level motion control. To address this disconnect, we propose Language Movement Primitives (LMPs), a framework that grounds VLM reasoning in Dynamic Movement Primitive (DMP) parameterization. Our key insight is that DMPs provide a small number of interpretable parameters, and VLMs can set these parameters to specify diverse, continuous, and stable trajectories. Put another way: VLMs can reason over free-form natural language task descriptions and semantically ground their desired motions into DMPs -- bridging the gap between high-level task reasoning and low-level position and velocity control. Building on this combination of VLMs and DMPs, we formulate our LMP pipeline for zero-shot robot manipulation, which completes tabletop manipulation problems by generating a sequence of DMP motions. Across 20 real-world manipulation tasks, we show that LMP achieves 80% task success as compared to 31% for the best-performing baseline. See videos at our website: https://collab.me.vt.edu/lmp
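To make the "small number of interpretable parameters" concrete, below is a minimal sketch of a standard one-dimensional discrete DMP of the kind the abstract describes: a stable spring-damper system pulled toward a goal, shaped by a learned forcing term. The function name, parameter names (`alpha`, `beta`, `weights`), and basis setup are illustrative assumptions, not the paper's actual interface.

```python
import numpy as np

def dmp_rollout(start, goal, weights, n_steps=200, dt=0.01,
                alpha=25.0, beta=6.25, alpha_x=3.0):
    """Integrate a 1-D discrete DMP and return the position trajectory.

    The trajectory is shaped by `weights` on Gaussian basis functions,
    but the underlying spring-damper dynamics guarantee convergence to
    `goal` -- the stability property the abstract highlights.
    """
    y, dy = float(start), 0.0
    x = 1.0  # canonical phase variable, decays from 1 toward 0
    centers = np.linspace(0.0, 1.0, len(weights))  # basis centers in phase
    h = 20.0 * len(weights)                        # basis widths
    traj = []
    for _ in range(n_steps):
        psi = np.exp(-h * (x - centers) ** 2)          # basis activations
        # Forcing term: weighted basis mix, gated by phase and scaled by span
        f = (psi @ weights) / (psi.sum() + 1e-10) * x * (goal - start)
        ddy = alpha * (beta * (goal - y) - dy) + f      # transformation system
        dy += ddy * dt
        y += dy * dt
        x += -alpha_x * x * dt                          # canonical system
        traj.append(y)
    return np.array(traj)
```

With all weights set to zero, the forcing term vanishes and the rollout is a smooth, critically damped reach from `start` to `goal`; nonzero weights bend the path while preserving convergence. In the LMP framing, the VLM's job is to choose values like these weights and goals from a natural language instruction.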