KUDA: Keypoints to Unify Dynamics Learning and Visual Prompting for Open-Vocabulary Robotic Manipulation

📅 2025-03-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing open-vocabulary robotic systems exhibit poor generalization in complex dynamic tasks—such as deformable object and granular material manipulation—primarily because they neglect explicit physical dynamics modeling. To address this, the authors propose an end-to-end manipulation framework that unifies vision-language understanding with physics-aware planning. The approach introduces a keypoint-based intermediate representation as a semantic-physical bridge, enabling joint vision-language model (VLM) parsing and learned neural dynamics modeling. It establishes a closed-loop pipeline: "language → keypoints → dynamics cost function → MPC trajectory planning." The method integrates multimodal VLMs, learned neural dynamics models, keypoint-based visual prompting, and optimization-based model predictive control (MPC). Evaluated on free-form language instructions, multi-object interactions, and non-rigid object manipulation, it achieves significant improvements in cross-scenario generalization and task success rates, establishing a new paradigm for joint semantic-dynamic modeling in open-vocabulary robotics.
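The "language → keypoints" stage can be pictured as visual prompting: keypoints are annotated on the RGB image with index labels, and the VLM is asked to return target positions for the relevant indices. The paper does not publish its exact response grammar, so the format below (`k<i> -> (x, y)`) and the parser are purely illustrative assumptions, not KUDA's actual interface:

```python
import re

def parse_target_spec(vlm_response):
    """Parse a hypothetical VLM target specification.

    Assumed response format (illustrative only): the VLM references
    annotated keypoint indices and gives pixel targets for each,
    e.g. "k1 -> (120, 340); k3 -> (200, 180)".
    Returns a dict mapping keypoint index -> (x, y) target.
    """
    spec = {}
    for idx, x, y in re.findall(r"k(\d+)\s*->\s*\((\d+)\s*,\s*(\d+)\)", vlm_response):
        spec[int(idx)] = (int(x), int(y))
    return spec

# example: two keypoints are assigned targets, others are unconstrained
spec = parse_target_spec("k1 -> (120, 340); k3 -> (200, 180)")
```

The key property this illustrates is that the specification is sparse and symbolic (a few indexed points), so it is both easy for a VLM to emit and trivial to turn into a geometric cost for planning.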

📝 Abstract
With the rapid advancement of large language models (LLMs) and vision-language models (VLMs), significant progress has been made in developing open-vocabulary robotic manipulation systems. However, many existing approaches overlook the importance of object dynamics, limiting their applicability to more complex, dynamic tasks. In this work, we introduce KUDA, an open-vocabulary manipulation system that integrates dynamics learning and visual prompting through keypoints, leveraging both VLMs and learning-based neural dynamics models. Our key insight is that a keypoint-based target specification is simultaneously interpretable by VLMs and can be efficiently translated into cost functions for model-based planning. Given language instructions and visual observations, KUDA first assigns keypoints to the RGB image and queries the VLM to generate target specifications. These abstract keypoint-based representations are then converted into cost functions, which are optimized using a learned dynamics model to produce robotic trajectories. We evaluate KUDA on a range of manipulation tasks, including free-form language instructions across diverse object categories, multi-object interactions, and deformable or granular objects, demonstrating the effectiveness of our framework. The project page is available at http://kuda-dynamics.github.io.
Problem

Research questions and friction points this paper is trying to address.

Existing open-vocabulary manipulation systems overlook object dynamics, limiting their applicability to complex, dynamic tasks.
Language instructions and visual observations lack a shared representation that is interpretable by VLMs and usable for model-based planning.
Tasks involving deformable and granular objects require explicit dynamics modeling that prior open-vocabulary systems do not provide.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates dynamics learning and visual prompting
Uses keypoints for VLM interpretability and planning
Converts keypoints to cost functions for trajectories
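The "keypoints → cost function → trajectory" step can be sketched as follows. The cost is the squared distance between predicted and target keypoint positions, and a planner searches action sequences under a dynamics model to minimize it. The toy translation dynamics and random-shooting planner below are simplified stand-ins: KUDA uses a learned neural dynamics model and a more capable MPC optimizer.

```python
import numpy as np

def keypoint_cost(pred_keypoints, target_keypoints):
    # squared-distance cost between predicted and target keypoint positions
    return float(np.sum((pred_keypoints - target_keypoints) ** 2))

def rollout(dynamics, keypoints, actions):
    # roll the dynamics model forward over an action sequence
    for a in actions:
        keypoints = dynamics(keypoints, a)
    return keypoints

def random_shooting_mpc(dynamics, keypoints, target, horizon=5, n_samples=256, seed=0):
    # sample candidate action sequences; keep the one with the lowest keypoint cost
    rng = np.random.default_rng(seed)
    best_cost, best_actions = np.inf, None
    for _ in range(n_samples):
        actions = rng.uniform(-1.0, 1.0, size=(horizon, 2))
        final = rollout(dynamics, keypoints, actions)
        c = keypoint_cost(final, target)
        if c < best_cost:
            best_cost, best_actions = c, actions
    return best_actions, best_cost

def toy_dynamics(keypoints, action):
    # stand-in for the learned model: each 2-D action translates all keypoints
    return keypoints + action

start = np.zeros((3, 2))           # three keypoints at the origin
target = np.full((3, 2), 2.0)      # target specification: move them to (2, 2)
actions, cost = random_shooting_mpc(toy_dynamics, start, target)
```

Because the cost is defined only on keypoint positions, the same planning loop applies unchanged whether the keypoints track a rigid object, a deformable one, or a pile of granular material; only the dynamics model differs.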