CrayonRobo: Object-Centric Prompt-Driven Vision-Language-Action Model for Robotic Manipulation

📅 2025-05-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Ambiguous language prompts and overly detailed goal images or videos make it difficult to specify task intent clearly in robotic manipulation. Method: This paper proposes an object-centric vision-language-action joint model built around a novel, interpretable 2D visual prompting mechanism: geometric symbols, such as the end-effector pose and motion direction, are overlaid directly onto RGB images to explicitly encode SE(3) contact relationships and high-level planning intent. The model employs a multimodal fusion architecture, an SE(3)-aware action prediction head, and a key-frame-sequenced execution strategy, and supports both human-provided and automatically generated prompts. Contribution/Results: Evaluated in simulation and on real robotic platforms, the approach significantly improves cross-task generalization and long-horizon robustness; multi-step dexterous manipulation success rates increase by 27% over baseline methods.
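To make the prompting mechanism concrete, here is a minimal, hypothetical sketch of how such a 2D visual prompt might be rendered onto an RGB observation: a 3D contact point and a post-contact movement direction are projected into the image with known camera intrinsics and drawn as a marker and an arrow. The function names, colors, and OpenCV-based drawing are assumptions for illustration, not the authors' implementation.

```python
import numpy as np
import cv2

def project_point(point_cam, K):
    """Project a 3D point in camera coordinates onto the image plane."""
    uvw = K @ point_cam
    return int(uvw[0] / uvw[2]), int(uvw[1] / uvw[2])

def draw_visual_prompt(rgb, contact_cam, move_dir_cam, K, arrow_len=0.1):
    """Overlay a contact-point marker and a movement-direction arrow (sketch)."""
    img = rgb.copy()
    start = project_point(contact_cam, K)
    end = project_point(contact_cam + arrow_len * move_dir_cam, K)
    cv2.circle(img, start, 6, (0, 0, 255), -1)        # contact point
    cv2.arrowedLine(img, start, end, (0, 255, 0), 3)  # desired movement direction
    return img
```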

📝 Abstract
In robotics, task goals can be conveyed through various modalities, such as language, goal images, and goal videos. However, natural language can be ambiguous, while images or videos may offer overly detailed specifications. To tackle these challenges, we introduce CrayonRobo that leverages comprehensive multi-modal prompts that explicitly convey both low-level actions and high-level planning in a simple manner. Specifically, for each key-frame in the task sequence, our method allows for manual or automatic generation of simple and expressive 2D visual prompts overlaid on RGB images. These prompts represent the required task goals, such as the end-effector pose and the desired movement direction after contact. We develop a training strategy that enables the model to interpret these visual-language prompts and predict the corresponding contact poses and movement directions in SE(3) space. Furthermore, by sequentially executing all key-frame steps, the model can complete long-horizon tasks. This approach not only helps the model explicitly understand the task objectives but also enhances its robustness on unseen tasks by providing easily interpretable prompts. We evaluate our method in both simulated and real-world environments, demonstrating its robust manipulation capabilities.
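As a rough illustration of the key-frame-sequenced execution described in the abstract, the loop below queries a model for a contact pose and a movement direction at each key frame and executes them in order. The `model`, `robot`, and `prompt` interfaces are hypothetical placeholders standing in for the paper's actual pipeline.

```python
def execute_long_horizon_task(keyframe_prompts, model, robot, push_dist=0.1):
    """Execute a long-horizon task one key-frame step at a time (sketch)."""
    for prompt in keyframe_prompts:
        rgb = robot.get_rgb_observation()
        prompted_rgb = prompt.overlay_on(rgb)  # 2D visual prompt drawn onto the image
        # The model reads the prompted image plus the language instruction and
        # predicts an SE(3) contact pose and a post-contact movement direction.
        contact_pose, move_dir = model.predict(prompted_rgb, prompt.language)
        robot.move_to_pose(contact_pose)            # reach the predicted contact pose
        robot.move_along(move_dir, dist=push_dist)  # move in the predicted direction
```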
Problem

Research questions and friction points this paper is trying to address.

Addresses ambiguity in task goals conveyed via language or images
Develops visual-language prompts for precise robotic manipulation
Enhances robustness in unseen tasks through interpretable prompts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses multi-modal prompts for task goals
Generates 2D visual prompts on RGB images
Trains the model to interpret visual-language prompts and predict contact poses and movement directions in SE(3) (see the sketch after this list)
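The sketch below shows one plausible way a predicted contact point and two predicted gripper direction vectors could be assembled into a 4x4 SE(3) end-effector pose; the exact parameterization used by CrayonRobo may differ, so this is illustrative only.

```python
import numpy as np

def se3_from_prediction(contact_point, approach_dir, in_plane_dir):
    """Assemble a 4x4 homogeneous transform from a contact point and two axes (sketch)."""
    z = approach_dir / np.linalg.norm(approach_dir)
    # Re-orthogonalize the in-plane axis against the approach axis.
    x = in_plane_dir - np.dot(in_plane_dir, z) * z
    x = x / np.linalg.norm(x)
    y = np.cross(z, x)
    T = np.eye(4)
    T[:3, :3] = np.stack([x, y, z], axis=1)  # columns are the gripper frame axes
    T[:3, 3] = contact_point
    return T
```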
👥 Authors
Xiaoqi Li (School of Computer Science, Peking University; PKU-Agibot Lab)
Lingyun Xu (Senior Researcher; Multi-object tracking, SLAM)
Mingxu Zhang (School of Computer Science, Peking University; PKU-Agibot Lab)
Jiaming Liu (School of Computer Science, Peking University; PKU-Agibot Lab)
Yan Shen (School of Computer Science, Peking University; PKU-Agibot Lab)
Iaroslav Ponomarenko (School of Computer Science, Peking University; PKU-Agibot Lab)
Jiahui Xu (ETH Zurich; Electronic Design Automation, Formal Verification)
Liang Heng (School of Computer Science, Peking University; PKU-Agibot Lab)
Siyuan Huang (School of Computer Science, Peking University; PKU-Agibot Lab)
Shanghang Zhang (Peking University; Embodied AI, Foundation Models)
Hao Dong (School of Computer Science, Peking University; PKU-Agibot Lab)