VIP: Vision Instructed Pre-training for Robotic Manipulation

📅 2024-10-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
In dexterous robotic manipulation, task diversity causes policy confusion; natural-language instructions are poorly grounded in robot data, while single-frame visual cues fail to capture dynamic target state changes. Method: We replace text-based task specifications with visual instructions as the core task modality, and introduce sparse point-flow encoding to model fine-grained inter-frame object dynamics. Our end-to-end vision-guided pretraining framework integrates sparse optical flow representations, a cross-frame action prediction network, and real-sim co-pretraining. Contribution/Results: The method significantly improves generalization across unseen tasks. It achieves breakthrough performance on high-difficulty embodied manipulation tasks—e.g., “opening a tightly sealed bottle cap”—and, for the first time, enables end-to-end pretraining and deployment of robotic manipulation policies solely from visual instructions.
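As a rough illustration of the real-sim co-pretraining idea mentioned above, the sketch below mixes real and simulated trajectories in each training batch. The dataset interfaces, mixing ratio, and function name are assumptions for illustration, not the paper's exact recipe.

```python
import random

def co_pretraining_batches(real_trajs, sim_trajs, batch_size=64,
                           real_ratio=0.5, steps=1000):
    """Yield batches mixing real and simulated trajectories.

    real_trajs / sim_trajs are assumed to be lists of trajectory samples;
    the 50/50 mixing ratio is a placeholder, not the paper's setting.
    """
    n_real = int(batch_size * real_ratio)
    for _ in range(steps):
        batch = random.sample(real_trajs, n_real) \
              + random.sample(sim_trajs, batch_size - n_real)
        random.shuffle(batch)
        yield batch
```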

📝 Abstract
The effectiveness of scaling up training data in robotic manipulation is still limited. A primary challenge in manipulation is that the tasks are diverse, and the trained policy becomes confused if the task targets are not specified clearly. Existing works primarily rely on text instructions to describe targets. However, we reveal that current robotic data cannot train policies to understand text instructions effectively, and vision is much more comprehensible. Therefore, we introduce vision instructions to specify targets. A straightforward implementation is training a policy to predict the intermediate actions linking the current observation and a future image. Nevertheless, a single future image does not describe the task target in sufficient detail. To handle this problem, we propose to use sparse point flows to provide more detailed information. Extensive tasks are designed based on real and simulated environments to evaluate the effectiveness of our vision instructed pre-training (VIP) method. The results indicate VIP improves performance on diverse tasks significantly, and the derived policy can complete challenging tasks such as "opening the lid of a tightly sealed bottle".
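To make the "predict the intermediate actions linking the current observation and a future image" idea concrete, here is a minimal PyTorch sketch of a goal-image-conditioned policy. The encoder, dimensions, action-chunk format, and class name are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class VisionInstructedPolicy(nn.Module):
    """Minimal sketch of a goal-image-conditioned policy.

    The policy consumes the current observation and a future (goal) image as
    the vision instruction, and regresses the intermediate action chunk
    linking the two frames. All sizes are illustrative placeholders.
    """

    def __init__(self, action_dim=7, horizon=16, embed_dim=256):
        super().__init__()
        # Shared image encoder (stand-in for any visual backbone).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        # Head mapping (current, goal) features to an action sequence.
        self.head = nn.Sequential(
            nn.Linear(2 * embed_dim, 512), nn.ReLU(),
            nn.Linear(512, horizon * action_dim),
        )
        self.horizon, self.action_dim = horizon, action_dim

    def forward(self, obs, goal):
        # obs, goal: (B, 3, H, W) current observation and future goal image.
        feat = torch.cat([self.encoder(obs), self.encoder(goal)], dim=-1)
        return self.head(feat).view(-1, self.horizon, self.action_dim)


# Usage: predict a 16-step action chunk from an observation/goal pair, then
# supervise it with demonstration actions (placeholder zeros here).
policy = VisionInstructedPolicy()
obs, goal = torch.randn(2, 3, 96, 96), torch.randn(2, 3, 96, 96)
actions = policy(obs, goal)                      # (2, 16, 7)
demo_actions = torch.zeros_like(actions)         # stand-in expert labels
loss = nn.functional.mse_loss(actions, demo_actions)
```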
Problem

Research questions and friction points this paper is trying to address.

Improve robotic manipulation with vision instructions
Address task diversity in robotic policies
Enhance target specification using sparse point flows
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision instructed pre-training
Sparse point flows (see the sketch after this list)
Intermediate action prediction
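The sparse point flow item above can be illustrated with a small sketch: a fixed set of query points is tracked from the current frame to the future (goal) frame, and their displacements are embedded as extra instruction tokens for the policy. The point-tracker source, token dimension, and class names are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

def sparse_point_flow(points_t0, points_t1):
    """Displacements of K tracked points between the current and goal frame.

    points_t0, points_t1: (B, K, 2) pixel coordinates of the same points in
    both frames (e.g. from an off-the-shelf point tracker). Returns (B, K, 4)
    concatenating each start location with its flow vector.
    """
    return torch.cat([points_t0, points_t1 - points_t0], dim=-1)

class PointFlowEncoder(nn.Module):
    """Embeds sparse point flows into instruction tokens for the policy."""

    def __init__(self, embed_dim=256):
        super().__init__()
        self.proj = nn.Linear(4, embed_dim)

    def forward(self, points_t0, points_t1):
        return self.proj(sparse_point_flow(points_t0, points_t1))  # (B, K, D)

# Usage: 32 tracked points per sample become 32 flow tokens.
tokens = PointFlowEncoder()(torch.rand(2, 32, 2), torch.rand(2, 32, 2))
print(tokens.shape)  # torch.Size([2, 32, 256])
```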
Authors

Zhuoling Li (HKU)
Liangliang Ren (CVTE)
Jinrong Yang (CVTE)
Yong Zhao (CVTE)
Xiaoyang Wu (HKU)
Zhenhua Xu (HKU)
Xiang Bai (Huazhong University of Science and Technology)
Hengshuang Zhao (The University of Hong Kong)