RoboPaint: From Human Demonstration to Any Robot and Any View

📅 2026-02-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the scarcity of large-scale, high-fidelity robotic demonstration data that limits the scalability of vision–language–action (VLA) models for dexterous manipulation. To overcome this challenge, the authors propose a Real-Sim-Real pipeline that leverages multimodal human demonstrations—comprising RGB-D video, glove-based joint angles, and tactile signals—and introduces a tactile-aware retargeting method. This method combines geometric and force-guided optimization to efficiently map human hand motions onto arbitrary dexterous hands. The resulting trajectories are used to generate cross-robot, multi-view, high-fidelity simulation data in Isaac Sim, eliminating the need for real-world teleoperation. A Pi0.5 VLA policy trained on this synthetic data achieves an average success rate of 80% across three representative tasks, while the retargeted trajectories attain an 84% success rate across ten dexterous manipulation tasks, demonstrating the approach’s effectiveness and strong generalization capability.

📝 Abstract
Acquiring large-scale, high-fidelity robot demonstration data remains a critical bottleneck for scaling Vision-Language-Action (VLA) models in dexterous manipulation. We propose a Real-Sim-Real data collection and data editing pipeline that transforms human demonstrations into robot-executable, environment-specific training data without direct robot teleoperation. Standardized data collection rooms are built to capture multimodal human demonstrations (three synchronized RGB-D videos, eleven RGB videos, 29-DoF glove joint angles, and 14-channel tactile signals). Based on these human demonstrations, we introduce a tactile-aware retargeting method that maps human hand states to robot dex-hand states via geometry- and force-guided optimization. The retargeted robot trajectories are then rendered in a photorealistic Isaac Sim environment to build robot training data. Real-world experiments demonstrate that: (1) the retargeted dex-hand trajectories achieve an 84% success rate across 10 diverse object manipulation tasks; (2) VLA policies (Pi0.5) trained exclusively on our generated data achieve an 80% average success rate on three representative tasks, i.e., pick-and-place, pushing, and pouring. In conclusion, robot training data can be efficiently "painted" from human demonstrations using our Real-Sim-Real data pipeline. We offer a scalable, cost-effective alternative to teleoperation with minimal performance loss for complex dexterous manipulation.
Problem

Research questions and friction points this paper is trying to address.

robot demonstration data
Vision-Language-Action models
dexterous manipulation
data bottleneck
teleoperation
Innovation

Methods, ideas, or system contributions that make the work stand out.

tactile-aware retargeting
Real-Sim-Real pipeline
Vision-Language-Action models
dexterous manipulation
human-to-robot demonstration transfer
Jiacheng Fan
Paxini Tech.
Zhiyue Zhao
Zhejiang University
Yiqian Zhang
Paxini Tech.
Chao Chen
Paxini Tech.
Peide Wang
Paxini Tech.
Hengdi Zhang
Paxini Tech.
Zhengxue Cheng
Assistant Researcher, Shanghai Jiao Tong University
Video and Image Coding · Computer Vision · Image Quality Assessment