🤖 AI Summary
This work addresses the central challenges of dexterous hand control, namely the high cost of real-robot teleoperation data, structural heterogeneity across hand designs, and high-dimensional action spaces, by introducing the Function-Actuator-Aligned Space (FAAS) to enable unified cross-hand control. The authors construct a large-scale robot-centric dataset derived from egocentric human videos and build a portable first-person capture system, establishing a human-in-the-loop data collection and training paradigm. By combining human-to-robot retargeting, masked 3D hand point cloud processing, and vision-language-action (VLA) policy pretraining followed by fine-tuning, the approach substantially reduces reliance on robot demonstrations. Evaluated on two structurally distinct dexterous hands performing complex tool-use tasks, the method achieves an average task progress of 81%, significantly outperforming existing VLA baselines and demonstrating strong spatial, object, and zero-shot cross-hand generalization.
📝 Abstract
Dexterous manipulation remains challenging due to the cost of collecting real-robot teleoperation data, the heterogeneity of hand embodiments, and the high dimensionality of control. We present UniDex, a robot foundation suite that couples a large-scale robot-centric dataset with a unified vision-language-action (VLA) policy and a practical human-data capture setup for universal dexterous hand control. First, we construct UniDex-Dataset, a robot-centric dataset with over 50K trajectories across eight dexterous hands (6--24 DoFs), derived from egocentric human video datasets. To transform human data into robot-executable trajectories, we employ a human-in-the-loop retargeting procedure that aligns fingertip trajectories while preserving plausible hand-object contacts, and we operate on explicit 3D point clouds with human hands masked out to narrow kinematic and visual gaps. Second, we introduce the Function-Actuator-Aligned Space (FAAS), a unified action space that maps functionally similar actuators to shared coordinates, enabling cross-hand transfer. Leveraging FAAS as the action parameterization, we train UniDex-VLA, a 3D VLA policy pretrained on UniDex-Dataset and finetuned with task demonstrations. In addition, we build UniDex-Cap, a simple portable capture setup that records synchronized RGB-D streams and human hand poses and converts them into robot-executable trajectories, enabling human-robot data co-training that reduces reliance on costly robot demonstrations. On challenging tool-use tasks across two different hands, UniDex-VLA achieves 81% average task progress and outperforms prior VLA baselines by a large margin, while exhibiting strong spatial, object, and zero-shot cross-hand generalization. Together, UniDex-Dataset, UniDex-VLA, and UniDex-Cap provide a scalable foundation suite for universal dexterous manipulation.
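To make the FAAS idea concrete, here is a minimal, hypothetical sketch of how "functionally similar actuators map to shared coordinates" might look: each hand declares which functional slot each of its actuators serves, actions are lifted into a fixed-size shared vector, and a different hand projects that vector back onto its own actuators. The slot names, hand layouts, and dimensions below are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical FAAS-style action mapping (illustrative; not the paper's code).
import numpy as np

# Shared functional slots: one coordinate per actuator function (assumed names).
FAAS_SLOTS = ["thumb_flex", "index_flex", "middle_flex",
              "ring_flex", "pinky_flex", "thumb_abduct"]

# Per-hand layout: actuator index -> functional slot (None = no shared function).
HAND_A = ["index_flex", "middle_flex", "thumb_flex", None]            # toy 4-DoF hand
HAND_B = ["thumb_flex", "thumb_abduct", "index_flex", "middle_flex",
          "ring_flex", "pinky_flex"]                                  # toy 6-DoF hand

def to_faas(action, layout):
    """Lift a hand-specific action vector into the shared FAAS vector."""
    shared = np.zeros(len(FAAS_SLOTS))
    for i, slot in enumerate(layout):
        if slot is not None:
            shared[FAAS_SLOTS.index(slot)] = action[i]
    return shared

def from_faas(shared, layout):
    """Project a shared FAAS vector back onto a specific hand's actuators."""
    return np.array([shared[FAAS_SLOTS.index(s)] if s is not None else 0.0
                     for s in layout])

# Cross-hand transfer: an action recorded on hand A, executed on hand B.
a_action = np.array([0.5, 0.2, 0.9, 0.0])
b_action = from_faas(to_faas(a_action, HAND_A), HAND_B)
```

A policy that outputs actions in FAAS coordinates never sees hand-specific indexing, which is what would allow a single network head to drive hands with different DoF counts and permit the zero-shot cross-hand evaluation described above.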