Dexora: Open-source VLA for High-DoF Bimanual Dexterity

📅 2026-05-18
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
Existing vision-language-action (VLA) systems struggle to support high-degree-of-freedom, dexterous bimanual manipulation and lack open-source, end-to-end frameworks. This work proposes the first open-source VLA system tailored for such tasks. It uses a hybrid teleoperation setup that combines an exoskeleton backpack with the Apple Vision Pro, alongside a MuJoCo-based digital twin environment, to collect large-scale real-world and simulated data. The system introduces a discriminator-based, data-quality-aware weighting mechanism for training diffusion-Transformer policies. On dexterous manipulation tasks it achieves an average success rate of 66.7%, 15 percentage points above the baseline's 51.7%, and it reaches 90% on basic tasks while demonstrating strong out-of-distribution generalization and cross-embodiment transfer.
📝 Abstract
Vision-Language-Action (VLA) models have recently become a central direction in embodied AI, but current systems are restricted to either dual-gripper control or single-arm dexterous hand manipulation. While low-dimensional gripper control can often be handled with simpler methods, high-dimensional dexterous hand control benefits greatly from full end-to-end VLA learning. In this work, we introduce Dexora, the first open-source VLA system that natively targets dual-arm, dual-hand high-DoF manipulation. We design a hybrid teleoperation pipeline that decouples gross arm kinematics (captured with a custom exoskeleton backpack) from fine finger motion (markerless hand tracking via Apple Vision Pro), and that drives both a physical dual-arm dual-hand platform and an identical MuJoCo digital twin. Using that interface, we assemble a large training corpus: an embodiment-matched synthetic corpus (100K simulated trajectories, 6.5M frames) and a real-world dataset of 10K teleoperated episodes (2.92M frames). To mitigate noisy teleoperation demonstrations, we propose a data-quality-aware training recipe: an offline discriminator provides clip-level weights for diffusion-transformer policy training, down-weighting low-quality demonstrations. Empirically, Dexora outperforms competitive VLA baselines on both basic and dexterous benchmarks (e.g., average dexterous success 66.7% vs. 51.7%), attains 90% success on basic tasks, and shows robust out-of-distribution and cross-embodiment generalization. Ablations confirm the importance of real data and the discriminator for dexterity.
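The data-quality-aware recipe described in the abstract can be illustrated with a minimal sketch. It assumes the discriminator has already assigned each clip a non-negative quality weight, and that those weights scale a per-clip denoising error. The function name `weighted_denoising_loss`, the mean-squared-error form, and the normalization by the weight sum are illustrative assumptions, not details taken from the paper.

```python
from typing import Sequence


def weighted_denoising_loss(
    pred_noise: Sequence[Sequence[float]],
    true_noise: Sequence[Sequence[float]],
    clip_weights: Sequence[float],
) -> float:
    """Weight-normalized mean of per-clip mean squared errors (hypothetical form)."""
    if not (len(pred_noise) == len(true_noise) == len(clip_weights)):
        raise ValueError("inputs must have one entry per clip")
    total_weight = sum(clip_weights)
    if total_weight <= 0:
        raise ValueError("at least one clip needs a positive weight")

    loss = 0.0
    for pred, true, weight in zip(pred_noise, true_noise, clip_weights):
        # Per-clip denoising error: mean squared difference between the
        # predicted noise vector and the noise that was actually added.
        mse = sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred)
        # Assumed weighting: a low discriminator score shrinks this clip's
        # contribution to the training objective.
        loss += weight * mse
    return loss / total_weight


# Two toy clips: the first is predicted badly, the second perfectly.
pred = [[0.0, 0.0], [1.0, 1.0]]
true = [[1.0, 1.0], [1.0, 1.0]]
uniform = weighted_denoising_loss(pred, true, [1.0, 1.0])
downweighted = weighted_denoising_loss(pred, true, [0.2, 0.8])
print(uniform, downweighted)
```

With uniform weights both clips count equally. Lowering the first clip's weight reduces how much its error drives the loss, which is the intended effect of down-weighting low-quality demonstrations.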
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action
dexterous manipulation
bimanual control
high-DoF
embodied AI
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language-Action (VLA)
bimanual dexterous manipulation
hybrid teleoperation
data-quality-aware training
digital twin
Authors
Zongzheng Zhang
Tsinghua University
Jingrui Pang
Tsinghua University
Zhuo Yang
Xidian University & Shanghai AI Laboratory
Large Language Model, AI for Science
Kun Li
Beijing Academy of Artificial Intelligence
Minwen Liao
Tsinghua University
Saining Zhang
College of Computing and Data Science, Nanyang Technological University
Computer Vision
Guoxuan Chi
Tsinghua University
Mobile Computing, Wireless Sensing, Spatial Intelligence
Jinbang Guo
Beijing Academy of Artificial Intelligence
Huan-ang Gao
Ph.D. student, Tsinghua University
Agent, Vision & Robotics
Modi Shi
Beihang University
embodied ai
Dongyun Ge
Tsinghua University
Yao Mu
Shanghai Jiao Tong University
Jiayuan Gu
Assistant Professor, ShanghaiTech University
Embodied AI, 3D Vision
Rui Chen
Tsinghua University
3D computer vision, Tactile Sensing
Hao Dong
Tenured Associate Professor at Peking University
Embodied AI, Robotics, 3D Vision, Robot Learning, Reinforcement Learning
Huazhe Xu
Tsinghua University
Embodied AI, Reinforcement Learning, Computer Vision, Deep Learning
Li Yi
Tsinghua University
Computer Vision, Computer Graphics, Geometry Processing, Machine Learning
Yixin Zhu
Assistant Professor, Peking University
Computer Vision, Visual Reasoning, Human-Robot Teaming
Hang Zhao
Assistant Professor, Tsinghua University
Multimodal Learning, Autonomous Driving, Robot Learning, Embodied AI
Pengwei Wang
University of Calgary
Computer Science Security
Shanghang Zhang
Peking University
Embodied AI, Foundation Models
Guocai Yao
Beijing Academy of Artificial Intelligence
Jianyu Chen
Assistant Professor, Tsinghua University
AI, Robotics
Hongyang Li
Assistant Professor, University of Hong Kong
Computer Vision, Autonomous Driving, Robotics
Hao Zhao
Tsinghua University
Computer Vision