🤖 AI Summary
This work addresses the limitations of existing vision-language-action (VLA) models in high-degree-of-freedom, bimanual dexterous manipulation, particularly contact-rich in-hand tasks, where progress is held back by scarce high-quality data and by the difficulty of multi-skill coordination and multimodal perception fusion. To overcome these bottlenecks, the authors propose the IMCopilot-MoDE framework. First, atomic skills are trained via reinforcement learning to form the IMCopilot agent, which plays a dual role: it powers a shared-autonomy teleoperation assistant for efficiently collecting high-quality demonstrations, and its skills serve as callable low-level execution primitives. Second, the MoDE-VLA architecture integrates force and tactile feedback through a residual injection mechanism, improving the VLA model's contact awareness without disrupting its pretrained knowledge. Experiments on four increasingly complex dexterous manipulation tasks show that the proposed method doubles the success rate over baseline approaches in contact-intensive scenarios.
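The summary does not spell out how the shared-autonomy assistance works mechanically. A common realization, sketched below, is a weighted blend of the operator's teleoperation command and the action proposed by an RL-trained skill policy. Everything here (`SkillPolicy`, `shared_autonomy_step`, the blend weight `alpha`) is an illustrative assumption, not the paper's actual interface.

```python
# Minimal sketch of shared-autonomy blending between a human teleoperation
# command and an RL-trained atomic-skill policy. Names and the linear-blend
# rule are hypothetical, chosen only to illustrate the idea.
import numpy as np

class SkillPolicy:
    """Stand-in for one RL-trained atomic skill (e.g., in-hand reorientation)."""
    def __init__(self, action_dim: int):
        self.action_dim = action_dim

    def act(self, obs: np.ndarray) -> np.ndarray:
        # A real skill would run its trained policy network on obs;
        # here we return a zero action as a placeholder.
        return np.zeros(self.action_dim)

def shared_autonomy_step(human_cmd: np.ndarray,
                         obs: np.ndarray,
                         skill: SkillPolicy,
                         alpha: float = 0.5) -> np.ndarray:
    """Blend the operator's command with the copilot's proposed action.

    alpha = 0.0 gives the human full control; alpha = 1.0 is fully autonomous.
    """
    copilot_action = skill.act(obs)
    return (1.0 - alpha) * human_cmd + alpha * copilot_action

# Example: a 22-DoF hand where the copilot contributes 70% of each command.
skill = SkillPolicy(action_dim=22)
cmd = shared_autonomy_step(np.ones(22), np.zeros(60), skill, alpha=0.7)
```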
📝 Abstract
While Vision-Language-Action (VLA) models have demonstrated remarkable success in robotic manipulation, their application has largely been confined to low-degree-of-freedom end-effectors performing simple, vision-guided pick-and-place tasks. Extending these models to human-like, bimanual dexterous manipulation, specifically contact-rich in-hand operations, introduces critical challenges in high-fidelity data acquisition, multi-skill learning, and multimodal sensory fusion. In this paper, we propose an integrated framework to address these bottlenecks, built upon two components. First, we introduce IMCopilot (In-hand Manipulation Copilot), a suite of atomic skills trained with reinforcement learning that plays a dual role: it acts as a shared-autonomy assistant that simplifies teleoperation data collection, and it serves as a set of callable low-level execution primitives for the VLA. Second, we present MoDE-VLA (Mixture-of-Dexterous-Experts VLA), an architecture that seamlessly integrates heterogeneous force and tactile modalities into a pretrained VLA backbone. Through a residual injection mechanism, MoDE-VLA enables contact-aware refinement without degrading the model's pretrained knowledge. We validate our approach on four tasks of escalating complexity, demonstrating a doubled success rate over the baseline in dexterous contact-rich tasks.
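The abstract's residual injection of force and tactile signals into a pretrained backbone is commonly implemented as a zero-initialized adapter whose output is added to the backbone's hidden features, so the pretrained model is exactly preserved at the start of training. The PyTorch sketch below shows that pattern under this assumption; the class name, layer sizes, and zero-init choice are illustrative, not the paper's stated design.

```python
# Sketch of residual injection of contact (force/tactile) features into
# hidden states of a pretrained VLA backbone. The zero-initialized output
# projection makes the module a no-op at step 0, so pretrained knowledge
# is undisturbed until training updates the adapter. Names are hypothetical.
import torch
import torch.nn as nn

class ResidualContactInjector(nn.Module):
    def __init__(self, contact_dim: int, hidden_dim: int):
        super().__init__()
        # Encode raw force/tactile readings into the backbone's feature space.
        self.encoder = nn.Sequential(
            nn.Linear(contact_dim, hidden_dim),
            nn.GELU(),
        )
        self.out_proj = nn.Linear(hidden_dim, hidden_dim)
        nn.init.zeros_(self.out_proj.weight)  # zero init: identity at step 0
        nn.init.zeros_(self.out_proj.bias)

    def forward(self, vla_hidden: torch.Tensor,
                contact: torch.Tensor) -> torch.Tensor:
        # vla_hidden: (batch, hidden_dim) features from the pretrained backbone
        # contact:    (batch, contact_dim) fused force/tactile readings
        return vla_hidden + self.out_proj(self.encoder(contact))

# Example: inject 64-D contact features into a 1024-D backbone state.
injector = ResidualContactInjector(contact_dim=64, hidden_dim=1024)
refined = injector(torch.randn(8, 1024), torch.randn(8, 64))
```

Because the residual path starts at zero, fine-tuning can learn contact-aware corrections gradually, which is one standard way to add a new modality without catastrophic interference with pretrained weights.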