🤖 AI Summary
This work addresses the limitations of existing vision-language-action (VLA) models in high-degree-of-freedom, bimanual dexterous manipulation, particularly contact-rich in-hand tasks, where progress is held back by scarce high-quality data and by the difficulty of multi-skill coordination and multimodal perception fusion. To overcome these bottlenecks, the authors propose the IMCopilot-MoDE framework. First, atomic skills are trained via reinforcement learning to form the IMCopilot agent, which plays a dual role: it powers a shared-autonomy teleoperation assistant for efficiently collecting high-quality demonstrations, and its skills serve as callable low-level execution primitives. Second, the MoDE-VLA architecture integrates force and tactile feedback through a residual injection mechanism, improving the VLA model's contact awareness without disrupting its pretrained knowledge. Experiments on four increasingly complex dexterous manipulation tasks show that the proposed method doubles the success rate over baseline approaches in contact-intensive scenarios.
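The summary does not spell out how the shared-autonomy assistance works mechanically. A common realization, sketched below, is a weighted blend of the operator's teleoperation command and the action proposed by an RL-trained skill policy. Everything here (`SkillPolicy`, `shared_autonomy_step`, the blend weight `alpha`) is an illustrative assumption, not the paper's actual interface.

```python
# Minimal sketch of shared-autonomy blending between a human teleoperation
# command and an RL-trained atomic-skill policy. Names and the linear-blend
# rule are hypothetical, chosen only to illustrate the idea.
import numpy as np

class SkillPolicy:
    """Stand-in for one RL-trained atomic skill (e.g., in-hand reorientation)."""
    def __init__(self, action_dim: int):
        self.action_dim = action_dim

    def act(self, obs: np.ndarray) -> np.ndarray:
        # A real skill would run its trained policy network on obs;
        # here we return a zero action as a placeholder.
        return np.zeros(self.action_dim)

def shared_autonomy_step(human_cmd: np.ndarray,
                         obs: np.ndarray,
                         skill: SkillPolicy,
                         alpha: float = 0.5) -> np.ndarray:
    """Blend the operator's command with the copilot's proposed action.

    alpha = 0.0 gives the human full control; alpha = 1.0 is fully autonomous.
    """
    copilot_action = skill.act(obs)
    return (1.0 - alpha) * human_cmd + alpha * copilot_action

# Example: a 22-DoF hand where the copilot contributes 70% of each command.
skill = SkillPolicy(action_dim=22)
cmd = shared_autonomy_step(np.ones(22), np.zeros(60), skill, alpha=0.7)
```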
📝 Abstract
While Vision-Language-Action (VLA) models have demonstrated remarkable success in robotic manipulation, their application has largely been confined to low-degree-of-freedom end-effectors performing simple, vision-guided pick-and-place tasks. Extending these models to human-like, bimanual dexterous manipulation, specifically contact-rich in-hand operations, introduces critical challenges in high-fidelity data acquisition, multi-skill learning, and multimodal sensory fusion. In this paper, we propose an integrated framework to address these bottlenecks, built upon two components. First, we introduce IMCopilot (In-hand Manipulation Copilot), a suite of atomic skills trained with reinforcement learning that plays a dual role: it acts as a shared-autonomy assistant that simplifies teleoperation data collection, and it serves as a set of callable low-level execution primitives for the VLA. Second, we present MoDE-VLA (Mixture-of-Dexterous-Experts VLA), an architecture that seamlessly integrates heterogeneous force and tactile modalities into a pretrained VLA backbone. Through a residual injection mechanism, MoDE-VLA enables contact-aware refinement without degrading the model's pretrained knowledge. We validate our approach on four tasks of escalating complexity, demonstrating a doubled success rate over the baseline in dexterous contact-rich tasks.
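The abstract's residual injection of force and tactile signals into a pretrained backbone is commonly implemented as a zero-initialized adapter whose output is added to the backbone's hidden features, so the pretrained model is exactly preserved at the start of training. The PyTorch sketch below shows that pattern under this assumption; the class name, layer sizes, and zero-init choice are illustrative, not the paper's stated design.

```python
# Sketch of residual injection of contact (force/tactile) features into
# hidden states of a pretrained VLA backbone. The zero-initialized output
# projection makes the module a no-op at step 0, so pretrained knowledge
# is undisturbed until training updates the adapter. Names are hypothetical.
import torch
import torch.nn as nn

class ResidualContactInjector(nn.Module):
    def __init__(self, contact_dim: int, hidden_dim: int):
        super().__init__()
        # Encode raw force/tactile readings into the backbone's feature space.
        self.encoder = nn.Sequential(
            nn.Linear(contact_dim, hidden_dim),
            nn.GELU(),
        )
        self.out_proj = nn.Linear(hidden_dim, hidden_dim)
        nn.init.zeros_(self.out_proj.weight)  # zero init: identity at step 0
        nn.init.zeros_(self.out_proj.bias)

    def forward(self, vla_hidden: torch.Tensor,
                contact: torch.Tensor) -> torch.Tensor:
        # vla_hidden: (batch, hidden_dim) features from the pretrained backbone
        # contact:    (batch, contact_dim) fused force/tactile readings
        return vla_hidden + self.out_proj(self.encoder(contact))

# Example: inject 64-D contact features into a 1024-D backbone state.
injector = ResidualContactInjector(contact_dim=64, hidden_dim=1024)
refined = injector(torch.randn(8, 1024), torch.randn(8, 64))
```

Because the residual path starts at zero, fine-tuning can learn contact-aware corrections gradually, which is one standard way to add a new modality without catastrophic interference with pretrained weights.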