TwinVLA: Data-Efficient Bimanual Manipulation with Twin Single-Arm Vision-Language-Action Models

📅 2025-11-07
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the reliance of bimanual manipulation on extensive task-specific training data and fine-tuning, this paper proposes TwinVLA, a modular bimanual coordination framework that requires neither bimanual pretraining nor proprietary datasets. Its core innovation is to couple two copies of a pretrained single-arm vision-language-action (VLA) model via parameter sharing and cross-arm attention, enabling implicit action coordination at inference time. By reusing existing single-arm policies, TwinVLA achieves high data efficiency without retraining a monolithic model from scratch. Experiments show that TwinVLA outperforms the comparably sized monolithic model RDT-1B on real-world and simulated bimanual manipulation benchmarks and approaches the performance of the state-of-the-art model π₀, which relies on massive proprietary data, validating the feasibility and effectiveness of composing bimanual intelligence from pretrained single-arm VLA models.
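
The summary above describes the core mechanism only in prose. As a concrete illustration, here is a minimal PyTorch sketch of the twin idea: the same transformer block (shared weights) processes both arms' token streams, and self-attention over the concatenated streams lets each arm attend to the other (cross-arm attention). All names (`SharedArmBlock`, `TwinVLASketch`), dimensions, and the pooling/action-head details are hypothetical illustrations, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class SharedArmBlock(nn.Module):
    """One transformer block applied to both arm token streams.

    The same module (hence the same weights) processes the left- and
    right-arm streams; attention over the concatenated streams lets
    each arm attend to the other's tokens (cross-arm attention).
    """

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, left: torch.Tensor, right: torch.Tensor):
        # Concatenate the two arm streams so self-attention spans both
        # arms; this is where implicit coordination can emerge.
        joint = torch.cat([left, right], dim=1)
        q = self.norm1(joint)
        joint = joint + self.attn(q, q, q, need_weights=False)[0]
        joint = joint + self.mlp(self.norm2(joint))
        # Split back into per-arm streams for the per-arm action heads.
        n = left.shape[1]
        return joint[:, :n], joint[:, n:]


class TwinVLASketch(nn.Module):
    """Hypothetical twin policy: shared blocks, one shared action head."""

    def __init__(self, dim=256, num_heads=8, depth=4, action_dim=7):
        super().__init__()
        self.blocks = nn.ModuleList(
            [SharedArmBlock(dim, num_heads) for _ in range(depth)]
        )
        # One action head reused for both arms (parameter sharing).
        self.action_head = nn.Linear(dim, action_dim)

    def forward(self, left_tokens, right_tokens):
        for blk in self.blocks:
            left_tokens, right_tokens = blk(left_tokens, right_tokens)
        # Pool each stream and decode a per-arm action.
        return (
            self.action_head(left_tokens.mean(dim=1)),
            self.action_head(right_tokens.mean(dim=1)),
        )


if __name__ == "__main__":
    model = TwinVLASketch()
    left = torch.randn(2, 16, 256)   # e.g. left-arm vision/language/state tokens
    right = torch.randn(2, 16, 256)  # e.g. right-arm tokens
    a_left, a_right = model(left, right)
    print(a_left.shape, a_right.shape)  # torch.Size([2, 7]) torch.Size([2, 7])
```

Note that nothing in this sketch supervises coordination explicitly; any coordination must emerge through the joint attention, which mirrors the paper's claim of implicit action coordination during inference.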

📝 Abstract
Vision-language-action models (VLAs) trained on large-scale robotic datasets have demonstrated strong performance on manipulation tasks, including bimanual tasks. However, because most public datasets focus on single-arm demonstrations, adapting VLAs for bimanual tasks typically requires substantial additional bimanual data and fine-tuning. To address this challenge, we introduce TwinVLA, a modular framework that composes two copies of a pretrained single-arm VLA into a coordinated bimanual VLA. Unlike monolithic cross-embodiment models trained on mixtures of single-arm and bimanual data, TwinVLA improves both data efficiency and performance by composing pretrained single-arm policies. Across diverse bimanual tasks in real-world and simulation settings, TwinVLA outperforms a comparably sized monolithic RDT-1B model without requiring any bimanual pretraining. Furthermore, it narrows the gap to the state-of-the-art model $\pi_0$, which relies on extensive proprietary bimanual data and compute. These results establish our modular composition approach as a data-efficient and scalable path toward high-performance bimanual manipulation, leveraging public single-arm data.
Problem

Research questions and friction points this paper is trying to address.

Adapting vision-language-action models for bimanual tasks efficiently
Reducing dependency on large-scale bimanual training data
Composing single-arm models into coordinated bimanual manipulation systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Composes two single-arm VLAs for bimanual tasks
Uses modular framework without bimanual pretraining
Improves data efficiency with public single-arm data
🔎 Similar Papers
2024-04-02 · IEEE/RSJ International Conference on Intelligent Robots and Systems · Citations: 0