🤖 AI Summary
This work addresses the challenge of combinatorial explosion in bimanual manipulation, where naively combining single-arm skills leads to poor generalization and redundant learning. To overcome this, we propose the first bimanual vision-language-action (VLA) model that explicitly supports skill reuse: by decoupling the representations of left- and right-arm skills, it enables efficient recombination of learned single-arm skills in novel bimanual configurations. Our approach employs a modular architecture for joint vision-language-action modeling and achieves substantial improvements in task success without retraining, raising performance from 0% to 51% on compositional tasks, while demonstrating strong generalization in collaborative and long-horizon scenarios.
📝 Abstract
Recent progress in vision-language-action (VLA) models has demonstrated strong potential for dual-arm manipulation, enabling complex behaviors and generalization to unseen environments. However, mainstream bimanual VLA formulations largely overlook the critical challenge of combinatorial diversity: different pairings of single-arm behaviors can induce qualitatively distinct task behaviors, yet existing models do not explicitly account for this structure. We argue that effective bimanual VLAs should support skill reuse, the ability to recombine previously learned single-arm skills across novel left-right pairings, thereby avoiding the need to separately learn every possible combination. Current VLA designs entangle skills across arms, preventing such recomposition and limiting scalability. To address this limitation, we propose SkillVLA, a framework explicitly designed to enable skill reuse in dual-arm manipulation. Extensive experiments demonstrate that SkillVLA substantially improves skill composition, increasing overall success rate from 0% to 51%, and achieves strong performance on cooperative and long-horizon tasks.