🤖 AI Summary
This work addresses the limited flexibility of traditional robot programming and the shortcomings of existing learning-based assembly methods in positional generalization, multi-stage task design, and multi-skill integration. The authors propose an end-to-end autoregressive trajectory generation framework that directly maps multimodal inputs, including RGB-D images, natural language instructions, and proprioceptive signals, to manipulation trajectories, abandoning the conventional paradigm of decoupled perception and control. The approach fuses multimodal features to interpret task semantics, leverages autoregressive sequence modeling to produce temporally coherent trajectories, and integrates a Mixture-of-Experts (MoE) architecture for efficient multi-skill learning within a single model. Evaluated on eight distinct skills from a pressure-reducing valve assembly task, the method achieves a 96.3% average grasp success rate and a 91.8% overall success rate in simulation, demonstrating strong generalization and practical applicability to real-world scenarios.
📝 Abstract
Flexible manufacturing requires robot systems that can adapt to constantly changing tasks, objects, and environments. However, traditional robot programming is labor-intensive and inflexible, while existing learning-based assembly methods often suffer from weak positional generalization, complex multi-stage designs, and limited multi-skill integration. To address these issues, this paper proposes ATG-MoE, an end-to-end autoregressive trajectory generation method with a mixture of experts for learning assembly skills from demonstration. The proposed method establishes a closed-loop mapping from multi-modal inputs, including RGB-D observations, natural language instructions, and robot proprioception, to manipulation trajectories. It integrates multi-modal feature fusion for scene and task understanding, autoregressive sequence modeling for temporally coherent trajectory generation, and a mixture-of-experts architecture for unified multi-skill learning. In contrast to conventional methods that separate visual perception from control or train each skill independently, ATG-MoE directly incorporates visual information into trajectory generation and supports efficient multi-skill integration within a single model. We train and evaluate the proposed method on eight representative assembly skills from a pressure-reducing valve assembly task. Experimental results show that ATG-MoE achieves strong overall performance in simulation, with an average grasp success rate of 96.3% and an average overall success rate of 91.8%, while also demonstrating strong generalization and effective multi-skill integration. Real-world experiments further verify its practicality for multi-skill industrial assembly. The project page can be found at https://hwh23.github.io/ATG-MoE
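To make the overall idea concrete, the sketch below illustrates (in heavily simplified form) how an autoregressive trajectory generator with expert gating can be structured: a fused multimodal context vector conditions a step-by-step loop, and a softmax gate mixes the outputs of several expert networks at each step. This is a hypothetical toy implementation for intuition only; the dimensions, the linear experts, and the `generate_trajectory` function are assumptions, not the authors' actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes for illustration (not from the paper).
CTX_DIM, ACT_DIM, N_EXPERTS = 16, 7, 4

# Each "expert" here is a single linear map from [context; previous action]
# to the next waypoint; in a real MoE model these would be neural networks.
experts = [rng.standard_normal((CTX_DIM + ACT_DIM, ACT_DIM)) * 0.1
           for _ in range(N_EXPERTS)]
gate_w = rng.standard_normal((CTX_DIM + ACT_DIM, N_EXPERTS)) * 0.1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def generate_trajectory(context, steps=8):
    """Autoregressively emit waypoints, mixing expert outputs per step."""
    action = np.zeros(ACT_DIM)
    traj = []
    for _ in range(steps):
        inp = np.concatenate([context, action])
        weights = softmax(inp @ gate_w)            # gate: route over experts
        outs = np.stack([inp @ w for w in experts])
        action = weights @ outs                    # weighted expert mixture
        traj.append(action)                        # feed back autoregressively
    return np.stack(traj)

# Stand-in for fused RGB-D / language / proprioception features.
context = rng.standard_normal(CTX_DIM)
traj = generate_trajectory(context)
print(traj.shape)  # one waypoint per step: (8, 7)
```

The key structural points this toy mirrors are that the same conditioned model serves all skills (the gate selects the expert mixture per input) and that each waypoint is generated conditioned on the previous one, which is what yields temporally coherent trajectories.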