🤖 AI Summary
This work addresses the need in industrial assembly for jointly modeling multimodal instructions, 3D part relationships, and physically plausible six-degree-of-freedom (6-DoF) motion trajectories. The authors introduce AssemblyBench, a synthetic dataset of 2,789 industrial objects, and propose AssemblyDyno, an end-to-end Transformer-based model that fuses assembly manuals with 3D point clouds to jointly predict assembly sequences and feasible motion trajectories. Unlike existing datasets, which focus on simplified scenarios, the work combines multimodal semantics, geometric structure, and physical constraints for industrial assemblies. AssemblyDyno outperforms prior methods in both assembly pose estimation and trajectory feasibility, with the latter evaluated in physics-based simulation.
📝 Abstract
Assembling objects from parts requires understanding multimodal instructions, linking them to 3D components, and predicting physically plausible 6-DoF motions for each assembly step. Existing datasets focus on simplified scenarios, overlooking the shape complexity and assembly trajectories of industrial assemblies. We introduce AssemblyBench, a synthetic dataset of 2,789 industrial objects with multimodal instruction manuals, corresponding 3D part models, and part assembly trajectories. We also propose a transformer-based model, AssemblyDyno, which uses the instruction manual and the 3D shape of each part to jointly predict assembly order and part assembly trajectories. AssemblyDyno outperforms prior work in both assembly pose estimation and trajectory feasibility, with the latter evaluated in our physics-based simulations.