🤖 AI Summary
This work proposes RigMo, the first end-to-end unsupervised framework that jointly learns explicit skeletal structures and dynamic motion directly from raw mesh sequences, eliminating the need for manually annotated rigs or skinning weights. Traditional 4D generation methods decouple rigging from motion modeling, which hinders scalability and interpretability. RigMo introduces a dual latent space for rig and motion, an explicit Gaussian-based skeletal representation, structure-aware encoding, and SE(3) transformation sequence modeling, further integrated with Motion-DiT for downstream motion generation. Evaluated on DeformingThings4D, Objaverse-XL, and TrueBones, RigMo achieves superior reconstruction quality and strong cross-category generalization, and produces smooth, physically plausible, interpretable animatable 3D objects.
📝 Abstract
Despite significant progress in 4D generation, rig and motion, the core structural and dynamic components of animation, are typically modeled as separate problems. Existing pipelines rely on ground-truth skeletons and skinning weights for motion generation and treat auto-rigging as an independent process, undermining scalability and interpretability. We present RigMo, a unified generative framework that jointly learns rig and motion directly from raw mesh sequences, without any human-provided rig annotations. RigMo encodes per-vertex deformations into two compact latent spaces: a rig latent that decodes into explicit Gaussian bones and skinning weights, and a motion latent that produces time-varying SE(3) transformations. Together, these outputs define an animatable mesh with explicit structure and coherent motion, enabling feed-forward rig and motion inference for deformable objects. Beyond unified rig-motion discovery, we introduce a Motion-DiT model operating in RigMo's latent space and demonstrate that these structure-aware latents can naturally support downstream motion generation tasks. Experiments on DeformingThings4D, Objaverse-XL, and TrueBones demonstrate that RigMo learns smooth, interpretable, and physically plausible rigs, while achieving superior reconstruction and category-level generalization compared to existing auto-rigging and deformation baselines. RigMo establishes a new paradigm for unified, structure-aware, and scalable dynamic 3D modeling.
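The abstract describes RigMo's output as skinning weights plus per-frame SE(3) bone transforms that together animate a mesh. A standard way to apply such outputs is linear blend skinning; the sketch below is a generic, hedged illustration of that operation in NumPy, not RigMo's actual decoder (the function name and array shapes are assumptions for illustration).

```python
import numpy as np

def linear_blend_skinning(vertices, weights, transforms):
    """Deform rest-pose vertices with per-bone rigid transforms.

    vertices:   (V, 3) rest-pose vertex positions
    weights:    (V, B) skinning weights; each row sums to 1
    transforms: (B, 4, 4) per-bone SE(3) matrices for one frame
    Returns (V, 3) deformed positions. Hypothetical interface,
    chosen only to illustrate how bones + weights animate a mesh.
    """
    num_verts = vertices.shape[0]
    # Homogeneous coordinates: (V, 4)
    homo = np.concatenate([vertices, np.ones((num_verts, 1))], axis=1)
    # Apply every bone transform to every vertex: (B, V, 3)
    per_bone = np.einsum('bij,vj->bvi', transforms, homo)[..., :3]
    # Blend the per-bone results by skinning weights: (V, 3)
    return np.einsum('vb,bvi->vi', weights, per_bone)
```

With identity transforms the mesh is unchanged; giving one bone a translation moves exactly the vertices weighted to it, which is the sense in which explicit bones make the learned deformation interpretable.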