🤖 AI Summary
This work addresses the challenging problem of automatic rigging and animation generation for static 3D models. We propose the first end-to-end differentiable framework that jointly predicts skeletal structures, infers skinning weights, and synthesizes high-fidelity animations. Methodologically, we design a joint-level autoregressive Transformer incorporating topology-aware joint attention and hierarchical sequential modeling, augmented with stochastic perturbations for robust learning. A unified differentiable rendering and optimization pipeline co-optimizes skeletal geometry, skinning quality, and temporal animation coherence. Evaluated on multiple benchmarks, our approach significantly outperforms state-of-the-art methods: it reduces skeletal prediction error by 21.3%, improves skinning quality by 18.7%, generates temporally coherent, jitter-free animations, and achieves a 35% speedup in inference time.
📝 Abstract
Modern interactive applications increasingly demand dynamic 3D content, yet the transformation of static 3D models into animated assets constitutes a significant bottleneck in content creation pipelines. While recent advances in generative AI have revolutionized static 3D model creation, rigging and animation continue to depend heavily on expert intervention. We present Puppeteer, a comprehensive framework that addresses both automatic rigging and animation for diverse 3D objects. Our system first predicts plausible skeletal structures via an auto-regressive transformer that introduces a joint-based tokenization strategy for compact representation and a hierarchical ordering methodology with stochastic perturbation that enhances bidirectional learning capabilities. It then infers skinning weights via an attention-based architecture incorporating topology-aware joint attention that explicitly encodes inter-joint relationships based on skeletal graph distances. Finally, we complement these rigging advances with a differentiable optimization-based animation pipeline that generates stable, high-fidelity animations while being computationally more efficient than existing approaches. Extensive evaluations across multiple benchmarks demonstrate that our method significantly outperforms state-of-the-art techniques in both skeletal prediction accuracy and skinning quality. The system robustly processes diverse 3D content, ranging from professionally designed game assets to AI-generated shapes, producing temporally coherent animations that eliminate the jittering issues common in existing methods.