🤖 AI Summary
This work addresses the inefficiency, poor generalization, and decoupling from skeleton generation inherent in existing automatic rigging pipelines for 3D generative models, where skinning weights are modeled as high-dimensional continuous regression tasks. To overcome these limitations, the authors propose SkinTokens, a compact, discrete representation of skinning weights learned via a finite scalar quantized conditional variational autoencoder (FSQ-CVAE) that exploits the intrinsic sparsity of skinning fields. They further introduce TokenRig, a unified autoregressive framework that jointly models skeletal parameters and SkinTokens as a single sequence, enabling end-to-end co-generation of skeletons and skinning. Reinforcement learning with geometric and semantic rewards is employed to refine the joint prediction. Experiments demonstrate a 98%–133% improvement in skinning accuracy over current methods and a 17%–22% gain in skeleton prediction accuracy after reinforcement learning optimization, significantly enhancing generalization to complex and out-of-distribution assets.
📄 Abstract
The rapid proliferation of generative 3D models has created a critical bottleneck in animation pipelines: rigging. Existing automated methods are fundamentally limited by their approach to skinning, treating it as an ill-posed, high-dimensional regression task that is inefficient to optimize and is typically decoupled from skeleton generation. We posit this is a representation problem and introduce SkinTokens: a learned, compact, and discrete representation for skinning weights. By leveraging an FSQ-CVAE to capture the intrinsic sparsity of skinning, we reframe the task from continuous regression to a more tractable token sequence prediction problem. This representation enables TokenRig, a unified autoregressive framework that models the entire rig as a single sequence of skeletal parameters and SkinTokens, capturing the intricate dependencies between skeletons and skin deformations. The unified model is then amenable to a reinforcement learning stage, where tailored geometric and semantic rewards improve generalization to complex, out-of-distribution assets. Quantitatively, the SkinTokens representation yields a 98%–133% improvement in skinning accuracy over state-of-the-art methods, while the full TokenRig framework, refined with RL, improves bone prediction by 17%–22%. Our work presents a unified, generative approach to rigging that yields higher fidelity and robustness, offering a scalable solution to a long-standing challenge in 3D content creation.
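The core mechanism behind SkinTokens is finite scalar quantization (FSQ), which discretizes each latent dimension onto a small fixed grid so that a latent vector maps to a single token id. The paper does not give implementation details, so the following is only a minimal sketch of generic FSQ tokenization in NumPy; the function names, the per-dimension level counts, and the example latent are all hypothetical, and a real FSQ-CVAE would apply this inside a trained encoder with a straight-through gradient.

```python
import numpy as np

def fsq_quantize(z, levels):
    """Finite scalar quantization (sketch): bound each latent
    dimension with tanh, scale by (L-1)/2, and round to the
    nearest integer level. Odd level counts keep levels symmetric
    around zero."""
    half = (np.asarray(levels, dtype=float) - 1.0) / 2.0
    return np.round(np.tanh(z) * half)

def fsq_token_index(codes, levels):
    """Map a per-dimension code vector to one flat token id via
    mixed-radix encoding over the level counts."""
    levels = np.asarray(levels)
    half = (levels - 1) // 2
    digits = (codes + half).astype(int)  # shift codes to 0..L-1
    index = 0
    for d, L in zip(digits, levels):
        index = index * int(L) + int(d)
    return index

# Hypothetical example: 3 latent dims with 5, 5, and 3 levels.
levels = [5, 5, 3]
z = np.array([0.3, -2.0, 1.5])   # raw encoder latent (made up)
codes = fsq_quantize(z, levels)  # -> array([ 1., -2.,  1.])
token = fsq_token_index(codes, levels)
print(codes, token)
```

Because the grid is fixed rather than learned (unlike a VQ-VAE codebook), FSQ avoids codebook-collapse issues, which is presumably why the authors chose it for compressing sparse skinning fields into short token sequences.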