🤖 AI Summary
This work addresses the inefficiency, poor generalization, and decoupling from skeleton generation inherent in existing automatic rigging pipelines for 3D generative models, where skinning weights are modeled as high-dimensional continuous regression tasks. To overcome these limitations, the authors propose SkinTokens, a compact, discrete representation of skinning weights learned via a finite scalar quantized conditional variational autoencoder (FSQ-CVAE) that exploits the intrinsic sparsity of skinning fields. They further introduce TokenRig, a unified autoregressive framework that jointly models skeletal parameters and SkinTokens as a single sequence, enabling end-to-end co-generation of skeletons and skinning. Reinforcement learning with geometric and semantic rewards is employed to refine the joint prediction. Experiments demonstrate a 98%–133% improvement in skinning accuracy over current methods and a 17%–22% gain in skeleton prediction accuracy after reinforcement learning optimization, significantly enhancing generalization to complex and out-of-distribution assets.
📄 Abstract
The rapid proliferation of generative 3D models has created a critical bottleneck in animation pipelines: rigging. Existing automated methods are fundamentally limited by their approach to skinning, treating it as an ill-posed, high-dimensional regression task that is inefficient to optimize and is typically decoupled from skeleton generation. We posit this is a representation problem and introduce SkinTokens: a learned, compact, and discrete representation for skinning weights. By leveraging an FSQ-CVAE to capture the intrinsic sparsity of skinning, we reframe the task from continuous regression to a more tractable token sequence prediction problem. This representation enables TokenRig, a unified autoregressive framework that models the entire rig as a single sequence of skeletal parameters and SkinTokens, capturing the intricate dependencies between skeletons and skin deformations. The unified model is then amenable to a reinforcement learning stage, where tailored geometric and semantic rewards improve generalization to complex, out-of-distribution assets. Quantitatively, the SkinTokens representation yields a 98%–133% improvement in skinning accuracy over state-of-the-art methods, while the full TokenRig framework, refined with RL, improves bone prediction by 17%–22%. Our work presents a unified, generative approach to rigging that yields higher fidelity and robustness, offering a scalable solution to a long-standing challenge in 3D content creation.
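The core mechanism behind SkinTokens is finite scalar quantization (FSQ), which discretizes each latent dimension onto a small fixed grid so that a latent vector maps to a single token id. The paper does not give implementation details, so the following is only a minimal sketch of generic FSQ tokenization in NumPy; the function names, the per-dimension level counts, and the example latent are all hypothetical, and a real FSQ-CVAE would apply this inside a trained encoder with a straight-through gradient.

```python
import numpy as np

def fsq_quantize(z, levels):
    """Finite scalar quantization (sketch): bound each latent
    dimension with tanh, scale by (L-1)/2, and round to the
    nearest integer level. Odd level counts keep levels symmetric
    around zero."""
    half = (np.asarray(levels, dtype=float) - 1.0) / 2.0
    return np.round(np.tanh(z) * half)

def fsq_token_index(codes, levels):
    """Map a per-dimension code vector to one flat token id via
    mixed-radix encoding over the level counts."""
    levels = np.asarray(levels)
    half = (levels - 1) // 2
    digits = (codes + half).astype(int)  # shift codes to 0..L-1
    index = 0
    for d, L in zip(digits, levels):
        index = index * int(L) + int(d)
    return index

# Hypothetical example: 3 latent dims with 5, 5, and 3 levels.
levels = [5, 5, 3]
z = np.array([0.3, -2.0, 1.5])   # raw encoder latent (made up)
codes = fsq_quantize(z, levels)  # -> array([ 1., -2.,  1.])
token = fsq_token_index(codes, levels)
print(codes, token)
```

Because the grid is fixed rather than learned (unlike a VQ-VAE codebook), FSQ avoids codebook-collapse issues, which is presumably why the authors chose it for compressing sparse skinning fields into short token sequences.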