TokenDance: Token-to-Token Music-to-Dance Generation with Bidirectional Mamba

📅 2026-03-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing 3D dance datasets suffer from limited coverage, resulting in generated dances that lack diversity, expressiveness, and alignment with music. To address this limitation, the paper proposes a two-stage non-autoregressive music-to-dance generation framework. The first stage employs decoupled upper- and lower-body motion quantization along with a dual-codebook architecture that captures both semantic and acoustic features of music, enabling effective cross-modal discretization. The second stage introduces a bidirectional Mamba-based local-global-local token generator to efficiently synthesize high-fidelity dance sequences. By integrating finite scalar quantization with kinematic and dynamic constraints, the proposed method achieves state-of-the-art performance in terms of generation quality, musical synchronization, and inference speed, significantly outperforming existing approaches.
📝 Abstract
Music-to-dance generation has broad applications in virtual reality, dance education, and digital character animation. However, the limited coverage of existing 3D dance datasets confines current models to a narrow subset of music styles and choreographic patterns, resulting in poor generalization to real-world music. Consequently, generated dances often become overly simplistic and repetitive, substantially degrading expressiveness and realism. To tackle this problem, we present TokenDance, a two-stage music-to-dance generation framework that explicitly addresses this limitation through dual-modality tokenization and efficient token-level generation. In the first stage, we discretize both dance and music using Finite Scalar Quantization, where dance motions are factorized into upper- and lower-body components with kinematic-dynamic constraints, and music is decomposed into semantic and acoustic features with dedicated codebooks to capture choreography-specific structures. In the second stage, we introduce a Local-Global-Local token-to-token generator built on a Bidirectional Mamba backbone, enabling coherent motion synthesis, strong music-dance alignment, and efficient non-autoregressive inference. Extensive experiments demonstrate that TokenDance achieves overall state-of-the-art (SOTA) performance in both generation quality and inference speed, highlighting its effectiveness and practical value for real-world music-to-dance applications.
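The Finite Scalar Quantization used for discretization can be illustrated with a minimal sketch of the standard FSQ recipe: bound each latent channel to a small integer grid, round it, and read off a positional code index. The function names and level choices below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def fsq_quantize(z, levels):
    """Round each channel of z (..., d) onto a grid with levels[i] values."""
    half = (np.asarray(levels) - 1) / 2
    bounded = np.tanh(z) * half  # squash channel i into [-half_i, half_i]
    return np.round(bounded)     # nearest grid point; codebook is implicit

def fsq_code_index(q, levels):
    """Map a quantized vector to a single integer in [0, prod(levels))."""
    levels = np.asarray(levels)
    digits = (q + (levels - 1) // 2).astype(int)        # shift to [0, L_i - 1]
    basis = np.concatenate(([1], np.cumprod(levels[:-1])))  # mixed-radix weights
    return (digits * basis).sum(axis=-1)
```

With `levels = [3, 3]` the implicit codebook has 9 entries; a zero latent lands on the central grid point (index 4), and large activations saturate at the grid boundary. Training would additionally use a straight-through gradient for the rounding step, omitted here.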
Problem

Research questions and friction points this paper is trying to address.

music-to-dance generation
3D dance dataset
generalization
expressiveness
realism
Innovation

Methods, ideas, or system contributions that make the work stand out.

Token-to-Token Generation
Bidirectional Mamba
Finite Scalar Quantization
Music-to-Dance Synthesis
Non-Autoregressive Inference
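The bidirectional idea behind the contributions above — scanning the token sequence in both directions so every position conditions on past and future context, while all positions are emitted in parallel (non-autoregressively) — can be sketched with a toy linear recurrence. This is a stand-in for the Mamba selective-state-space block the paper uses; the per-channel `decay` and the additive fusion of the two directions are illustrative assumptions.

```python
import numpy as np

def linear_scan(x, decay):
    """Causal linear recurrence: h_t = decay * h_{t-1} + x_t.

    x: (T, d) token features; decay: (d,) per-channel forgetting factor.
    """
    h = np.zeros(x.shape[1])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = decay * h + x[t]
        out[t] = h
    return out

def bidirectional_scan(x, decay):
    """Run the scan forward and backward and fuse the two passes."""
    fwd = linear_scan(x, decay)
    bwd = linear_scan(x[::-1], decay)[::-1]
    return fwd + bwd  # each position now sees both past and future tokens
```

Because both scans are linear-time in sequence length and every output position is produced in one pass, this kind of backbone supports the efficient non-autoregressive inference the paper highlights, in contrast to step-by-step autoregressive decoding.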