BiTDiff: Fine-Grained 3D Conducting Motion Generation via BiMamba-Transformer Diffusion

📅 2026-04-05
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
Prior work on 3D conducting motion generation is limited by the lack of large-scale, fine-grained 3D conducting datasets and by the absence of methods that generate long motion sequences with both high quality and efficiency. This work introduces CM-Data, a fine-grained SMPL-X dataset with about 10 hours of conducting motion, and proposes BiTDiff, a framework that combines a BiMamba-Transformer hybrid architecture with diffusion modeling. BiTDiff uses human-kinematic decomposition with hand- and body-specific forward kinematics, auxiliary physical-consistency losses, and Transformer-based cross-modal semantic alignment to generate fine-grained 3D conducting motion from music. On CM-Data, the authors report state-of-the-art performance; the method also supports memory-efficient long-sequence modeling and training-free, joint-level motion editing.
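
As a rough illustration of what "bidirectional" sequence modeling means here, the sketch below runs one toy linear recurrence over a sequence in both temporal directions and sums the two passes. This is an assumption-laden stand-in, not the paper's layer: a real Mamba block uses learned, input-dependent state-space parameters, and the names `scan`, `bidirectional_scan`, and `decay` are hypothetical.

```python
def scan(xs, decay):
    """Run the toy recurrence h[t] = decay * h[t-1] + x[t] over a sequence."""
    h, out = 0.0, []
    for x in xs:
        h = decay * h + x
        out.append(h)
    return out


def bidirectional_scan(xs, decay=0.5):
    """Sum a left-to-right pass and a right-to-left pass of the same recurrence."""
    forward = scan(xs, decay)
    # Reverse the input, scan it, then reverse the result back into time order.
    backward = scan(xs[::-1], decay)[::-1]
    return [f + b for f, b in zip(forward, backward)]


signal = [1.0, 0.0, 0.0, 0.0]
mixed = bidirectional_scan(signal)
```

Each pass touches every frame once, so the cost of this toy version grows linearly with sequence length, which is the property that makes recurrent scans attractive for long motion sequences.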

๐Ÿ“ Abstract
3D conducting motion generation aims to synthesize fine-grained conductor motions from music, with broad potential in music education, virtual performance, digital human animation, and human-AI co-creation. However, this task remains underexplored due to two major challenges: (1) the lack of large-scale fine-grained 3D conducting datasets and (2) the absence of effective methods that can jointly support long-sequence generation with high quality and efficiency. To address the data limitation, we develop a quality-oriented 3D conducting motion collection pipeline and construct CM-Data, a fine-grained SMPL-X dataset with about 10 hours of conducting motion data. To the best of our knowledge, CM-Data is the first and largest public dataset for 3D conducting motion generation. To address the methodological limitation, we propose BiTDiff, a novel framework for 3D conducting motion generation, built upon a BiMamba-Transformer hybrid model architecture for efficient long-sequence modeling and a diffusion-based generative strategy with human-kinematic decomposition for high-quality motion synthesis. Specifically, BiTDiff introduces auxiliary physical-consistency losses and a hand-/body-specific forward-kinematics design for better fine-grained motion modeling, while leveraging BiMamba for memory-efficient long-sequence temporal modeling and a Transformer for cross-modal semantic alignment. In addition, BiTDiff supports training-free joint-level motion editing, enabling downstream human-AI interaction design. Extensive quantitative and qualitative experiments demonstrate that BiTDiff achieves state-of-the-art (SOTA) performance for 3D conducting motion generation on the CM-Data dataset. Code will be available upon acceptance.
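
The abstract does not say how training-free joint-level editing is implemented. One common way to get this behavior from a diffusion sampler is inpainting-style overwriting: after every denoising step, the channels the user has constrained are replaced with the user's target, and the model fills in the rest. The sketch below shows that idea only, under stated assumptions: `toy_denoise_step` is a hypothetical stand-in for a trained denoiser, `edit_joints` and its parameters are illustrative names, and a real sampler would typically re-noise the target to the current noise level before blending.

```python
def toy_denoise_step(motion, step):
    """Hypothetical stand-in for a trained denoiser: halves every value."""
    return [[0.5 * v for v in frame] for frame in motion]


def edit_joints(motion, target, edited_joints, num_steps=3, denoise=toy_denoise_step):
    """After each denoising step, overwrite the edited channels with the target."""
    for step in reversed(range(num_steps)):
        motion = denoise(motion, step)
        # Keep the model's output everywhere except the user-constrained channels.
        motion = [
            [target[t][j] if j in edited_joints else v for j, v in enumerate(frame)]
            for t, frame in enumerate(motion)
        ]
    return motion


noisy = [[8.0, 8.0, 8.0], [8.0, 8.0, 8.0]]   # 2 frames x 3 joint channels
target = [[1.0, 0.0, 0.0], [2.0, 0.0, 0.0]]  # only channel 0 is constrained
edited = edit_joints(noisy, target, edited_joints={0})
```

Because the constraint is applied only at sampling time, a scheme like this needs no retraining, which is consistent with the "training-free" claim, though the authors' actual mechanism may differ.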
Problem

Research questions and friction points this paper is trying to address.

3D conducting motion generation
fine-grained motion synthesis
long-sequence generation
data scarcity
motion quality and efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

BiMamba-Transformer
Diffusion Model
3D Conducting Motion Generation
Fine-Grained Motion Synthesis
Human-Kinematic Decomposition