GACA-DiT: Diffusion-based Dance-to-Music Generation with Genre-Adaptive Rhythm and Context-Aware Alignment

📅 2025-10-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing dance-to-music (D2M) generation methods rely on coarse-grained rhythmic representations, such as global motion features or binarized joint-level rhythm, which discard fine-grained motion cues and yield inaccurate rhythmic alignment; feature downsampling further introduces temporal misalignment between dance and music. To address these issues, we propose a genre-adaptive, fine-grained rhythm modeling and context-aware alignment framework. A genre-adaptive rhythm extraction module fuses multi-scale wavelet analysis with spatial phase histograms to capture nuanced, genre-specific rhythmic details, and a context-aware temporal alignment module uses learnable contextual queries for precise frame-level synchronization. The entire architecture is built upon a diffusion-based Transformer. Evaluated on the AIST++ and TikTok datasets, the approach significantly outperforms state-of-the-art methods, with consistent improvements in Fréchet Audio Distance (FAD), KL divergence, and human perceptual ratings.

📝 Abstract
Dance-to-music (D2M) generation aims to automatically compose music that is rhythmically and temporally aligned with dance movements. Existing methods typically rely on coarse rhythm embeddings, such as global motion features or binarized joint-based rhythm values, which discard fine-grained motion cues and result in weak rhythmic alignment. Moreover, temporal mismatches introduced by feature downsampling further hinder precise synchronization between dance and music. To address these problems, we propose GACA-DiT, a diffusion transformer-based framework with two novel modules for rhythmically consistent and temporally aligned music generation. First, a genre-adaptive rhythm extraction module combines multi-scale temporal wavelet analysis and spatial phase histograms with adaptive joint weighting to capture fine-grained, genre-specific rhythm patterns. Second, a context-aware temporal alignment module resolves temporal mismatches using learnable context queries to align music latents with relevant dance rhythm features. Extensive experiments on the AIST++ and TikTok datasets demonstrate that GACA-DiT outperforms state-of-the-art methods in both objective metrics and human evaluation. Project page: https://beria-moon.github.io/GACA-DiT/.
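The paper does not include implementation details here, but the multi-scale temporal wavelet analysis mentioned in the abstract can be illustrated with a simple Haar-wavelet decomposition of a one-dimensional joint-motion energy signal. This is a minimal sketch under assumed details; `haar_multiscale` and the toy signal are hypothetical, not the authors' code.

```python
import numpy as np

def haar_multiscale(signal, levels=3):
    """Multi-scale Haar wavelet decomposition of a 1-D motion-energy signal.

    Returns one array of detail coefficients per scale; coarser scales
    (later entries) capture slower rhythmic structure.
    Hypothetical illustration, not the paper's actual module."""
    approx = np.asarray(signal, dtype=float)
    details = []
    for _ in range(levels):
        if len(approx) < 2:
            break
        if len(approx) % 2:                      # pad to even length
            approx = np.append(approx, approx[-1])
        pairs = approx.reshape(-1, 2)
        details.append((pairs[:, 0] - pairs[:, 1]) / np.sqrt(2))  # detail band
        approx = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2)         # approximation
    return details

# Toy "motion energy": one beat every 4 frames, 64 frames total
motion = np.tile([1.0, 0.0, 0.0, 0.0], 16)
coeffs = haar_multiscale(motion, levels=3)
print([len(c) for c in coeffs])  # halved length at each scale: [32, 16, 8]
```

In the paper, such per-scale coefficients would be combined with spatial phase histograms and adaptive joint weighting; here only the temporal multi-scale idea is shown.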
Problem

Research questions and friction points this paper is trying to address.

Generating music rhythmically aligned with dance movements
Overcoming coarse rhythm embeddings causing weak rhythmic alignment
Resolving temporal mismatches from feature downsampling in synchronization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Genre-adaptive rhythm extraction with multi-scale wavelet analysis
Context-aware alignment using learnable queries for synchronization
Diffusion transformer framework for rhythmically consistent music generation
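The context-aware alignment idea above — learnable queries attending over frame-level dance rhythm features so that music latents need not share the dance stream's frame rate — can be sketched as plain cross-attention. All names and shapes below are assumptions for illustration; the random `queries` stand in for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def context_query_align(rhythm_feats, n_queries=8, d=16):
    """Hypothetical sketch of context-query alignment: a small set of
    (normally learned) query vectors attends over frame-level rhythm
    features, yielding a fixed-length summary that can condition music
    latents regardless of the dance frame rate."""
    queries = rng.normal(size=(n_queries, d))        # stand-in for learned params
    scores = queries @ rhythm_feats.T / np.sqrt(d)   # (n_queries, T)
    attn = softmax(scores, axis=-1)                  # rows sum to 1 over frames
    return attn @ rhythm_feats                       # (n_queries, d)

feats = rng.normal(size=(120, 16))   # 120 dance frames, 16-dim rhythm features
aligned = context_query_align(feats)
print(aligned.shape)  # (8, 16)
```

The design point is that the output length is fixed by the number of queries, not by the dance sequence length, which is how learnable queries can bridge streams with mismatched temporal resolutions.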
Jinting Wang
Central University of Finance and Economics
Operations Management · Service Science · Queueing Theory · Reliability · Stochastic Modeling
Chenxing Li
Tencent AI Lab
Li Liu
The Hong Kong University of Science and Technology (Guangzhou)