🤖 AI Summary
To address audio-visual desynchronization in user-generated content (UGC) dance videos, caused by misalignment between musical rhythm and human motion, the paper proposes a harmony-aware generative adversarial framework for synthesizing 3D dance motions with high rhythmic fidelity. Methodologically, it introduces a saliency-weighted harmony evaluation strategy inspired by human visual attention, combining cross-modal beat detection, interval-driven beat alignment, and saliency-based beat weighting. The generator uses an encoder-decoder architecture with a depth-lifting design and is trained adversarially on segments categorized by musical meter, with the harmony evaluation acting as a weakly supervised perceptual constraint. Interpretable harmony modeling is thereby embedded directly into the generation process, a novel aspect of this work. Trained on limited UGC data, the method achieves significant improvements over state-of-the-art approaches in both Beat Consistency Score and subjective human evaluation, yielding natural, rhythmically precise, and audio-visually coherent motion sequences.
📝 Abstract
With the popularity of video-based user-generated content (UGC) on social media, harmony, as dictated by human perceptual principles, is critical in assessing the rhythmic consistency of audio-visual UGC for better user engagement. In this work, we propose a novel harmony-aware GAN framework, following a specifically designed harmony evaluation strategy, to enhance rhythmic synchronization in automatic music-to-motion synthesis using a UGC dance dataset. This harmony strategy utilizes refined cross-modal beat detection to capture closely correlated audio and visual rhythms in an audio-visual pair. To mimic the human attention mechanism, we introduce saliency-based beat weighting and interval-driven beat alignment, which ensure accurate harmony score estimation consistent with human perception. Building on this strategy, our model, employing efficient encoder-decoder and depth-lifting designs, is adversarially trained on categorized musical meter segments to generate realistic and rhythmic 3D human motions. We further incorporate our harmony evaluation strategy as a weakly supervised perceptual constraint to flexibly guide the synchronization of audio-visual rhythms during the generation process. Experimental results show that our proposed model significantly outperforms other leading music-to-motion methods in rhythmic harmony, both quantitatively and qualitatively, even with limited UGC training data. Live samples can be watched at: https://youtu.be/tWwz7yq4aUs
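
To make the idea of a saliency-weighted, alignment-based harmony score more concrete, the following is a minimal toy sketch. It is not the paper's formula: the function name `harmony_score`, the fixed `tolerance` window, and the normalization by total saliency are all assumptions made for illustration, and the beat times and saliency weights are assumed to come from separate (unspecified) audio and visual beat detectors.

```python
from bisect import bisect_left


def harmony_score(audio_beats, visual_beats, saliency, tolerance=0.1):
    """Toy saliency-weighted beat alignment score in [0, 1] (illustrative only).

    audio_beats:  sorted list of audio beat times in seconds
    visual_beats: list of visual (motion) beat times in seconds
    saliency:     one non-negative weight per visual beat (hypothetical input)
    tolerance:    assumed maximum offset, in seconds, for a beat to count as aligned
    """
    if len(visual_beats) != len(saliency):
        raise ValueError("need exactly one saliency weight per visual beat")

    total = sum(saliency)
    if total == 0 or not audio_beats:
        return 0.0

    matched = 0.0
    for t, w in zip(visual_beats, saliency):
        # Locate the audio beats immediately before and after time t.
        i = bisect_left(audio_beats, t)
        candidates = audio_beats[max(i - 1, 0):i + 1]
        nearest = min(abs(t - a) for a in candidates)
        # A visual beat contributes its saliency only if it falls within
        # the tolerance window of some audio beat.
        if nearest <= tolerance:
            matched += w

    # Fraction of total saliency carried by aligned visual beats.
    return matched / total


score = harmony_score(
    audio_beats=[0.0, 0.5, 1.0, 1.5],
    visual_beats=[0.02, 0.48, 1.3],
    saliency=[1.0, 2.0, 1.0],
)
print(score)
```

In this sketch, more salient visual beats count more toward the score, so a misaligned but barely noticeable movement is penalized less than a misaligned prominent one. A score of this kind could in principle serve as a perceptual constraint during training, although how the paper actually defines and applies its score is not specified in the abstract.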