MG-Former: A Transformer-Based Framework for Music-Driven 3D Conducting Gesture Generation

📅 2026-05-01
📈 Citations: 0
Influential: 0

🤖 AI Summary
This study addresses the cross-modal task of generating expressive 3D conducting gestures from music, which requires long-range structural coherence, beat synchronization, and motion plausibility. The authors propose TransConductor, a Transformer-based framework that pairs a Trans-Temporal Music Encoder with an autoregressive Trans-Temporal Conducting Gesture Decoder to generate SMPL-parameterized motion from acoustic features and an initial pose. Key contributions include ConductorMotion, a dataset of professional conducting gestures built by recovering SMPL parameters from conducting videos; an alignment loss intended to improve artistic consistency between music and gesture; and a retrieval-based evaluation model that operates in a shared embedding space. Experiments show that the method outperforms dance-generation and conducting-generation baselines, and ablations support the Transformer backbone and the alignment loss.
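To make the generation loop concrete, here is a minimal Python sketch of an autoregressive rollout from audio features and an initial pose. It is not the authors' implementation: `toy_decoder_step`, the 72-value pose vector (24 joints times 3 axis-angle values, a common SMPL layout), and the 35-value audio descriptor are assumptions for illustration, and a single linear map stands in for the Transformer encoder and decoder.

```python
import numpy as np

POSE_DIM = 72   # assumption: 24 SMPL joints x 3 axis-angle values
AUDIO_DIM = 35  # assumption: size of the per-frame acoustic descriptor


def toy_decoder_step(music_context, past_poses, weights):
    """Hypothetical stand-in for one decoder step: predicts the next pose."""
    w_music, w_pose = weights
    # Index of the frame being predicted equals the number of poses so far.
    t = past_poses.shape[0]
    # Uses only the aligned music frame and the latest pose; a Transformer
    # decoder would instead attend over the whole music and pose history.
    return np.tanh(music_context[t] @ w_music + past_poses[-1] @ w_pose)


def generate(music_features, initial_pose, weights):
    """Autoregressive rollout: each predicted pose is fed back as input."""
    num_frames = music_features.shape[0]
    poses = [initial_pose]
    for _ in range(num_frames - 1):
        next_pose = toy_decoder_step(music_features, np.stack(poses), weights)
        poses.append(next_pose)
    return np.stack(poses)  # shape: (num_frames, POSE_DIM)


rng = np.random.default_rng(0)
music = rng.normal(size=(16, AUDIO_DIM))  # placeholder audio features
init = np.zeros(POSE_DIM)                 # placeholder initial pose
weights = (
    rng.normal(scale=0.1, size=(AUDIO_DIM, POSE_DIM)),
    rng.normal(scale=0.1, size=(POSE_DIM, POSE_DIM)),
)
motion = generate(music, init, weights)
```

The point of the sketch is the feedback structure: the first frame is the given initial pose, and every later frame depends on the music and on previously generated poses, which is why errors can accumulate over long sequences.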
📝 Abstract
Generating expressive conducting gestures from music is a challenging cross-modal motion synthesis problem: the output must follow long-range musical structure, preserve beat-level synchronization, and remain plausible as a fine-grained 3D human performance. Existing conducting-motion studies are often limited by sparse pose representations, small-scale data, or evaluation protocols that do not directly measure whether music and gesture are mutually aligned. This paper presents TransConductor, a Transformer-based framework for music-driven conducting gesture generation. We introduce ConductorMotion, a SMPL-parameter data construction pipeline that recovers detailed body motion from conducting videos and forms a dataset targeted at professional conducting gestures. Given acoustic descriptors extracted from audio and an initial pose, TransConductor uses a Trans-Temporal Music Encoder and a Trans-Temporal Conducting Gesture Decoder to autoregressively predict SMPL pose parameters. To better assess artistic correspondence, we further build a retrieval-based evaluation model that embeds music and gestures into a shared space and yields FID, modality distance, multi-modality distance, and diversity metrics. Experiments show that TransConductor outperforms dance-generation and conducting-generation baselines, while ablations verify the benefits of the Transformer backbone and the proposed alignment loss.
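As a rough illustration of how distribution-level metrics can be computed once gestures are embedded, the Python sketch below implements the standard Fréchet distance between two sets of embeddings and a simple pairwise diversity score. The embeddings here are random placeholders, the function names are hypothetical, and the diversity definition (mean distance between randomly drawn pairs) is a common convention that may differ from the one used in the paper. The modality and multi-modality distances are not shown because the abstract does not define them.

```python
import numpy as np


def frechet_distance(real_emb, gen_emb):
    """Frechet distance between Gaussians fitted to two embedding sets."""
    mu_r, mu_g = real_emb.mean(axis=0), gen_emb.mean(axis=0)
    cov_r = np.cov(real_emb, rowvar=False)
    cov_g = np.cov(gen_emb, rowvar=False)
    # Trace of sqrt(cov_r @ cov_g) equals the sum of square roots of the
    # product's eigenvalues, which are real and non-negative in theory;
    # clipping guards against small negative values from rounding.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_r - mu_g) ** 2)
    return float(mean_term + np.trace(cov_r) + np.trace(cov_g) - 2.0 * tr_sqrt)


def diversity(gen_emb, num_pairs, rng):
    """Mean Euclidean distance between randomly drawn pairs of embeddings."""
    i = rng.integers(0, len(gen_emb), size=num_pairs)
    j = rng.integers(0, len(gen_emb), size=num_pairs)
    return float(np.linalg.norm(gen_emb[i] - gen_emb[j], axis=1).mean())


rng = np.random.default_rng(0)
real = rng.normal(size=(200, 8))  # placeholder embeddings of real gestures
fd_same = frechet_distance(real, real)
fd_shifted = frechet_distance(real, real + 3.0)
div = diversity(real, num_pairs=100, rng=rng)
```

Comparing a set with itself gives a distance near zero, while shifting every embedding by a constant changes only the mean term, so the distance grows with the squared shift.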
Problem

Research questions and friction points this paper is trying to address.

music-driven gesture generation
3D conducting motion
cross-modal synthesis
motion-music alignment
expressive gesture
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer-based generation
SMPL-parameter motion
music-gesture alignment
cross-modal synthesis
conductor gesture dataset