ELGAR: Expressive Cello Performance Motion Generation for Audio Rendition

📅 2025-05-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the challenge of synthesizing high-fidelity, whole-body cello-playing motion from music audio alone, presenting a diffusion-based motion synthesis framework tailored for string instruments. Methodologically, it (1) introduces Hand Interactive Contact Loss (HICL) and Bow Interactive Contact Loss (BICL) to explicitly model the physical interaction between performer and instrument; (2) proposes domain-specific evaluation metrics, namely finger-contact distance, bow-string distance, and bowing score, to quantify how faithfully the generated motion tracks the music; and (3) contributes SPD-GEN, a motion generation dataset collated and normalized from the MoCap dataset SPD. Experiments show the approach outperforms prior methods in motion realism, music-motion semantic alignment, and the modeling of rapid passages, providing a foundation for virtual performance animation, intelligent music education, and interactive digital arts.
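The bow-string distance metric mentioned above reduces, per frame, to the minimum distance between two 3D line segments (the bow hair and a string). A minimal geometric sketch of that computation, assuming both segments are non-degenerate; the function name and endpoint convention are illustrative, not the paper's implementation:

```python
import numpy as np

def segment_distance(p1, p2, q1, q2):
    """Minimum distance between 3D segments p1-p2 and q1-q2
    (e.g. bow hair vs. one string), via clamped closest-point params."""
    d1, d2, r = p2 - p1, q2 - q1, p1 - q1
    a, e = d1 @ d1, max(d2 @ d2, 1e-12)   # squared lengths (guard degenerate)
    b, c, f = d1 @ d2, d1 @ r, d2 @ r
    denom = a * e - b * b
    # Unclamped closest point on segment 1, clamped to [0, 1]
    s = np.clip((b * f - c * e) / denom, 0.0, 1.0) if denom > 1e-12 else 0.0
    t = (b * s + f) / e
    # If t left its segment, clamp it and recompute s
    if t < 0.0:
        t, s = 0.0, np.clip(-c / a, 0.0, 1.0)
    elif t > 1.0:
        t, s = 1.0, np.clip((b - c) / a, 0.0, 1.0)
    return np.linalg.norm((p1 + s * d1) - (q1 + t * d2))
```

Averaging this quantity over frames (and strings) would yield a per-sequence score of how closely the generated bow stays on the string.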

📝 Abstract
The art of instrument performance stands as a vivid manifestation of human creativity and emotion. Nonetheless, generating instrument performance motions is a highly challenging task, as it requires not only capturing intricate movements but also reconstructing the complex dynamics of the performer-instrument interaction. While existing works primarily focus on modeling partial body motions, we propose Expressive ceLlo performance motion Generation for Audio Rendition (ELGAR), a state-of-the-art diffusion-based framework for whole-body fine-grained instrument performance motion generation solely from audio. To emphasize the interactive nature of the instrument performance, we introduce Hand Interactive Contact Loss (HICL) and Bow Interactive Contact Loss (BICL), which effectively guarantee the authenticity of the interplay. Moreover, to better evaluate whether the generated motions align with the semantic context of the music audio, we design novel metrics specifically for string instrument performance motion generation, including finger-contact distance, bow-string distance, and bowing score. Extensive evaluations and ablation studies are conducted to validate the efficacy of the proposed methods. In addition, we put forward a motion generation dataset SPD-GEN, collated and normalized from the MoCap dataset SPD. As demonstrated, ELGAR has shown great potential in generating instrument performance motions with complicated and fast interactions, which will promote further development in areas such as animation, music education, interactive art creation, etc.
Problem

Research questions and friction points this paper is trying to address.

Generating whole-body cello performance motions from audio
Ensuring authenticity of performer-instrument interaction dynamics
Evaluating motion-music alignment for string instruments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion-based framework for whole-body motion generation
Hand and Bow Interactive Contact Losses (HICL, BICL)
Novel metrics for string instrument performance evaluation
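The interactive contact losses listed above can be pictured as distance penalties that apply only on frames where performer-instrument contact should occur. A simplified sketch of an HICL-style term, where the array shapes, the contact mask, and the function name are assumptions for illustration rather than the paper's formulation:

```python
import numpy as np

def hand_contact_loss(fingertips, string_points, contact_mask):
    """Mean fingertip-to-string distance over annotated contact frames.

    fingertips, string_points: (T, 5, 3) arrays of 3D positions
    contact_mask: (T, 5) array in {0, 1}, 1 where contact is active
    """
    # Per-frame, per-finger Euclidean distance to the target contact point
    d = np.linalg.norm(fingertips - string_points, axis=-1)  # (T, 5)
    active = contact_mask.sum()
    # Penalize distance only on active contacts; guard the empty-mask case
    return (d * contact_mask).sum() / max(active, 1)
```

A BICL-style term would follow the same pattern with the bow-to-string distance from the metric above in place of the fingertip distance.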
Zhiping Qiu
Central Conservatory of Music, China and Tsinghua University, China
Yitong Jin
Central Conservatory of Music, China and Tsinghua University, China
Yuan Wang
Central Conservatory of Music, China
Yi Shi
Central Conservatory of Music, China and Tsinghua University, China
Chongwu Wang
Central Conservatory of Music, China
Chao Tan
Professor, Tianjin University
multiphase flow measurement, process tomography, multisensor fusion
Xiaobing Li
University of Wisconsin-Madison; SUNY College of Optometry
saccade, attention, decision making
Feng Yu
University of Exeter
Efficient AI, Continual Learning, Federated Learning, Foundation Model
Tao Yu
Tsinghua University, China
Qionghai Dai
Tsinghua University, China