Ges-QA: A Multidimensional Quality Assessment Dataset for Audio-to-3D Gesture Generation

📅 2025-08-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing audio-to-3D-gesture (A2G) evaluation metrics—such as Fréchet Gesture Distance and beat alignment—exhibit significant misalignment with human perceptual preferences. To address this, we introduce the first human-preference-oriented, multidimensional quality assessment dataset (1,400 samples), featuring fine-grained annotations across three dimensions: gesture quality, audio–gesture temporal alignment, and emotion congruence. We propose a triple-branch multimodal Transformer that separately encodes 3D skeletal motion, audio, and video modalities to produce interpretable, dimension-specific quality scores. The model constrains latent representations via Fréchet distance regularization and is trained end-to-end using joint optimization over dimensional regression losses and binary preference classification. Experiments demonstrate state-of-the-art performance on our benchmark dataset; ablation studies validate the efficacy of each modality branch and confirm model robustness under varying input conditions.
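The triple-branch design described above can be illustrated with a minimal numpy sketch. Everything here is an assumption for illustration: the feature dimensions, the linear-projection-plus-pooling stand-in for each Transformer branch, and the head shapes are hypothetical, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W):
    # A linear projection + temporal mean pooling stands in for a full
    # Transformer branch; the real model uses per-modality attention encoders.
    return np.tanh(x @ W).mean(axis=0)

# Hypothetical sequence length and per-modality feature dims (not from the paper)
T, d_skel, d_audio, d_video, d = 32, 75, 128, 512, 64
W_s = rng.normal(0, 0.1, (d_skel, d))
W_a = rng.normal(0, 0.1, (d_audio, d))
W_v = rng.normal(0, 0.1, (d_video, d))

skel  = rng.normal(size=(T, d_skel))    # 3D skeletal motion features
audio = rng.normal(size=(T, d_audio))   # audio features
video = rng.normal(size=(T, d_video))   # rendered-video features

# Fuse the three branch embeddings by concatenation
fused = np.concatenate([encode(skel, W_s), encode(audio, W_a), encode(video, W_v)])

# Dimension-specific regression heads (gesture quality, temporal alignment)
# plus a binary emotion-congruence head, mirroring the three annotation axes.
W_reg = rng.normal(0, 0.1, (3 * d, 2))
w_cls = rng.normal(0, 0.1, (3 * d,))

scores = fused @ W_reg                        # [quality, alignment] scores
p_emotion = 1 / (1 + np.exp(-fused @ w_cls))  # P(gesture matches audio emotion)
```

In training, the paper jointly optimizes regression losses on the dimensional scores and a binary classification loss on the emotion label; the sketch only shows the forward pass.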

📝 Abstract
The Audio-to-3D-Gesture (A2G) task has enormous potential for applications in virtual reality, computer graphics, and related fields. However, current evaluation metrics, such as Fréchet Gesture Distance or Beat Constancy, fail to reflect human preference for the generated 3D gestures. To cope with this problem, exploring human preference and establishing an objective quality assessment metric for AI-generated 3D human gestures is becoming increasingly significant. In this paper, we introduce the Ges-QA dataset, which includes 1,400 samples with multidimensional scores for gesture quality and audio-gesture consistency. Moreover, we collect binary classification labels to determine whether the generated gestures match the emotions of the audio. Equipped with our Ges-QA dataset, we propose a multimodal transformer-based neural network with three branches for the video, audio, and 3D skeleton modalities, which can score A2G contents in multiple dimensions. Comparative experimental results and ablation studies demonstrate that our model, Ges-QAer, yields state-of-the-art performance on our dataset.
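The Fréchet-distance family of metrics that the abstract critiques compares feature statistics of real and generated gesture sets. As a minimal sketch, assuming Gaussian features with diagonal covariance (the full metric uses complete covariance matrices, and the feature extractor here is replaced by synthetic data):

```python
import numpy as np

def frechet_distance_diag(mu1, var1, mu2, var2):
    # Fréchet distance between two Gaussians with diagonal covariances:
    # d^2 = ||mu1 - mu2||^2 + sum(var1 + var2 - 2*sqrt(var1 * var2))
    return np.sum((mu1 - mu2) ** 2) + np.sum(var1 + var2 - 2.0 * np.sqrt(var1 * var2))

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(1000, 8))  # stand-in for real-gesture features
gen  = rng.normal(0.5, 1.2, size=(1000, 8))  # stand-in for generated features

fgd = frechet_distance_diag(real.mean(0), real.var(0), gen.mean(0), gen.var(0))
```

A lower value means the generated feature distribution is closer to the real one; the paper's point is that such distributional distances can disagree with human judgments, which motivates the learned, preference-aligned scorer.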
Problem

Research questions and friction points this paper is trying to address.

Evaluating human preference in AI-generated 3D gestures
Developing objective quality metrics for audio-to-gesture generation
Assessing audio-gesture consistency and emotional alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal transformer network with three branches
Integration of video, audio, and 3D skeleton modalities
Multidimensional scoring for gesture quality assessment
Zhilin Gao
Shanghai Jiao Tong University, Shanghai, China
Yunhao Li
Shanghai Jiao Tong University, Shanghai, China
Sijing Wu
Shanghai Jiao Tong University, Shanghai, China
Yuqin Cao
Shanghai Jiao Tong University
Huiyu Duan
Shanghai Jiao Tong University
Multimedia Signal Processing
Guangtao Zhai
Professor, IEEE Fellow, Shanghai Jiao Tong University
Multimedia Signal Processing, Visual Quality Assessment, QoE, AI Evaluation, Displays