Singing Timbre Popularity Assessment Based on Multimodal Large Foundation Model

๐Ÿ“… 2025-12-07
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Current automatic singing evaluation systems rely on reference audio tracks and reduce a performance to a single pitch/rhythm score, suppressing artistic expression and offering no diagnostic feedback. To address these limitations, we propose the first reference-free, multi-dimensional singing assessment framework. We introduce Sing-MD, an expert-annotated dataset covering breath control, timbre quality, emotional expression, and vocal technique; H-TPR, a Human-in-the-loop Tiered Perceptual Ranking benchmark; and VocalVerse, a hybrid architecture that mitigates scoring noise and the challenges of modeling long audio. Our approach pairs a lightweight acoustic encoder with a multimodal large language model to enable holistic feature representation and long-range temporal dependency analysis. Experiments show that our method supports full-song-level evaluation and significantly outperforms conventional metrics in multi-dimensional timbre popularity ranking, advancing automatic singing evaluation from discriminative scoring toward interpretable, descriptive assessment.

๐Ÿ“ Abstract
Automated singing assessment is crucial for education and entertainment. However, existing systems face two fundamental limitations: reliance on reference tracks, which stifles creative expression, and the simplification of complex performances into non-diagnostic scores based solely on pitch and rhythm. We advocate for a shift from discriminative to descriptive evaluation, creating a complete ecosystem for reference-free, multi-dimensional assessment. First, we introduce Sing-MD, a large-scale dataset annotated by experts across four dimensions: breath control, timbre quality, emotional expression, and vocal technique. Our analysis reveals significant annotation inconsistencies among experts, challenging the validity of traditional accuracy-based metrics. Second, addressing the memory limitations of Multimodal Large Language Models (MLLMs) in analyzing full-length songs, we propose VocalVerse. This efficient hybrid architecture leverages a lightweight acoustic encoder to model global performance features and long-term dependencies. Third, to address automated metric shortcomings, we establish the H-TPR (Human-in-the-loop Tiered Perceptual Ranking) benchmark, which evaluates a model's ability to generate perceptually valid rankings rather than predicting noisy ground-truth scores.
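The abstract notes that VocalVerse uses a lightweight acoustic encoder to sidestep MLLM memory limits on full-length songs. As a rough illustration of that idea (not the paper's actual architecture; shapes, pooling strategy, and names here are assumptions), frame-level acoustic features for an entire song can be pooled down to a short fixed-length token sequence before being handed to the language model:

```python
# Illustrative sketch, NOT the paper's VocalVerse code: compress a full song's
# frame-level acoustic features into a fixed number of summary tokens so they
# fit in an MLLM's context window. All shapes and parameters are assumptions.

import numpy as np

def compress_song_features(frame_feats, tokens_out=64):
    """Average-pool frame-level features (T, D) into (tokens_out, D)
    summary tokens by splitting the timeline into equal segments."""
    T, _ = frame_feats.shape
    bounds = np.linspace(0, T, tokens_out + 1, dtype=int)
    return np.stack([frame_feats[a:b].mean(axis=0)
                     for a, b in zip(bounds[:-1], bounds[1:])])

# A 4-minute song at 100 feature frames/sec: 24000 frames of 128-dim features
song = np.random.randn(24000, 128)
tokens = compress_song_features(song)
print(tokens.shape)  # (64, 128)
```

Mean pooling is the simplest choice; a learned encoder (as the paper describes) would replace it, but the memory argument is the same: the MLLM sees 64 tokens instead of 24,000 frames.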
Problem

Research questions and friction points this paper is trying to address.

Develops a reference-free singing assessment system
Addresses memory limitations in analyzing full songs
Establishes a human-in-the-loop perceptual ranking benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reference-free multimodal large foundation model
Lightweight acoustic encoder for long-term dependencies
Human-in-the-loop perceptual ranking benchmark
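The H-TPR benchmark evaluates whether a model produces perceptually valid rankings rather than matching noisy ground-truth scores. One way such a tiered metric could work (a hypothetical sketch, not the paper's published formulation) is to score only cross-tier pairs, treating within-tier order as annotation noise:

```python
# Hypothetical sketch of a tiered ranking-agreement metric in the spirit of
# H-TPR: humans place performances into ordered quality tiers, and the model
# is judged on whether its scores preserve the tier ordering. Within-tier
# pairs are skipped as noise. All names here are illustrative assumptions.

from itertools import combinations

def tiered_pairwise_agreement(model_scores, human_tiers):
    """Fraction of cross-tier pairs whose model-score order matches the
    human tier order (higher tier number = better performance)."""
    agree = total = 0
    for i, j in combinations(range(len(model_scores)), 2):
        if human_tiers[i] == human_tiers[j]:
            continue  # within-tier order is treated as noise
        total += 1
        if (human_tiers[i] - human_tiers[j]) * (model_scores[i] - model_scores[j]) > 0:
            agree += 1
    return agree / total if total else 0.0

scores = [0.9, 0.4, 0.7, 0.2]   # model-predicted quality scores
tiers  = [3, 2, 2, 1]           # expert tier labels (3 = top tier)
print(tiered_pairwise_agreement(scores, tiers))  # 1.0: all cross-tier pairs agree
```

Restricting the comparison to cross-tier pairs reflects the paper's observation that experts disagree significantly at fine granularity, so demanding exact score prediction would penalize models for noise in the labels.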
๐Ÿ”Ž Similar Papers
No similar papers found.
Authors
Zihao Wang — Zhejiang University, Hangzhou, China; Carnegie Mellon University, Pittsburgh, United States
Ruibin Yuan — HKUST (Artificial Intelligence, Music Generation, Music Information Retrieval, Computer Music)
Ziqi Geng — University of California, Berkeley, United States
Hengjia Li — Zhejiang University (image generation, video generation)
Xingwei Qu — University of Manchester, Manchester, United Kingdom
Xinyi Li — Zhejiang University, Hangzhou, China
Songye Chen — Mei KTV, Beijing, China
Haoying Fu — Mei KTV, Beijing, China
Roger B. Dannenberg — Professor of Computer Science, Carnegie Mellon University (Computer Music)
Kejun Zhang — Zhejiang University, Hangzhou, China; Innovation Center of Yangtze River Delta, Zhejiang University, Hangzhou, China