🤖 AI Summary
This study addresses the SingMOS prediction task for synthetic singing voice quality assessment. We systematically demonstrate, for the first time, the superiority of speaker pre-trained models—specifically x-vector and ECAPA—over other speech- and music-domain pre-trained models. To fully exploit complementary information from heterogeneous representations, we propose BATCH, a novel multi-model fusion framework grounded in the Bhattacharyya distance, enabling interpretable, cross-modal feature weighting. Experiments across multiple public SingMOS datasets show that BATCH consistently outperforms all single-model baselines and state-of-the-art fusion methods, establishing new SOTA performance. Our key contributions are: (1) empirical validation that speaker verification pre-training effectively transfers to singing quality modeling; and (2) a lightweight, interpretable, and high-performing fusion paradigm driven by the Bhattacharyya distance.
📝 Abstract
In this study, we focus on Singing Voice Mean Opinion Score (SingMOS) prediction. Previous research has shown the performance benefits of using state-of-the-art (SOTA) pre-trained models (PTMs). However, it has not explored speaker recognition speech PTMs (SPTMs) such as x-vector and ECAPA, which we hypothesize will be the most effective for SingMOS prediction. We believe their speaker recognition pre-training equips them to capture fine-grained vocal features (e.g., pitch, tone, intensity) from synthesized singing voices far better than other PTMs. Our experiments with SOTA PTMs, including SPTMs and music PTMs, validate this hypothesis. Additionally, we introduce BATCH, a novel fusion framework that uses the Bhattacharyya distance to fuse PTM representations. Through BATCH with the fusion of speaker recognition SPTMs, we report the best performance compared to all individual PTMs and baseline fusion techniques, setting a new SOTA.
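To make the fusion idea concrete, here is a minimal sketch of Bhattacharyya-distance-driven weighting of heterogeneous PTM embeddings. It is an illustration under assumptions, not the paper's BATCH implementation: we assume diagonal-covariance Gaussians fitted to each PTM's embedding stream, and the `reference` anchor stream plus the softmax-style weighting are hypothetical choices for this sketch.

```python
import numpy as np

def bhattacharyya_distance(mu1, var1, mu2, var2):
    """Bhattacharyya distance between two diagonal-covariance Gaussians.
    D_B = 1/8 (mu1-mu2)^T S^-1 (mu1-mu2) + 1/2 ln(det S / sqrt(det S1 det S2)),
    with S = (S1 + S2) / 2, specialized here to diagonal covariances."""
    var = (var1 + var2) / 2.0
    term1 = 0.125 * np.sum((mu1 - mu2) ** 2 / var)
    term2 = 0.5 * np.sum(np.log(var / np.sqrt(var1 * var2)))
    return term1 + term2

def fusion_weights(embeddings, reference):
    """Weight each PTM embedding stream by its Bhattacharyya distance to a
    reference stream (hypothetical anchor), normalized so weights sum to 1.
    embeddings: list of (n_samples, dim) arrays, one per PTM.
    reference:  (n_samples, dim) array."""
    mu_r, var_r = reference.mean(axis=0), reference.var(axis=0) + 1e-6
    dists = np.array([
        bhattacharyya_distance(e.mean(axis=0), e.var(axis=0) + 1e-6, mu_r, var_r)
        for e in embeddings
    ])
    w = np.exp(-dists)  # smaller distance -> larger weight
    return w / w.sum()
```

A stream whose feature distribution is closer to the reference receives a larger weight, giving an interpretable, distribution-level score per PTM before the weighted embeddings are concatenated or summed downstream.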