🤖 AI Summary
This study addresses the SingMOS prediction task for synthetic singing voice quality assessment. We systematically demonstrate, for the first time, the superiority of speaker pre-trained models—specifically x-vector and ECAPA—over other speech- and music-domain pre-trained models. To fully exploit complementary information from heterogeneous representations, we propose BATCH, a novel multi-model fusion framework grounded in the Bhattacharyya distance, enabling interpretable, cross-modal feature weighting. Experiments across multiple public SingMOS datasets show that BATCH consistently outperforms all single-model baselines and state-of-the-art fusion methods, establishing new SOTA performance. Our key contributions are: (1) empirical validation that speaker verification pre-training effectively transfers to singing quality modeling; and (2) a lightweight, interpretable, and high-performing fusion paradigm driven by the Bhattacharyya distance.
📝 Abstract
In this study, we focus on Singing Voice Mean Opinion Score (SingMOS) prediction. Previous research has shown the performance benefits of using state-of-the-art (SOTA) pre-trained models (PTMs). However, it has not explored speaker recognition speech PTMs (SPTMs) such as x-vector and ECAPA, which we hypothesize will be the most effective for SingMOS prediction. We believe their speaker recognition pre-training equips them to capture fine-grained vocal features (e.g., pitch, tone, intensity) from synthesized singing voices far better than other PTMs. Our experiments with SOTA PTMs, including SPTMs and music PTMs, validate this hypothesis. Additionally, we introduce BATCH, a novel fusion framework that uses the Bhattacharyya distance to fuse PTM representations. Through BATCH with the fusion of speaker recognition SPTMs, we report the best performance compared to all individual PTMs and baseline fusion techniques, setting a new SOTA.
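To make the fusion idea concrete, here is a minimal sketch of Bhattacharyya-distance-driven weighting of heterogeneous PTM embeddings. It is an illustration under assumptions, not the paper's BATCH implementation: we assume diagonal-covariance Gaussians fitted to each PTM's embedding stream, and the `reference` anchor stream plus the softmax-style weighting are hypothetical choices for this sketch.

```python
import numpy as np

def bhattacharyya_distance(mu1, var1, mu2, var2):
    """Bhattacharyya distance between two diagonal-covariance Gaussians.
    D_B = 1/8 (mu1-mu2)^T S^-1 (mu1-mu2) + 1/2 ln(det S / sqrt(det S1 det S2)),
    with S = (S1 + S2) / 2, specialized here to diagonal covariances."""
    var = (var1 + var2) / 2.0
    term1 = 0.125 * np.sum((mu1 - mu2) ** 2 / var)
    term2 = 0.5 * np.sum(np.log(var / np.sqrt(var1 * var2)))
    return term1 + term2

def fusion_weights(embeddings, reference):
    """Weight each PTM embedding stream by its Bhattacharyya distance to a
    reference stream (hypothetical anchor), normalized so weights sum to 1.
    embeddings: list of (n_samples, dim) arrays, one per PTM.
    reference:  (n_samples, dim) array."""
    mu_r, var_r = reference.mean(axis=0), reference.var(axis=0) + 1e-6
    dists = np.array([
        bhattacharyya_distance(e.mean(axis=0), e.var(axis=0) + 1e-6, mu_r, var_r)
        for e in embeddings
    ])
    w = np.exp(-dists)  # smaller distance -> larger weight
    return w / w.sum()
```

A stream whose feature distribution is closer to the reference receives a larger weight, giving an interpretable, distribution-level score per PTM before the weighted embeddings are concatenated or summed downstream.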