🤖 AI Summary
This work proposes GatherMOS, a non-intrusive speech quality assessment method that uses a large language model (LLM) as a meta-evaluator in data-scarce scenarios. Through zero-shot or few-shot in-context learning, the LLM fuses lightweight acoustic descriptors with pseudo-labels generated by existing models such as DNSMOS and VQScore to predict perceptual Mean Opinion Scores (MOS). The key idea is to let the LLM aggregate heterogeneous quality signals, with the pseudo-labels serving as guidance, which improves assessment under low-resource conditions. Experiments on the VoiceBank-DEMAND dataset show that the approach outperforms DNSMOS, VQScore, and naive score averaging, as well as learning-based models such as CNN-BLSTM and MOS-SSL when those models are trained with limited labeled data, supporting LLM-driven multi-source signal fusion for speech quality evaluation.
📝 Abstract
In this paper, we introduce GatherMOS, a novel framework that leverages large language models (LLMs) as meta-evaluators to aggregate diverse signals into quality predictions. GatherMOS integrates lightweight acoustic descriptors with pseudo-labels from DNSMOS and VQScore, enabling the LLM to reason over heterogeneous inputs and infer perceptual mean opinion scores (MOS). We further explore both zero-shot and few-shot in-context learning setups, showing that zero-shot GatherMOS maintains stable performance across diverse conditions, while few-shot guidance yields large gains when support samples match the test conditions. Experiments on the VoiceBank-DEMAND dataset demonstrate that GatherMOS consistently outperforms DNSMOS, VQScore, and naive score averaging, and even learning-based models such as CNN-BLSTM and MOS-SSL when those models are trained under limited labeled-data conditions. These results highlight the potential of LLM-based aggregation as a practical strategy for non-intrusive speech quality evaluation.
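To make the aggregation idea concrete, the sketch below shows one way the inputs described in the abstract could be serialized into a zero-shot or few-shot prompt, alongside the naive score-averaging baseline. It is an illustration under stated assumptions, not the authors' implementation: the function names, the prompt wording, the descriptor name `rms_db`, and the premise that both pseudo-labels have already been mapped to a common 1 to 5 scale are all hypothetical, and the call to the LLM itself is omitted.

```python
from statistics import mean


def naive_average(pseudo_labels):
    """Baseline: plain mean of the pseudo-labels.

    Assumption: all scores were already mapped to the same 1-5 scale.
    """
    return mean(pseudo_labels.values())


def format_utterance(descriptors, pseudo_labels, mos=None):
    """Render one utterance as a single text line (hypothetical format)."""
    parts = [f"{name}={value:.3f}" for name, value in sorted(descriptors.items())]
    parts += [f"{name}={value:.2f}" for name, value in sorted(pseudo_labels.items())]
    line = ", ".join(parts)
    # Support examples carry a human MOS; the query line is left open.
    suffix = f" -> MOS: {mos:.2f}" if mos is not None else " -> MOS:"
    return line + suffix


def build_prompt(query, support=()):
    """Build a zero-shot (no support) or few-shot (with support) prompt."""
    lines = ["Predict the perceptual MOS (1 to 5) of each utterance from its features."]
    for example in support:
        lines.append(
            format_utterance(example["descriptors"], example["pseudo_labels"], example["mos"])
        )
    lines.append(format_utterance(query["descriptors"], query["pseudo_labels"]))
    return "\n".join(lines)
```

Passing an empty `support` sequence corresponds to the zero-shot setup; passing a handful of labeled utterances recorded under conditions similar to the query corresponds to the few-shot setup that the abstract reports as most beneficial.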