🤖 AI Summary
Existing MOS prediction methods predominantly rely on single-granularity pooling, failing to jointly capture both the global structural properties and the locally salient features of speech quality. To address this limitation, we propose the Dual-Resolution Attentive Statistics Pooling (DRASP) framework, which, for the first time, integrates coarse-grained global statistical aggregation with fine-grained, attention-driven modeling of critical speech segments. This dual-resolution design enables more comprehensive and robust representation learning for speech quality assessment. DRASP is modular and plug-and-play, compatible with diverse audio front-ends and mainstream MOS prediction architectures. Extensive experiments across multiple standard datasets demonstrate that DRASP achieves a 10.39% relative improvement in system-level Spearman's rank correlation coefficient (SRCC) over average pooling, consistently outperforming existing baselines. Moreover, it exhibits strong generalization across different models and unseen datasets.
📝 Abstract
A pooling mechanism is essential for mean opinion score (MOS) prediction, facilitating the transformation of variable-length audio features into a concise fixed-size representation that effectively encodes speech quality. Existing pooling methods typically operate at a single granularity, concentrating either on a comprehensive global perspective or a detailed frame-level analysis, which may overlook complementary perceptual insights. To address this limitation, we introduce the Dual-Resolution Attentive Statistics Pooling (DRASP) framework. DRASP integrates both coarse-grained, global statistical summaries and fine-grained, attentive analyses of perceptually significant segments. This dual-view architecture empowers our model to formulate a more thorough and robust representation, capturing both the overarching structural context and salient local details concurrently. Extensive experiments validate the effectiveness and strong generalization ability of the proposed framework. It consistently outperforms various baseline methods across diverse datasets (MusicEval and AES-Natural), MOS prediction backbones (including a CLAP-based model and AudioBox-Aesthetics), and different audio generation systems, achieving a relative improvement of 10.39% in system-level Spearman's rank correlation coefficient (SRCC) over the widely used average pooling approach.
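The abstract above describes the core idea in prose: a coarse branch that summarizes all frames with global statistics, and a fine branch that uses attention to weight perceptually salient frames, with both views concatenated into one fixed-size vector. The following is a minimal NumPy sketch of that dual-resolution idea only; the paper's actual module, attention parameterization, and feature dimensions are not given here, so `drasp_pool`, the single-vector attention weights, and all shapes are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def drasp_pool(frames, attn_w):
    """Pool variable-length frame features (T, D) into a fixed (4*D,) vector.

    Coarse branch: global mean/std statistics over all frames.
    Fine branch: attention-weighted mean/std emphasizing salient frames.
    (Hypothetical sketch; not the paper's exact module.)
    """
    # Coarse-grained global statistics over the time axis.
    g_mean = frames.mean(axis=0)
    g_std = frames.std(axis=0)

    # Fine-grained attentive statistics: one scalar score per frame,
    # normalized with softmax, then weighted first/second moments.
    scores = softmax(frames @ attn_w)            # (T,)
    a_mean = scores @ frames                     # (D,)
    a_var = scores @ (frames - a_mean) ** 2      # weighted variance
    a_std = np.sqrt(np.maximum(a_var, 1e-8))

    # Concatenate both resolutions into one fixed-size representation.
    return np.concatenate([g_mean, g_std, a_mean, a_std])

# Usage: utterances of different lengths map to the same output size,
# which is what lets a downstream MOS regressor consume them.
rng = np.random.default_rng(0)
w = rng.standard_normal(8)   # stand-in for learned attention weights
v1 = drasp_pool(rng.standard_normal((120, 8)), w)
v2 = drasp_pool(rng.standard_normal((45, 8)), w)
assert v1.shape == v2.shape == (32,)
```

In a trained model the attention scores would come from learned parameters rather than a random vector; the point of the sketch is only that the two pooled views are computed in parallel and concatenated, so either granularity can compensate for what the other misses.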