DRASP: A Dual-Resolution Attentive Statistics Pooling Framework for Automatic MOS Prediction

📅 2025-08-29
🤖 AI Summary
Existing MOS prediction methods predominantly rely on single-granularity pooling and therefore fail to capture both the global structural properties and the locally salient features of speech quality. To address this limitation, we propose the Dual-Resolution Attentive Statistics Pooling (DRASP) framework, which integrates coarse-grained global statistical summaries with fine-grained, attention-driven modeling of perceptually significant segments. This dual-resolution design yields a more comprehensive and robust representation for quality assessment. In experiments on MusicEval and AES-Natural, with MOS prediction backbones that include a CLAP-based model and AudioBox-Aesthetics, DRASP consistently outperforms baseline pooling methods and achieves a 10.39% relative improvement in system-level Spearman's rank correlation coefficient (SRCC) over average pooling. It also generalizes across backbones and audio generation systems.

📝 Abstract
A pooling mechanism is essential for mean opinion score (MOS) prediction, facilitating the transformation of variable-length audio features into a concise fixed-size representation that effectively encodes speech quality. Existing pooling methods typically operate at a single granularity, concentrating either on a comprehensive global perspective or a detailed frame-level analysis, which may overlook complementary perceptual insights. To address this limitation, we introduce the Dual-Resolution Attentive Statistics Pooling (DRASP) framework. DRASP integrates both coarse-grained, global statistical summaries and fine-grained, attentive analyses of perceptually significant segments. This dual-view architecture empowers our model to formulate a more thorough and robust representation, capturing both the overarching structural context and salient local details concurrently. Extensive experiments validate the effectiveness and strong generalization ability of the proposed framework. It consistently outperforms various baseline methods across diverse datasets (MusicEval and AES-Natural), MOS prediction backbones (including a CLAP-based model and AudioBox-Aesthetics), and different audio generation systems, achieving a relative improvement of 10.39% in system-level Spearman's rank correlation coefficient (SRCC) over the widely used average pooling approach.
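The headline metric can be made concrete with a small sketch. The code below is an illustration only, not the paper's evaluation script: it assumes the common convention that system-level SRCC is computed by averaging predicted and human scores per generation system before rank-correlating them, and every function name here is hypothetical.

```python
import math
from collections import defaultdict

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            result[order[k]] = shared
        start = end + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def system_level_srcc(system_ids, predicted, human):
    """Assumed convention: average scores per system, then correlate the means."""
    pred_by_sys, human_by_sys = defaultdict(list), defaultdict(list)
    for sid, p, h in zip(system_ids, predicted, human):
        pred_by_sys[sid].append(p)
        human_by_sys[sid].append(h)
    systems = sorted(pred_by_sys)
    pred_means = [sum(pred_by_sys[s]) / len(pred_by_sys[s]) for s in systems]
    human_means = [sum(human_by_sys[s]) / len(human_by_sys[s]) for s in systems]
    return spearman(pred_means, human_means)

def relative_improvement(baseline, improved):
    """Relative gain in percent, as opposed to an absolute difference."""
    return (improved - baseline) / baseline * 100.0
```

The distinction between relative and absolute gains matters when reading the result. With a purely hypothetical baseline SRCC of 0.50, a 10.39% relative improvement corresponds to roughly 0.552, whereas an absolute gain of 10.39 points would mean about 0.604.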
Problem

Research questions and friction points this paper is trying to address.

Predicting mean opinion scores from variable-length audio features
Overcoming limitations of single-granularity pooling methods
Integrating global statistics with local attentive analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-resolution pooling for audio features
Combines global statistics with attentive analysis
Captures structural context and local details
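The dual-resolution idea summarized above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the page does not give the exact formulation, so the segment length, the dot-product attention scorer `w`, and the choice to concatenate the two views are all hypothetical.

```python
import math

def mean_std(frames):
    """Per-dimension mean and standard deviation over a list of frame vectors."""
    n, d = len(frames), len(frames[0])
    mean = [sum(f[j] for f in frames) / n for j in range(d)]
    std = [math.sqrt(sum((f[j] - mean[j]) ** 2 for f in frames) / n) for j in range(d)]
    return mean, std

def softmax(scores):
    """Numerically stable softmax over a list of scalars."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attentive_stats(vectors, w):
    """Attention-weighted mean and std; scores are dot products with w (hypothetical scorer)."""
    weights = softmax([sum(wj * vj for wj, vj in zip(w, v)) for v in vectors])
    d = len(vectors[0])
    mean = [sum(a * v[j] for a, v in zip(weights, vectors)) for j in range(d)]
    var = [sum(a * (v[j] - mean[j]) ** 2 for a, v in zip(weights, vectors)) for j in range(d)]
    return mean, [math.sqrt(x) for x in var]

def dual_resolution_pool(frames, w, segment_len=4):
    """Map a variable-length frame sequence to a fixed-size vector of length 4 * d."""
    # Coarse view: unweighted statistics over the whole utterance.
    g_mean, g_std = mean_std(frames)
    # Fine view: summarize each segment by its mean, then attend over the segments.
    segments = [frames[i:i + segment_len] for i in range(0, len(frames), segment_len)]
    seg_means = [mean_std(seg)[0] for seg in segments]
    a_mean, a_std = attentive_stats(seg_means, w)
    # Assumed fusion: plain concatenation of the two views.
    return g_mean + g_std + a_mean + a_std
```

For example, a 5-frame and a 23-frame input with 2-dimensional features both pool to a vector of length 8, which is the fixed-size property a pooling layer needs. In a trained model the scorer `w` would be learned rather than fixed by hand.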
Cheng-Yeh Yang
Dept. Computer Science and Information Engineering, National Taiwan Normal University, Taiwan
Kuan-Tang Huang
Dept. Computer Science and Information Engineering, National Taiwan Normal University, Taiwan
Chien-Chun Wang
National Taiwan Normal University
Speech Enhancement, Speech Recognition, Voice Activity Detection, Speech Quality Assessment
Hung-Shin Lee
North Co., Ltd., Taiwan
Speech Processing
Hsin-Min Wang
Research Fellow/Professor, Institute of Information Science, Academia Sinica
Spoken Language Processing, Natural Language Processing, Multimedia Information Retrieval, Machine Learning
Berlin Chen
Dept. Computer Science and Information Engineering, National Taiwan Normal University, Taiwan