🤖 AI Summary
Existing intrusive speech intelligibility predictors rely on explicit reference signals but underuse speech foundation models (SFMs), which limits their performance. This paper proposes a reference-conditioned framework that jointly models multiple SFM layers, predicting intelligibility through reference-aligned feature extraction, hierarchical SFM feature fusion, and deep regression modeling. Its key contribution is a reference-aware mechanism that draws on the representational capacity of SFMs for intrusive intelligibility assessment. Evaluated on the CPC3 challenge, the final system achieves an RMSE of 22.36 on the development set and 24.98 on the evaluation set, ranking 1st.
📝 Abstract
Intrusive speech-intelligibility predictors that exploit explicit reference signals are now widespread, yet they have not consistently surpassed non-intrusive systems. We argue that a primary cause is the limited exploitation of speech foundation models (SFMs). This work revisits intrusive prediction by combining reference conditioning with multi-layer SFM representations. Our final system achieves RMSE 22.36 on the development set and 24.98 on the evaluation set, ranking 1st on CPC3. These findings provide practical guidance for constructing SFM-based intrusive intelligibility predictors.
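For readers unfamiliar with the reported metric, the following minimal sketch shows how a root-mean-square error (RMSE) between predicted and listener-derived intelligibility scores can be computed. It assumes scores are expressed as the percentage of words correctly recognized (0 to 100); the function name `rmse` and the example scores are hypothetical and are not taken from the paper or from the CPC3 data.

```python
import math


def rmse(predicted, target):
    """Root-mean-square error between two equal-length score sequences."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    if not predicted:
        raise ValueError("need at least one score")
    squared_errors = [(p - t) ** 2 for p, t in zip(predicted, target)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical scores (assumption: percent of words correct, 0-100).
predicted_scores = [80.0, 35.0, 60.0, 10.0]
listener_scores = [100.0, 50.0, 40.0, 0.0]
error = rmse(predicted_scores, listener_scores)
print(f"RMSE: {error:.2f}")
```

Under this assumed scale, a lower RMSE means the predictions sit closer to the listener scores, and the value is in the same percentage-point units as the scores themselves.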