🤖 AI Summary
Existing audio quality assessment models struggle to emulate human selective auditory attention in complex, multi-source acoustic environments, which limits their accuracy when predicting Mean Opinion Score (MOS) and Signal-to-Noise Ratio (SNR). To address this, we propose a semi-intrusive multimodal audio assessment paradigm that frames MOS/SNR estimation as an instruction-driven text generation task conditioned on audio-text inputs. Our key contributions are: (1) a semi-intrusive evaluation framework that enables targeted quality analysis of a specific sound source; (2) a novel SNR estimator that focuses selectively on the target source in a mixture; and (3) an extension of the multimodal PENGI model, instruction fine-tuned to acquire human-like selective listening capabilities. For MOS prediction, our method achieves absolute Pearson correlation gains of 0.06 over the re-trained MOSRA model and 0.20 over the pre-trained PAM model; for SNR estimation, it outperforms both a random baseline and a fixed-prompt counterpart.
📝 Abstract
Human perception has the unique ability to focus on specific events in a mixture of signals, a challenging task for existing non-intrusive assessment methods. In this work, we introduce semi-intrusive assessment that emulates human attention by framing audio assessment as a text-prediction task with audio-text inputs. To this end, we extend the multi-modal PENGI model through instruction fine-tuning for MOS and SNR estimation. For MOS, our approach achieves absolute Pearson correlation gains of 0.06 and 0.20 over the re-trained MOSRA model and the pre-trained PAM model, respectively. We further propose a novel SNR estimator that can focus on a specific audio source in a mixture, outperforming a random baseline and the fixed-prompt counterpart. Our findings suggest that semi-intrusive assessment can effectively capture human-like selective listening capabilities. Samples are available at https://jozefcoldenhoff.github.io/semi-intrusive-assessment.
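To make the two quantities in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes the standard power-ratio definition of SNR in decibels and the standard Pearson correlation coefficient. The function names (`snr_db`, `pearson`), the synthetic tone-plus-noise mixture, and the prompt keys are all hypothetical and chosen only for illustration. The point it shows is that, for one and the same mixture, the reference SNR depends on which source the text prompt designates as the target.

```python
import numpy as np


def snr_db(target: np.ndarray, interference: np.ndarray) -> float:
    """SNR in dB, treating `target` as the signal and `interference` as noise."""
    p_target = np.mean(target ** 2)
    p_interference = np.mean(interference ** 2)
    return float(10.0 * np.log10(p_target / p_interference))


def pearson(x, y) -> float:
    """Pearson correlation between predicted and reference scores."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2)))


# Hypothetical two-source mixture: a tone standing in for the source of
# interest, plus Gaussian noise standing in for a competing source.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
tone = np.sin(2.0 * np.pi * 220.0 * t)
noise = 0.1 * rng.standard_normal(t.size)
mixture = tone + noise  # the only audio an estimator would receive

# The same mixture yields a different reference label depending on which
# source the (hypothetical) text prompt names as the target.
labels_by_prompt = {
    "tone": snr_db(tone, noise),
    "noise": snr_db(noise, tone),
}
print(labels_by_prompt)
```

Swapping the target and the interferer flips the sign of the label, which is why a fixed prompt cannot serve both cases and why conditioning on text is needed to select the source. A Pearson correlation such as the one used for the MOS comparison would then be computed with `pearson(predicted_scores, reference_scores)` over a test set.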