Semi-intrusive audio evaluation: Casting non-intrusive assessment as a multi-modal text prediction task

📅 2024-09-21

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing audio quality assessment models struggle to emulate human selective auditory attention in complex, multi-source acoustic environments, which limits their accuracy when predicting Mean Opinion Score (MOS) and Signal-to-Noise Ratio (SNR). To address this, the paper proposes a semi-intrusive multimodal assessment paradigm that frames MOS and SNR estimation as an instruction-driven text generation task conditioned on audio-text inputs. The key contributions are: (1) a semi-intrusive evaluation framework that enables targeted quality analysis of specific sound sources; (2) a novel SNR estimator that focuses selectively on a target source within a mixture; and (3) a PENGI-based architecture integrating an audio encoder and a text decoder, instruction fine-tuned to acquire human-like selective listening capabilities. For MOS prediction, the method achieves absolute Pearson correlation gains of 0.06 over the re-trained MOSRA model and 0.20 over the pre-trained PAM model. For SNR estimation, it outperforms both a random baseline and its fixed-prompt counterpart.

📝 Abstract

Human perception has the unique ability to focus on specific events in a mixture of signals--a challenging task for existing non-intrusive assessment methods. In this work, we introduce semi-intrusive assessment that emulates human attention by framing audio assessment as a text-prediction task with audio-text inputs. To this end, we extend the multi-modal PENGI model through instruction fine-tuning for MOS and SNR estimation. For MOS, our approach achieves absolute Pearson correlation gains of 0.06 and 0.20 over the re-trained MOSRA model and the pre-trained PAM model, respectively. We further propose a novel SNR estimator that can focus on a specific audio source in a mixture, outperforming a random baseline and the fixed-prompt counterpart. Our findings suggest that semi-intrusive assessment can effectively capture human-like selective listening capabilities. Samples are available at https://jozefcoldenhoff.github.io/semi-intrusive-assessment.
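
To make the text-prediction framing concrete, here is a minimal Python sketch. It is not the authors' implementation: the prompt wording, the function names (`build_prompt`, `parse_score`, `pearson`), and the sample model outputs are all hypothetical and chosen only for illustration. It shows how an instruction could name the target source, how a generated text answer could be turned back into a number, and how predictions could then be compared with human ratings using Pearson correlation.

```python
import math
import re
from typing import Optional, Sequence


def build_prompt(task: str, target: Optional[str] = None) -> str:
    """Build an instruction for the model (hypothetical wording, not from the paper)."""
    if task == "mos":
        return "Rate the overall quality of this audio from 1 to 5."
    if task == "snr":
        if target is None:
            raise ValueError("an SNR prompt needs a target source")
        # Naming the target source is what steers attention within the mixture.
        return f"Estimate the signal-to-noise ratio in dB of the {target}."
    raise ValueError(f"unknown task: {task}")


def parse_score(generated_text: str) -> Optional[float]:
    """Extract the first number from the generated text, or None if there is none."""
    match = re.search(r"-?\d+(?:\.\d+)?", generated_text)
    return float(match.group()) if match else None


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    print(build_prompt("snr", target="speech"))

    # Made-up model outputs and human ratings, for illustration only.
    outputs = ["The MOS is 3.7", "4.2", "Quality: 2.1", "about 3.0"]
    human_mos = [3.5, 4.4, 2.0, 3.2]

    predicted = [parse_score(text) for text in outputs]
    print(predicted)
    print(pearson(predicted, human_mos))
```

Parsing the first number out of free text is a deliberately simple choice. A real system would also need to handle answers that contain no number or several numbers, which this sketch only covers by returning `None` in the first case.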

Problem

Research questions and friction points this paper is trying to address.

Audio Quality Assessment

Complex Acoustic Environment

Human Auditory Attention Simulation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Semi-Intrusive Evaluation

PENGI Model Improvement

Selective Auditory Assessment

🔎 Similar Papers

A Study on Zero-shot Non-intrusive Speech Assessment using Large Language Models