AI Summary
The robustness and generalization capability of neural speech codecs under realistic noisy conditions remain poorly understood.
Method: This paper presents the first systematic evaluation of mainstream models' performance degradation across diverse noise types, introducing a comprehensive analytical framework integrating nonlinear distortion quantification, frequency-domain response modeling, and multi-condition speech degradation simulation.
Contribution/Results: (1) Significant architectural differences in robustness are identified, primarily attributable to heightened sensitivity of implicit nonlinear distortions to noise; (2) strong correlation is observed between high-frequency response attenuation and reduced speech intelligibility; (3) a novel, interpretable, and quantitative robustness assessment paradigm is proposed, grounded in frequency-response features. These findings provide theoretically grounded, measurable insights into the underlying mechanisms of codec robustness and establish principled technical pathways for architecture optimization.
Abstract
Neural speech codecs have revolutionized speech coding, achieving higher compression while preserving audio fidelity. Beyond compression, they have emerged as tokenization strategies, enabling language modeling on speech and driving paradigm shifts across various speech processing tasks. Despite these advancements, their robustness in noisy environments remains underexplored, raising concerns about their generalization to real-world scenarios. In this work, we systematically evaluate neural speech codecs under various noise conditions, revealing non-trivial differences in their robustness. We further examine their linearity properties, uncovering non-linear distortions that partly explain the observed variations in robustness. Lastly, we analyze their frequency response to identify factors affecting audio fidelity. Our findings provide critical insights into codec behavior and future codec design, and they emphasize the importance of noise robustness for real-world integration.
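The noise-condition and linearity analyses mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's evaluation protocol: the function names `mix_at_snr` and `additivity_error_db` are hypothetical, and `toy_codec` is a soft-clipping stand-in used only because no real codec is specified here. The sketch assumes a codec can be treated as a function from a waveform to a reconstructed waveform of the same length.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so the mixture has the requested SNR (dB)."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # SNR = 10 * log10(P_speech / P_noise), solved for the required noise power.
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    scale = np.sqrt(target_noise_power / noise_power)
    return speech + scale * noise


def additivity_error_db(codec, x, y):
    """Energy of codec(x + y) - (codec(x) + codec(y)), relative to the summed outputs, in dB.

    A linear system gives a very low value; larger values indicate
    non-linear behaviour for this particular pair of inputs.
    """
    residual = codec(x + y) - (codec(x) + codec(y))
    reference = codec(x) + codec(y)
    eps = 1e-12  # avoids log of zero when the residual vanishes
    return 10 * np.log10((np.sum(residual ** 2) + eps) / (np.sum(reference ** 2) + eps))


def toy_codec(signal):
    """Hypothetical stand-in for a codec: soft clipping, which is non-linear."""
    return np.tanh(signal)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.arange(16000) / 16000.0
    speech = 0.5 * np.sin(2 * np.pi * 220.0 * t)  # synthetic tone in place of speech
    noise = rng.standard_normal(t.shape)
    noisy = mix_at_snr(speech, noise, snr_db=10.0)
    print("additivity error (dB):", additivity_error_db(toy_codec, speech, noisy - speech))
```

With a real codec, `toy_codec` would be replaced by an encode-then-decode round trip, and the same additivity check could be repeated across noise types and SNR levels to see where the non-linear residual grows.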