🤖 AI Summary
This study systematically investigates the generalization capabilities of neural audio codecs to unseen languages and non-speech content—including environmental sounds, music, and animal vocalizations. By training a unified codec architecture from scratch and evaluating it on a rigorously controlled, multi-category audio dataset, the work provides the first quantitative assessment of cross-lingual and cross-modal generalization within a single framework. Performance is measured across 11 metrics encompassing signal reconstruction fidelity and downstream task effectiveness. Results demonstrate that the model generalizes effectively to unseen languages. However, pretraining exclusively on speech significantly degrades performance on non-speech tasks. In contrast, incorporating diverse non-speech data during pretraining not only enhances non-speech task performance but also preserves high-quality speech reconstruction.
📝 Abstract
This paper investigates three crucial yet underexplored aspects of the generalization capabilities of neural audio codecs (NACs): (i) whether NACs can generalize to unseen languages during pre-training, (ii) whether speech-only pre-trained NACs can effectively generalize to non-speech applications such as environmental sounds, music, and animal vocalizations, and (iii) whether incorporating non-speech data during pre-training can improve performance on both speech and non-speech tasks. Existing studies typically rely on off-the-shelf NACs for comparison, which limits insight due to variations in implementation. In this work, we train NACs from scratch using strictly controlled configurations and carefully curated pre-training data to enable fair comparisons. We conduct a comprehensive evaluation of NAC performance on both signal reconstruction quality and downstream applications using 11 metrics. Our results show that NACs can generalize to unseen languages during pre-training, speech-only pre-trained NACs exhibit degraded performance on non-speech tasks, and incorporating non-speech data during pre-training improves performance on non-speech tasks while maintaining comparable performance on speech tasks.