🤖 AI Summary
To address the substantial storage overhead, low transmission efficiency, and heightened privacy risks of continuous speech representations, this paper proposes Codec2Vec, the first self-supervised speech representation learning framework built entirely on discrete acoustic units from neural audio codecs. Codec2Vec employs a masked discrete-unit prediction objective and explores several training-target derivation strategies to learn robust, compact speech representations. On the SUPERB benchmark, it performs comparably to continuous-input models while reducing storage requirements by up to 16.5× and speeding up training by 2.3×. Crucially, its discrete tokenization enables native on-device data anonymization, strengthening privacy preservation and system scalability. The core contribution is the direct modeling of discrete speech codes as the fundamental units for self-supervised learning, establishing a new paradigm for efficient, secure, and lightweight speech representation.
📝 Abstract
Recent advancements in neural audio codecs have not only enabled superior audio compression but also enhanced speech synthesis techniques. Researchers are now exploring their potential as universal acoustic feature extractors for a broader range of speech processing tasks. Building on this trend, we introduce Codec2Vec, the first speech representation learning framework that relies exclusively on discrete audio codec units. This approach offers several advantages, including improved data storage and transmission efficiency, faster training, and enhanced data privacy. We explore masked prediction with various training target derivation strategies to thoroughly understand the effectiveness of this framework. Evaluated on the SUPERB benchmark, Codec2Vec achieves competitive performance compared to continuous-input models while reducing storage requirements by up to 16.5× and training time by 2.3×, showcasing its scalability and efficiency.
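To make the masked-prediction objective concrete, the sketch below shows the general idea in toy form: discrete codec unit IDs are randomly masked, and cross-entropy loss is computed only at the masked positions. Everything here is an illustrative assumption (vocabulary size, mask rate, and the random "logits" standing in for the output of a real Transformer encoder); the paper's actual architecture and target-derivation strategies are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumptions: a neural codec quantizes audio into unit IDs
# drawn from a vocabulary of 1024; sequence length and mask rate are toy values.
vocab_size, seq_len, mask_prob = 1024, 50, 0.5
MASK_ID = vocab_size  # a reserved ID standing in for the learned [MASK] token

units = rng.integers(0, vocab_size, size=seq_len)  # discrete codec units (targets)
mask = rng.random(seq_len) < mask_prob             # positions to corrupt
inputs = np.where(mask, MASK_ID, units)            # masked input sequence

# Stand-in for the model: random per-position logits over the unit vocabulary.
# In the real framework these would come from a Transformer encoder over `inputs`.
logits = rng.standard_normal((seq_len, vocab_size))

# Masked-prediction loss: cross-entropy only where the input was masked,
# so the model must infer the hidden units from surrounding context.
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
loss = -log_probs[mask, units[mask]].mean()
```

Because the targets are themselves discrete IDs rather than continuous features, the raw waveform never needs to leave the device once it has been tokenized, which is the basis of the privacy and storage claims above.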