🤖 AI Summary
To address the substantial storage overhead, low transmission efficiency, and heightened privacy risks of continuous speech representations, this paper proposes Codec2Vec, the first self-supervised speech representation learning framework built entirely on discrete acoustic units from neural audio codecs. Codec2Vec employs a masked discrete-unit prediction objective and explores several training-target derivation strategies to learn robust, compact speech representations. On the SUPERB benchmark, it performs comparably to continuous-input models while reducing storage requirements by up to 16.5× and speeding up training by 2.3×. Crucially, its discrete tokenization enables native on-device data anonymization, strengthening privacy preservation and system scalability. The core contribution is the direct modeling of discrete speech codes as the fundamental units for self-supervised learning, establishing a new paradigm for efficient, secure, and lightweight speech representation.
📝 Abstract
Recent advancements in neural audio codecs have not only enabled superior audio compression but also enhanced speech synthesis techniques. Researchers are now exploring their potential as universal acoustic feature extractors for a broader range of speech processing tasks. Building on this trend, we introduce Codec2Vec, the first speech representation learning framework that relies exclusively on discrete audio codec units. This approach offers several advantages, including improved data storage and transmission efficiency, faster training, and enhanced data privacy. We explore masked prediction with various training target derivation strategies to thoroughly understand the effectiveness of this framework. Evaluated on the SUPERB benchmark, Codec2Vec achieves competitive performance compared to continuous-input models while reducing storage requirements by up to 16.5× and training time by 2.3×, showcasing its scalability and efficiency.
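To make the masked-prediction objective concrete, the sketch below shows the general idea in toy form: discrete codec unit IDs are randomly masked, and cross-entropy loss is computed only at the masked positions. Everything here is an illustrative assumption (vocabulary size, mask rate, and the random "logits" standing in for the output of a real Transformer encoder); the paper's actual architecture and target-derivation strategies are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumptions: a neural codec quantizes audio into unit IDs
# drawn from a vocabulary of 1024; sequence length and mask rate are toy values.
vocab_size, seq_len, mask_prob = 1024, 50, 0.5
MASK_ID = vocab_size  # a reserved ID standing in for the learned [MASK] token

units = rng.integers(0, vocab_size, size=seq_len)  # discrete codec units (targets)
mask = rng.random(seq_len) < mask_prob             # positions to corrupt
inputs = np.where(mask, MASK_ID, units)            # masked input sequence

# Stand-in for the model: random per-position logits over the unit vocabulary.
# In the real framework these would come from a Transformer encoder over `inputs`.
logits = rng.standard_normal((seq_len, vocab_size))

# Masked-prediction loss: cross-entropy only where the input was masked,
# so the model must infer the hidden units from surrounding context.
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
loss = -log_probs[mask, units[mask]].mean()
```

Because the targets are themselves discrete IDs rather than continuous features, the raw waveform never needs to leave the device once it has been tokenized, which is the basis of the privacy and storage claims above.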