🤖 AI Summary
This work addresses the challenge of learning robust, invertible, highly compressive, and language-model-friendly speech representations. To this end, we propose a two-stage self-supervised framework. In the first stage, semantic audio features are learned in latent space via masked prediction using a Joint-Embedding Predictive Architecture (JEPA) augmented with a Density-Adaptive Attention Mechanism (DAAM). In the second stage, hierarchical speech structure is modeled and invertible tokens are generated at an ultra-low frame rate of 2.5 Hz, leveraging Gaussian-mixture density-adaptive gating, Finite Scalar Quantization (FSQ), and mixed-radix packing. The method produces compact sequences at 47.5 tokens/second. Reconstructed speech quality is competitive with state-of-the-art neural audio codecs, while the representation achieves significantly lower bitrates and improved compatibility with large language models, establishing an efficient foundational representation for large speech models.
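As a rough illustration of the tokenization step, the sketch below shows how per-dimension FSQ code indices can be packed into a single integer token with mixed-radix arithmetic. The level counts and latent dimensionality are assumptions for the example, not the configuration used in the paper.

```python
# Illustrative sketch of FSQ followed by mixed-radix packing.
# LEVELS and the latent dimension are hypothetical choices for this example.
import numpy as np

LEVELS = [8, 5, 5, 5]  # assumed FSQ levels per latent dimension

def fsq_quantize(z):
    """Quantize each bounded latent dimension to one of LEVELS[i] integer codes.

    z: array of shape (..., len(LEVELS)) with real-valued latents.
    Returns integer code indices with codes[..., i] in [0, LEVELS[i]).
    """
    codes = []
    for i, L in enumerate(LEVELS):
        x = np.tanh(z[..., i])                                   # squash to (-1, 1)
        idx = np.round((x + 1.0) / 2.0 * (L - 1)).astype(np.int64)
        codes.append(idx)
    return np.stack(codes, axis=-1)

def pack_mixed_radix(codes):
    """Pack per-dimension code indices into one integer token.

    With radices [8, 5, 5, 5] the packed token lies in [0, 8*5*5*5) = [0, 1000).
    """
    token = np.zeros(codes.shape[:-1], dtype=np.int64)
    for i, L in enumerate(LEVELS):
        token = token * L + codes[..., i]
    return token

def unpack_mixed_radix(token):
    """Exact inverse of pack_mixed_radix, recovering per-dimension codes."""
    codes = []
    for L in reversed(LEVELS):
        codes.append(token % L)
        token = token // L
    return np.stack(list(reversed(codes)), axis=-1)

if __name__ == "__main__":
    z = np.random.randn(4, len(LEVELS))          # 4 frames of latent features
    codes = fsq_quantize(z)
    tokens = pack_mixed_radix(codes)
    assert np.array_equal(unpack_mixed_radix(tokens), codes)  # packing is lossless
```

Because packing and unpacking are exact inverses, the quantized representation stays invertible, and a downstream language model only ever sees the packed integer tokens.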
📝 Abstract
We introduce a two-stage self-supervised framework that combines the Joint-Embedding Predictive Architecture (JEPA) with a Density-Adaptive Attention Mechanism (DAAM) for learning robust speech representations. Stage 1 uses JEPA with DAAM to learn semantic audio features via masked prediction in latent space, fully decoupled from waveform reconstruction. Stage 2 leverages these representations for efficient tokenization using Finite Scalar Quantization (FSQ) and a mixed-radix packing scheme, followed by high-fidelity waveform reconstruction with a HiFi-GAN decoder. By integrating Gaussian mixture-based density-adaptive gating into the JEPA encoder, the model performs adaptive temporal feature selection and discovers hierarchical speech structure at a low frame rate of 2.5 Hz. The resulting tokens (47.5 tokens/sec) provide a reversible, highly compressed, and language-model-friendly representation that is competitive with, and often more efficient than, existing neural audio codecs.
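For intuition on the Stage 1 objective, the following sketch shows a JEPA-style masked-prediction step carried out entirely in latent space, with an EMA target encoder and no waveform-reconstruction term. The module sizes, mask ratio, and plain Transformer encoders are illustrative assumptions; the DAAM / density-adaptive gating components of the actual model are not reproduced here.

```python
# Minimal sketch of a JEPA-style masked latent prediction step (PyTorch assumed).
# Architectures and hyperparameters are placeholders, not the paper's settings.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentEncoder(nn.Module):
    """Stand-in context/target encoder over frame-level audio features."""
    def __init__(self, dim=256, depth=4, heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, depth)

    def forward(self, x):
        return self.backbone(x)

context_enc = LatentEncoder()
target_enc = copy.deepcopy(context_enc)            # EMA target encoder, no gradients
for p in target_enc.parameters():
    p.requires_grad_(False)
predictor = nn.Sequential(nn.Linear(256, 256), nn.GELU(), nn.Linear(256, 256))
opt = torch.optim.AdamW(
    list(context_enc.parameters()) + list(predictor.parameters()), lr=1e-4
)

def jepa_step(frames, mask_ratio=0.5, ema=0.999):
    """One training step: predict target-encoder latents at masked frame positions.

    frames: (batch, time, dim) frame-level features from an upstream front end.
    """
    B, T, D = frames.shape
    mask = torch.rand(B, T, device=frames.device) < mask_ratio   # True = masked

    # The context encoder sees only visible frames (masked positions zeroed here
    # for simplicity; a real model would drop or replace them).
    context_in = frames.masked_fill(mask.unsqueeze(-1), 0.0)
    pred = predictor(context_enc(context_in))

    with torch.no_grad():
        target = target_enc(frames)                              # full-view targets

    # Latent-space regression on masked positions only; no waveform loss at all.
    loss = F.smooth_l1_loss(pred[mask], target[mask])
    opt.zero_grad()
    loss.backward()
    opt.step()

    # EMA update of the target encoder.
    with torch.no_grad():
        for p_t, p_c in zip(target_enc.parameters(), context_enc.parameters()):
            p_t.mul_(ema).add_(p_c, alpha=1 - ema)
    return loss.item()

# Example usage with random features standing in for real audio frames.
loss = jepa_step(torch.randn(2, 100, 256))
```

The key property this sketch captures is the decoupling described in the abstract: the prediction target is another encoder's latent, so Stage 1 never touches the waveform, leaving reconstruction entirely to the Stage 2 tokenizer and HiFi-GAN decoder.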