JEPA as a Neural Tokenizer: Learning Robust Speech Representations with Density Adaptive Attention

📅 2025-12-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of learning speech representations that are robust, invertible, highly compressive, and language-model-friendly. To this end, we propose a two-stage self-supervised framework. In the first stage, semantic audio features are learned in latent space via masked prediction using a Joint Embedding Predictive Architecture (JEPA) augmented with a Density-Adaptive Attention Mechanism (DAAM). In the second stage, hierarchical speech structure modeling and invertible token generation are achieved at an ultra-low frame rate of 2.5 Hz, leveraging Gaussian-mixture density-adaptive gating, Finite Scalar Quantization (FSQ), and mixed-radix packing. The method produces compact sequences at 47.5 tokens/second. Reconstructed speech quality matches that of state-of-the-art neural audio codecs, while achieving significantly better compression efficiency (lower bitrates) and stronger compatibility with large language models, establishing an efficient foundational representation for large speech models.

📝 Abstract
We introduce a two-stage self-supervised framework that combines the Joint-Embedding Predictive Architecture (JEPA) with a Density Adaptive Attention Mechanism (DAAM) for learning robust speech representations. Stage 1 uses JEPA with DAAM to learn semantic audio features via masked prediction in latent space, fully decoupled from waveform reconstruction. Stage 2 leverages these representations for efficient tokenization using Finite Scalar Quantization (FSQ) and a mixed-radix packing scheme, followed by high-fidelity waveform reconstruction with a HiFi-GAN decoder. By integrating Gaussian mixture-based density-adaptive gating into the JEPA encoder, the model performs adaptive temporal feature selection and discovers hierarchical speech structure at a low frame rate of 2.5 Hz. The resulting tokens (47.5 tokens/sec) provide a reversible, highly compressed, and language-model-friendly representation that is competitive with, and often more efficient than, existing neural audio codecs.
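The abstract describes tokenization via Finite Scalar Quantization followed by mixed-radix packing. A minimal sketch of how such a scheme can work is below; the level counts in `LEVELS` are hypothetical (the paper's actual FSQ configuration is not stated here), but the quantize/pack/unpack round trip illustrates why the resulting tokens are reversible.

```python
# Hypothetical FSQ level counts per latent dimension (illustrative only;
# the paper's actual configuration is not specified in this summary).
LEVELS = [8, 5, 5, 5]

def fsq_quantize(z):
    """Round each coordinate of z to one of L evenly spaced levels in
    [-1, 1], returning an integer index per dimension."""
    idx = []
    for zi, L in zip(z, LEVELS):
        zi = max(-1.0, min(1.0, zi))  # clip to the bounded range FSQ assumes
        idx.append(round((zi + 1.0) / 2.0 * (L - 1)))
    return idx

def pack_mixed_radix(idx):
    """Combine per-dimension indices into a single token id, treating
    LEVELS as a mixed-radix number system (least-significant digit first)."""
    code, base = 0, 1
    for d, L in zip(idx, LEVELS):
        code += d * base
        base *= L
    return code

def unpack_mixed_radix(code):
    """Invert pack_mixed_radix, recovering the per-dimension indices."""
    idx = []
    for L in LEVELS:
        idx.append(code % L)
        code //= L
    return idx
```

The packing step is lossless: each index tuple maps to a unique id in [0, 8·5·5·5) = [0, 1000), so quantized frames can always be recovered exactly from the token stream, which is what makes the representation invertible at the token level.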
Problem

Research questions and friction points this paper is trying to address.

Learn robust speech representations via self-supervised masked prediction
Enable efficient tokenization with compressed, reversible audio representations
Discover hierarchical speech structure at low frame rates for efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

JEPA with DAAM for masked semantic feature learning
FSQ and mixed-radix packing for efficient tokenization
HiFi-GAN decoder for high-fidelity waveform reconstruction
🔎 Similar Papers
2024-07-22 · arXiv.org · Citations: 4
Georgios Ioannides · Carnegie Mellon University, Amazon GenAI, James Silberrad Brown Center for Artificial Intelligence
Christos Constantinou · University of Bristol, Amazon GenAI, James Silberrad Brown Center for Artificial Intelligence
Aman Chadha · GenAI Leadership @ Apple, Stanford AI, UW-Madison ECE, Ex: Apple, AWS, Alexa, Nvidia (Multimodal AI, Natural Language Processing, Computer Vision, Speech Processing, Recommender Systems)
Aaron Elkins · James Silberrad Brown Center for Artificial Intelligence
Linsey Pang · Northeastern University
Ravid Shwartz-Ziv · New York University (machine learning, deep learning, representation learning theory, neuroscience)
Yann LeCun · Chief AI Scientist at Facebook & JT Schwarz Professor at the Courant Institute, New York University (AI, machine learning, computer vision, robotics, image compression)