Factorized RVQ-GAN For Disentangled Speech Tokenization

📅 2025-06-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of disentangling coupled multi-level linguistic structure (acoustic, phonetic, and lexical) in speech representations. We propose the Hierarchical Audio Codec (HAC), an RVQ-GAN-based architecture that achieves, for the first time in a single model, explicit three-level factorization: hierarchical vector quantization separately models acoustic, phonetic, and lexical bottleneck representations. To align tokens with phonemes, HAC distills from a pre-trained HuBERT speech encoder; to make lexical tokens semantically interpretable, it distills cross-modal semantics from LaBSE. A multi-objective joint optimization framework coordinates these components. Experiments on English and multilingual datasets demonstrate that HAC significantly improves hierarchical disentanglement, speech reconstruction fidelity, and token naturalness. The resulting tokens exhibit both high phoneme-alignment accuracy and strong lexical semantic consistency, outperforming single-level baselines across all metrics.

📝 Abstract
We propose Hierarchical Audio Codec (HAC), a unified neural speech codec that factorizes its bottleneck into three linguistic levels (acoustic, phonetic, and lexical) within a single model. HAC leverages two knowledge distillation objectives: one from a pre-trained speech encoder (HuBERT) for phoneme-level structure, and another from a text-based encoder (LaBSE) for lexical cues. Experiments on English and multilingual data show that HAC's factorized bottleneck yields disentangled token sets: one aligns with phonemes, while another captures word-level semantics. Quantitative evaluations confirm that HAC tokens preserve naturalness and provide interpretable linguistic information, outperforming single-level baselines in both disentanglement and reconstruction quality. These findings underscore HAC's potential as a unified discrete speech representation, bridging acoustic detail and lexical meaning for downstream speech generation and understanding tasks.
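The factorized bottleneck builds on residual vector quantization (RVQ), in which each level quantizes the residual left by the previous one, so successive token streams can specialize in different levels of detail. A minimal pure-Python sketch of that mechanism; the codebooks, dimensions, and the mapping of levels to acoustic/phonetic/lexical roles are illustrative assumptions, not the paper's actual configuration:

```python
def nearest(codebook, vec):
    """Return (index, codeword) of the codebook entry closest to vec (L2)."""
    best_i = min(range(len(codebook)),
                 key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)))
    return best_i, codebook[best_i]

def rvq_encode(vec, codebooks):
    """Residual VQ: level k quantizes the residual left by levels 1..k-1.

    Returns one token index per level plus the summed reconstruction.
    In a HAC-style codec the levels would correspond to acoustic,
    phonetic, and lexical token streams (illustrative here).
    """
    residual = list(vec)
    tokens, recon = [], [0.0] * len(vec)
    for cb in codebooks:
        idx, code = nearest(cb, residual)
        tokens.append(idx)
        recon = [r + c for r, c in zip(recon, code)]
        residual = [x - c for x, c in zip(residual, code)]
    return tokens, recon
```

Because each level only sees what earlier levels failed to explain, the hierarchy naturally coarse-to-fine partitions the signal; the paper's distillation losses then steer specific levels toward phonetic and lexical structure.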
Problem

Research questions and friction points this paper is trying to address.

Disentangle speech into acoustic, phonetic, lexical levels
Improve tokenization with phoneme and lexical knowledge distillation
Unify discrete speech representation for generation and understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Factorized bottleneck into three linguistic levels
Leverages HuBERT and LaBSE for distillation
Disentangled tokens for phonemes and semantics
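The two distillation objectives above can be pictured as extra terms in a weighted multi-objective loss: one pulls phonetic-level representations toward HuBERT features, the other pulls lexical-level representations toward LaBSE text embeddings. A hedged sketch; the function name, weights, and the cosine-distance formulation are assumptions for illustration, not the paper's exact losses:

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def hac_style_loss(recon, target, phon_emb, hubert_emb, lex_emb, labse_emb,
                   w_rec=1.0, w_phon=1.0, w_lex=1.0):
    """Illustrative joint objective: reconstruction (MSE) plus two
    distillation terms, one aligning phonetic-level embeddings with
    HuBERT features and one aligning lexical-level embeddings with
    LaBSE embeddings. Weights are placeholders, not tuned values."""
    l_rec = sum((r - t) ** 2 for r, t in zip(recon, target)) / len(target)
    l_phon = cosine_distance(phon_emb, hubert_emb)
    l_lex = cosine_distance(lex_emb, labse_emb)
    return w_rec * l_rec + w_phon * l_phon + w_lex * l_lex
```

Joint optimization of this kind is what lets a single codec keep reconstruction fidelity while its intermediate token streams absorb phoneme- and word-level structure from the teacher models.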