Factorized RVQ-GAN For Disentangled Speech Tokenization

📅 2025-06-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of disentangling coupled multi-level linguistic structure (acoustic, phonetic, and lexical) in speech representations. We propose the Hierarchical Audio Codec (HAC), an RVQ-GAN-based architecture that achieves, for the first time in a single model, explicit three-level factorization: hierarchical vector quantization separately models acoustic, phonetic, and lexical bottleneck representations. To align tokens with phonemes, HAC distills from a pre-trained HuBERT speech encoder; to make lexical tokens semantically interpretable, it distills cross-modal semantics from LaBSE. A multi-objective joint optimization framework coordinates these components. Experiments on English and multilingual datasets demonstrate that HAC significantly improves hierarchical disentanglement, speech reconstruction fidelity, and token naturalness. The resulting tokens exhibit both high phoneme-alignment accuracy and strong lexical semantic consistency, outperforming single-level baselines across all metrics.

📝 Abstract
We propose Hierarchical Audio Codec (HAC), a unified neural speech codec that factorizes its bottleneck into three linguistic levels (acoustic, phonetic, and lexical) within a single model. HAC leverages two knowledge distillation objectives: one from a pre-trained speech encoder (HuBERT) for phoneme-level structure, and another from a text-based encoder (LaBSE) for lexical cues. Experiments on English and multilingual data show that HAC's factorized bottleneck yields disentangled token sets: one aligns with phonemes, while another captures word-level semantics. Quantitative evaluations confirm that HAC tokens preserve naturalness and provide interpretable linguistic information, outperforming single-level baselines in both disentanglement and reconstruction quality. These findings underscore HAC's potential as a unified discrete speech representation, bridging acoustic detail and lexical meaning for downstream speech generation and understanding tasks.
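The factorized bottleneck builds on residual vector quantization (RVQ), in which each level quantizes the residual left by the previous one, so successive token streams can specialize in different levels of detail. A minimal pure-Python sketch of that mechanism; the codebooks, dimensions, and the mapping of levels to acoustic/phonetic/lexical roles are illustrative assumptions, not the paper's actual configuration:

```python
def nearest(codebook, vec):
    """Return (index, codeword) of the codebook entry closest to vec (L2)."""
    best_i = min(range(len(codebook)),
                 key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)))
    return best_i, codebook[best_i]

def rvq_encode(vec, codebooks):
    """Residual VQ: level k quantizes the residual left by levels 1..k-1.

    Returns one token index per level plus the summed reconstruction.
    In a HAC-style codec the levels would correspond to acoustic,
    phonetic, and lexical token streams (illustrative here).
    """
    residual = list(vec)
    tokens, recon = [], [0.0] * len(vec)
    for cb in codebooks:
        idx, code = nearest(cb, residual)
        tokens.append(idx)
        recon = [r + c for r, c in zip(recon, code)]
        residual = [x - c for x, c in zip(residual, code)]
    return tokens, recon
```

Because each level only sees what earlier levels failed to explain, the hierarchy naturally coarse-to-fine partitions the signal; the paper's distillation losses then steer specific levels toward phonetic and lexical structure.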
Problem

Research questions and friction points this paper is trying to address.

Disentangle speech into acoustic, phonetic, lexical levels
Improve tokenization with phoneme and lexical knowledge distillation
Unify discrete speech representation for generation and understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Factorized bottleneck into three linguistic levels
Leverages HuBERT and LaBSE for distillation
Disentangled tokens for phonemes and semantics
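The two distillation objectives above can be pictured as extra terms in a weighted multi-objective loss: one pulls phonetic-level representations toward HuBERT features, the other pulls lexical-level representations toward LaBSE text embeddings. A hedged sketch; the function name, weights, and the cosine-distance formulation are assumptions for illustration, not the paper's exact losses:

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def hac_style_loss(recon, target, phon_emb, hubert_emb, lex_emb, labse_emb,
                   w_rec=1.0, w_phon=1.0, w_lex=1.0):
    """Illustrative joint objective: reconstruction (MSE) plus two
    distillation terms, one aligning phonetic-level embeddings with
    HuBERT features and one aligning lexical-level embeddings with
    LaBSE embeddings. Weights are placeholders, not tuned values."""
    l_rec = sum((r - t) ** 2 for r, t in zip(recon, target)) / len(target)
    l_phon = cosine_distance(phon_emb, hubert_emb)
    l_lex = cosine_distance(lex_emb, labse_emb)
    return w_rec * l_rec + w_phon * l_phon + w_lex * l_lex
```

Joint optimization of this kind is what lets a single codec keep reconstruction fidelity while its intermediate token streams absorb phoneme- and word-level structure from the teacher models.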