🤖 AI Summary
Existing speech tokenizers struggle to provide both learnability for language models and high-fidelity waveform reconstruction, which complicates the design of unified architectures for speech generation and understanding. This work proposes HoliTok, a continuous holistic speech tokenizer that encodes 48 kHz audio into a compact 25 Hz sequence of 128-dimensional latents. Through progressive multi-objective training, HoliTok jointly targets signal fidelity, semantic preservation, and latent-space learnability. A unified autoregressive and diffusion transformer (AR+DiT) model built on these latents handles both speech synthesis and speech recognition from the same latent sequence. In the reported experiments, HoliTok achieves competitive reconstruction fidelity, improves generative learnability for high-quality and controllable synthesis, and is the only evaluated representation that operates robustly in the unified AR+DiT architecture without additional optimization tricks.
📝 Abstract
Unified speech foundation models require a holistic tokenization space that is both learnable by language models and decodable into high-quality waveforms. Existing speech tokenizers, however, often fail to satisfy these requirements simultaneously, leading to increased architectural complexity and more involved training designs. We propose HoliTok, a continuous Holistic speech Tokenization model designed for unified generation-understanding modeling. HoliTok encodes 48 kHz speech into a compact 25 Hz sequence of 128-dimensional latents. It is trained with a progressive strategy that jointly preserves signal-level fidelity, incorporates semantic information, and maintains strong latent learnability. Based on this tokenization, we build a unified AR+DiT model for speech synthesis and recognition, where the same latent sequence supports both generation-specific and unified generation-understanding tasks. Experiments show that HoliTok achieves competitive reconstruction fidelity, improves generative learnability for high-quality and controllable synthesis, and, among the evaluated representations, is the only one that operates robustly in our unified generation-understanding architecture without additional optimization tricks. These results suggest that HoliTok serves as an effective speech tokenizer and a foundational representation interface for unified spoken language modeling. The code is available at: https://github.com/bovod-sjtu/HoliTok.
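To make the stated rates concrete, the following minimal sketch works out the arithmetic implied by 48 kHz input, a 25 Hz latent rate, and 128-dimensional latents. It is not taken from the HoliTok code: the function name `latent_shape` and the choice to round a final partial frame up are assumptions made for illustration only.

```python
# Figures taken from the abstract.
SAMPLE_RATE_HZ = 48_000   # input audio sample rate
FRAME_RATE_HZ = 25        # latent frames per second
LATENT_DIM = 128          # dimensions per latent frame

# Number of audio samples summarized by one latent frame.
HOP_SAMPLES = SAMPLE_RATE_HZ // FRAME_RATE_HZ


def latent_shape(num_samples: int) -> tuple[int, int]:
    """Return (frames, dims) for a clip of `num_samples` samples.

    Hypothetical helper: assumes a trailing partial frame is padded
    and kept, which the abstract does not specify.
    """
    num_frames = -(-num_samples // HOP_SAMPLES)  # ceiling division
    return num_frames, LATENT_DIM


if __name__ == "__main__":
    frames, dims = latent_shape(10 * SAMPLE_RATE_HZ)  # a 10-second clip
    print(f"samples per latent frame: {HOP_SAMPLES}")
    print(f"10 s clip -> {frames} frames x {dims} dims")
    print(f"values per second: {FRAME_RATE_HZ * LATENT_DIM} latent vs {SAMPLE_RATE_HZ} samples")
```

Under these figures, each latent frame covers 1920 samples (40 ms), and one second of audio is represented by 3200 continuous values rather than 48000 waveform samples.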