HoliTok: A Continuous Holistic Tokenization with Robust Dual Capabilities of Speech Generation and Understanding

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing speech tokenizers struggle to ensure both learnability for language models and high-fidelity waveform reconstruction, which hinders the development of unified architectures for speech generation and understanding. This work proposes HoliTok, a continuous holistic speech tokenizer that encodes 48 kHz audio into a compact 25 Hz sequence of 128-dimensional latents. Through progressive multi-objective training, HoliTok jointly optimizes signal fidelity, semantic preservation, and latent-space learnability. The same latent sequence supports both high-quality speech synthesis and automatic speech recognition within a single framework. Among the representations evaluated, HoliTok is the only one that operates robustly in a unified autoregressive and diffusion transformer (AR+DiT) architecture without additional optimization tricks, while achieving competitive reconstruction fidelity and improved learnability for controllable generation.

📝 Abstract

Unified speech foundation models require a holistic tokenization space that is both learnable by language models and decodable into high-quality waveforms. Existing speech tokenizers, however, often fail to satisfy these requirements simultaneously, leading to increased architectural complexity and more involved training designs. We propose HoliTok, a continuous Holistic speech Tokenization model designed for unified generation-understanding modeling. HoliTok encodes 48 kHz speech into a compact 25 Hz sequence of 128-dimensional latents. It is trained with a progressive strategy that jointly preserves signal-level fidelity, incorporates semantic information, and maintains strong latent learnability. Based on this tokenization, we build a unified AR+DiT model for speech synthesis and recognition, where the same latent sequence supports both generation-specific and unified generation-understanding tasks. Experiments show that HoliTok achieves competitive reconstruction fidelity, improves generative learnability for high-quality and controllable synthesis, and, among the evaluated representations, is the only one that operates robustly in our unified generation-understanding architecture without additional optimization tricks. These results suggest that HoliTok serves as an effective speech tokenizer and a foundational representation interface for unified spoken language modeling. The code is available at: https://github.com/bovod-sjtu/HoliTok.
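
To make the stated rates concrete, the short sketch below works through the arithmetic implied by the abstract: a 48 kHz waveform mapped to a 25 Hz sequence of 128-dimensional latents. Only the three constants come from the abstract. The function name `latent_shape` and the choice to drop a partial trailing frame are assumptions made for illustration and are not taken from the HoliTok code, which may pad or handle boundaries differently.

```python
# Illustrative arithmetic only; not the HoliTok implementation.
SAMPLE_RATE_HZ = 48_000   # input audio rate (from the abstract)
FRAME_RATE_HZ = 25        # latent frame rate (from the abstract)
LATENT_DIM = 128          # latent dimensionality (from the abstract)

# Number of waveform samples summarized by one latent frame.
SAMPLES_PER_FRAME = SAMPLE_RATE_HZ // FRAME_RATE_HZ

# Scalar values per second before and after encoding.
VALUES_PER_SECOND_IN = SAMPLE_RATE_HZ
VALUES_PER_SECOND_OUT = FRAME_RATE_HZ * LATENT_DIM


def latent_shape(duration_s: float) -> tuple[int, int]:
    """Return (num_frames, latent_dim) for a clip of the given duration.

    Assumption: a partial trailing frame is dropped; a real encoder
    might pad the waveform instead.
    """
    num_samples = int(round(duration_s * SAMPLE_RATE_HZ))
    num_frames = num_samples // SAMPLES_PER_FRAME
    return num_frames, LATENT_DIM


if __name__ == "__main__":
    print("samples per latent frame:", SAMPLES_PER_FRAME)
    print("latent shape for a 10 s clip:", latent_shape(10.0))
    print("scalar reduction factor:", VALUES_PER_SECOND_IN / VALUES_PER_SECOND_OUT)
```

Under these assumptions each latent frame covers 1920 waveform samples (40 ms), and a 10-second clip becomes a 250 x 128 array. In raw scalar count that is a 15x reduction, although the latents are continuous vectors rather than discrete codes, so this figure is not a bitrate comparison.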

Problem

Research questions and friction points this paper is trying to address.

speech tokenization

unified speech modeling

speech generation

speech understanding

foundation models

Innovation

Methods, ideas, or system contributions that make the work stand out.

holistic tokenization

unified speech modeling

continuous latent representation