🤖 AI Summary
To address acoustic detail loss and prosodic degradation in low-frame-rate (~2.62 Hz) text-aligned speech tokenization, this paper proposes a speech token reconstruction framework that integrates Multi-Layer Dynamic Attention (MLDA) with Finite Scalar Quantization (FSQ). Methodologically, the authors freeze a pre-trained speech encoder to preserve low-level acoustic representations, design MLDA to enable text-position-adaptive aggregation of shallow and deep features, and employ FSQ to efficiently quantize salient acoustic information under extreme compression. Experiments demonstrate that the approach significantly outperforms TASTE on both in-domain and cross-domain datasets: while maintaining the ultra-low frame rate, it substantially improves the prosodic naturalness and audio fidelity of reconstructed speech. This work establishes a paradigm for efficient, high-fidelity speech tokenization that balances computational efficiency with perceptual quality, advancing practical applications in text-to-speech, voice conversion, and spoken language modeling.
📝 Abstract
We propose Text-Aligned Speech Tokens with Multiple Layer-Aggregation (TASLA), a text-aligned speech tokenization framework that addresses the loss of acoustic detail when single-source speech tokens are reconstructed under a low-frame-rate, text-aligned regime. Beyond reconstruction quality, this paper also explains how different encoder layers collaborate to capture comprehensive acoustic features for tokenization. Previous work, TASTE, proposed a text-aligned speech tokenization framework with an LM-friendly architecture, but it struggles to capture acoustic details. We address this trade-off with two components: Multi-Layer Dynamic Attention (MLDA), which lets each text position adaptively mix shallow and deep features from a frozen speech encoder, and Finite Scalar Quantization (FSQ), a simple per-dimension discretization with smooth optimization. At about 2.62 Hz (tokens/s), TASLA consistently improves prosody over TASTE and achieves competitive quality on in-domain (LibriSpeech) and out-of-domain (EXPRESSO, VoxCeleb) sets. We further demonstrate that dynamic layer mixing correlates with spectral flux, which explains why MLDA preserves prosody at a low frame rate under extreme feature compression.
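To make the two components concrete, here is a minimal, dependency-free sketch of the generic ideas behind them: per-position softmax mixing over encoder layers (the intuition behind MLDA) and finite scalar quantization of a bounded vector into a discrete code. Function names (`mlda_mix`, `fsq_quantize`, `fsq_code_index`), the `tanh` bounding, and the level counts are illustrative assumptions, not the paper's exact implementation.

```python
import math

def mlda_mix(layer_feats, layer_scores):
    """Layer aggregation sketch: softmax over one score per encoder layer,
    then a weighted sum of that layer's feature vector. In MLDA the scores
    would be predicted per text position; here they are given directly."""
    m = max(layer_scores)
    exps = [math.exp(s - m) for s in layer_scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(layer_feats[0])
    return [sum(w * f[d] for w, f in zip(weights, layer_feats))
            for d in range(dim)]

def fsq_quantize(z, levels):
    """FSQ sketch: bound each dimension to [-1, 1] (via tanh, an assumed
    choice), then round it to one of `levels[i]` uniformly spaced values.
    No learned codebook is needed; the code grid is fixed."""
    out = []
    for zi, num_levels in zip(z, levels):
        half = (num_levels - 1) / 2
        bounded = math.tanh(zi) * half      # map to [-half, half]
        out.append(round(bounded) / half)   # snap back into [-1, 1]
    return out

def fsq_code_index(q, levels):
    """Map a quantized vector to a single integer token id in a codebook
    of size prod(levels), by treating dimensions as mixed-radix digits."""
    idx, base = 0, 1
    for qi, num_levels in zip(q, levels):
        half = (num_levels - 1) / 2
        idx += int(round(qi * half + half)) * base
        base *= num_levels
    return idx
```

For example, with two dimensions of 5 levels each the implied codebook has 5 x 5 = 25 entries, and a saturated input such as `[10.0, -10.0]` snaps to the corner code `[1.0, -1.0]`. The design point this illustrates is why FSQ optimizes smoothly: each dimension is quantized independently on a fixed grid, so there is no codebook collapse to manage.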