🤖 AI Summary
To address acoustic detail loss and prosodic degradation in low-frame-rate (~2.62 Hz) text-aligned speech tokenization, this paper proposes a speech token reconstruction framework that integrates Multi-Layer Dynamic Attention (MLDA) with Finite Scalar Quantization (FSQ). Methodologically, the authors freeze a pre-trained speech encoder to preserve low-level acoustic representations, design MLDA to enable text-position-adaptive aggregation of shallow and deep features, and employ FSQ to efficiently quantize salient acoustic information under extreme compression. Experiments demonstrate that the approach significantly outperforms TASTE on both in-domain and cross-domain datasets: while maintaining the ultra-low frame rate, it substantially improves the prosodic naturalness and audio fidelity of reconstructed speech. This work establishes a paradigm for efficient, high-fidelity speech tokenization that balances computational efficiency with perceptual quality, advancing practical applications in text-to-speech, voice conversion, and spoken language modeling.
📝 Abstract
We propose Text-Aligned Speech Tokens with Multiple Layer-Aggregation (TASLA), a text-aligned speech tokenization framework that addresses the loss of acoustic detail when single-source speech tokens are reconstructed under a low-frame-rate, text-aligned regime. Beyond reconstruction quality, this paper also explains how different encoder layers collaborate to capture comprehensive acoustic features for tokenization. Previous work, TASTE, proposed a text-aligned speech tokenization framework with an LM-friendly architecture, but it struggles to capture acoustic details. We address this trade-off with two components: Multi-Layer Dynamic Attention (MLDA), which lets each text position adaptively mix shallow and deep features from a frozen speech encoder, and Finite Scalar Quantization (FSQ), a simple per-dimension discretization with smooth optimization. At about 2.62 Hz (tokens/s), TASLA consistently improves prosody over TASTE and achieves competitive quality on in-domain (LibriSpeech) and out-of-domain (EXPRESSO, VoxCeleb) sets. We further demonstrate that dynamic layer mixing correlates with spectral flux, which explains why MLDA preserves prosody at a low frame rate under extreme feature compression.
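To make the two components concrete, here is a minimal, dependency-free sketch of the generic ideas behind them: per-position softmax mixing over encoder layers (the intuition behind MLDA) and finite scalar quantization of a bounded vector into a discrete code. Function names (`mlda_mix`, `fsq_quantize`, `fsq_code_index`), the `tanh` bounding, and the level counts are illustrative assumptions, not the paper's exact implementation.

```python
import math

def mlda_mix(layer_feats, layer_scores):
    """Layer aggregation sketch: softmax over one score per encoder layer,
    then a weighted sum of that layer's feature vector. In MLDA the scores
    would be predicted per text position; here they are given directly."""
    m = max(layer_scores)
    exps = [math.exp(s - m) for s in layer_scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(layer_feats[0])
    return [sum(w * f[d] for w, f in zip(weights, layer_feats))
            for d in range(dim)]

def fsq_quantize(z, levels):
    """FSQ sketch: bound each dimension to [-1, 1] (via tanh, an assumed
    choice), then round it to one of `levels[i]` uniformly spaced values.
    No learned codebook is needed; the code grid is fixed."""
    out = []
    for zi, num_levels in zip(z, levels):
        half = (num_levels - 1) / 2
        bounded = math.tanh(zi) * half      # map to [-half, half]
        out.append(round(bounded) / half)   # snap back into [-1, 1]
    return out

def fsq_code_index(q, levels):
    """Map a quantized vector to a single integer token id in a codebook
    of size prod(levels), by treating dimensions as mixed-radix digits."""
    idx, base = 0, 1
    for qi, num_levels in zip(q, levels):
        half = (num_levels - 1) / 2
        idx += int(round(qi * half + half)) * base
        base *= num_levels
    return idx
```

For example, with two dimensions of 5 levels each the implied codebook has 5 x 5 = 25 entries, and a saturated input such as `[10.0, -10.0]` snaps to the corner code `[1.0, -1.0]`. The design point this illustrates is why FSQ optimizes smoothly: each dimension is quantized independently on a fixed grid, so there is no codebook collapse to manage.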