🤖 AI Summary
Existing speech language models are hindered by the mismatch in sequence lengths between text and speech modalities and lack support for streaming processing, making low-latency real-time applications challenging. This work proposes TASTE-S, the first end-to-end streaming model that achieves text-aligned speech tokenization. By integrating a CTC-based ASR module with a causal encoder for dual-modal online encoding and designing a unit decoder amenable to streaming decoding, TASTE-S enables real-time speech embedding without relying on external ASR systems. A joint training strategy allows the model to maintain performance comparable to TASTE while significantly reducing latency, demonstrating enhanced transcription robustness and improved capability in handling long-form speech.
📝 Abstract
Text-speech joint spoken language modeling (SLM) aims at natural and intelligent speech-based interactions, but developing such a system suffers from a modality mismatch: speech unit sequences are much longer than text token sequences. Prior work reduces this gap with text-aligned tokenization and embedding (TASTE), producing speech tokens that align in length with their textual counterparts. However, the dependence on an external ASR system and the use of a non-causal decoder limit streaming use. To address this limitation, we propose TASTE-S, a streamable extension of TASTE suitable for real-time usage. TASTE-S integrates a CTC-based ASR module into the encoder for instant dual-modality encoding. We also redesign the unit decoder to enable on-the-fly decoding. With joint training, we show that TASTE-S matches TASTE's performance while significantly reducing latency. Further investigations reveal that TASTE-S remains robust to transcriptions and enables long-form encoding and decoding.
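The core idea of text-aligned tokenization can be illustrated with a minimal sketch. This is not the paper's implementation; it only shows, under assumed names and shapes, how a greedy CTC alignment can segment frame-level encoder features and pool each segment into one embedding per text token, so the speech embedding sequence matches the text length:

```python
# Illustrative sketch only (not TASTE-S itself): CTC-guided pooling that
# maps per-frame speech features to one embedding per emitted text token.
import numpy as np

BLANK = 0  # assumed CTC blank id

def ctc_aligned_pool(frame_logits, frame_feats):
    """Greedy CTC best path -> one mean-pooled embedding per token.

    frame_logits: (T, V) per-frame CTC logits
    frame_feats:  (T, D) per-frame encoder features
    Returns (tokens, embeddings) with len(tokens) == embeddings.shape[0].
    """
    path = frame_logits.argmax(axis=-1)      # greedy best path
    tokens, spans = [], []
    prev = BLANK
    for t, p in enumerate(path):
        if p != BLANK and p != prev:         # a new token starts at frame t
            tokens.append(int(p))
            spans.append([t, t + 1])
        elif p != BLANK and p == prev:       # same token continues
            spans[-1][1] = t + 1
        prev = p
    # Mean-pool the frames of each token span: the pooled sequence now has
    # the same length as the text token sequence (the "text-aligned" part).
    if spans:
        embs = np.stack([frame_feats[s:e].mean(axis=0) for s, e in spans])
    else:
        embs = np.zeros((0, frame_feats.shape[1]))
    return tokens, embs
```

Because both the argmax path and the pooling only consume frames already seen, a causal encoder can in principle emit these token-level embeddings incrementally, which is what makes the streaming setting plausible.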