Streaming Sequence-to-Sequence Learning with Delayed Streams Modeling

📅 2025-09-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper addresses the challenge of achieving both low latency and high quality in streaming multimodal sequence-to-sequence modeling. It proposes Delayed Streams Modeling (DSM), which moves input–output temporal alignment to a preprocessing stage, models the resulting time-aligned streams with a decoder-only language model, and introduces delays between streams so that streaming inference works over arbitrarily long sequences. Evaluated on automatic speech recognition (ASR) and text-to-speech (TTS), DSM reports state-of-the-art performance and latency among streaming approaches and remains competitive with offline baselines. The same formulation covers both tasks: delaying the text stream yields ASR, while delaying the audio stream yields TTS.

📝 Abstract

We introduce Delayed Streams Modeling (DSM), a flexible formulation for streaming, multimodal sequence-to-sequence learning. Sequence-to-sequence generation is often cast in an offline manner, where the model consumes the complete input sequence before generating the first output timestep. Alternatively, streaming sequence-to-sequence models rely on learning a policy for choosing when to advance on the input stream or write to the output stream. DSM instead models already time-aligned streams with a decoder-only language model. By moving the alignment to a pre-processing step and introducing appropriate delays between streams, DSM provides streaming inference of arbitrary output sequences from any input combination, making it applicable to many sequence-to-sequence problems. In particular, given text and audio streams, automatic speech recognition (ASR) corresponds to the text stream being delayed, while the opposite gives a text-to-speech (TTS) model. We perform extensive experiments on these two major sequence-to-sequence tasks, showing that DSM provides state-of-the-art performance and latency while supporting arbitrarily long sequences, and is even competitive with offline baselines. Code, samples and demos are available at https://github.com/kyutai-labs/delayed-streams-modeling
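
The role of the inter-stream delay can be illustrated with a minimal sketch. It assumes the pre-processing step has already produced an audio stream and a text stream of equal length on a shared frame grid, with a padding token in text frames where no word starts. The function name `delay_streams`, the pad tokens, and the toy frames are hypothetical and are not taken from the paper's code release.

```python
from typing import List, Sequence, Tuple


def delay_streams(
    audio: Sequence[str],
    text: Sequence[str],
    text_delay: int,
    audio_pad: str = "<a_pad>",
    text_pad: str = "<pad>",
) -> List[Tuple[str, str]]:
    """Shift one of two time-aligned streams relative to the other.

    A positive `text_delay` delays the text stream (ASR-like setup: the
    model has seen extra audio frames before it must emit each text token).
    A negative value delays the audio stream instead (TTS-like setup).
    Illustrative helper only; names and pad tokens are assumptions.
    """
    if len(audio) != len(text):
        raise ValueError("streams must already be aligned to the same length")

    shift = abs(text_delay)
    if text_delay >= 0:
        # Text is pushed later in time; audio is padded at the end.
        audio_out = list(audio) + [audio_pad] * shift
        text_out = [text_pad] * shift + list(text)
    else:
        # Audio is pushed later in time; text is padded at the end.
        audio_out = [audio_pad] * shift + list(audio)
        text_out = list(text) + [text_pad] * shift

    # One (audio, text) pair per time step, as a decoder-only model
    # would consume them step by step.
    return list(zip(audio_out, text_out))


if __name__ == "__main__":
    audio_frames = ["a0", "a1", "a2", "a3"]
    text_frames = ["hello", "<pad>", "world", "<pad>"]

    for step in delay_streams(audio_frames, text_frames, text_delay=2):
        print("ASR-like:", step)
    for step in delay_streams(audio_frames, text_frames, text_delay=-2):
        print("TTS-like:", step)
```

In this toy setup the only difference between the two tasks is the sign of the delay, which mirrors the abstract's description of ASR and TTS as the same pair of streams with opposite delays.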

Problem

Research questions and friction points this paper is trying to address.

Streaming sequence-to-sequence learning with delayed alignment

Modeling multimodal inputs for real-time generation tasks

Handling arbitrarily long sequences with low-latency inference

Innovation

Methods, ideas, or system contributions that make the work stand out.

Delayed Streams Modeling for streaming multimodal learning

Decoder-only language model with pre-aligned streams

Introduces delays between streams for flexible inference

🔎 Similar Papers

Streaming Sequence Transduction through Dynamic Compression