🤖 AI Summary
This paper addresses the challenge of achieving both low latency and high quality in streaming multimodal sequence-to-sequence modeling. We propose Delayed Streams Modeling (DSM), which moves input–output temporal alignment to a pre-processing step, models the resulting time-aligned streams with a decoder-only language model, and introduces appropriate delays between streams, enabling streaming inference over arbitrarily long sequences. Evaluated on automatic speech recognition (ASR) and text-to-speech (TTS), DSM reports state-of-the-art performance and latency among streaming approaches and remains competitive with offline baselines. Because it can generate arbitrary output sequences from any combination of input streams, the formulation applies to many sequence-to-sequence problems.
📝 Abstract
We introduce Delayed Streams Modeling (DSM), a flexible formulation for streaming, multimodal sequence-to-sequence learning. Sequence-to-sequence generation is often cast in an offline manner, where the model consumes the complete input sequence before generating the first output timestep. Alternatively, streaming sequence-to-sequence models rely on learning a policy for choosing when to advance on the input stream or write to the output stream. DSM instead models already time-aligned streams with a decoder-only language model. By moving the alignment to a pre-processing step and introducing appropriate delays between streams, DSM provides streaming inference of arbitrary output sequences from any input combination, making it applicable to many sequence-to-sequence problems. In particular, given text and audio streams, automatic speech recognition (ASR) corresponds to the text stream being delayed, while the opposite gives a text-to-speech (TTS) model. We perform extensive experiments for these two major sequence-to-sequence tasks, showing that DSM provides state-of-the-art performance and latency while supporting arbitrarily long sequences, and is even competitive with offline baselines. Code, samples and demos are available at https://github.com/kyutai-labs/delayed-streams-modeling
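To make the role of the inter-stream delay concrete, here is a minimal sketch of the idea, not the authors' implementation. It assumes both streams have already been aligned to a common frame grid during pre-processing, and it shifts one stream by a fixed number of steps. The helper `apply_delays`, the `<pad>` token, the toy tokens, and the delay of 2 steps are all hypothetical choices made for illustration.

```python
from typing import Dict, List

PAD = "<pad>"  # hypothetical padding token, not taken from the paper


def apply_delays(streams: Dict[str, List[str]],
                 delays: Dict[str, int],
                 pad: str = PAD) -> Dict[str, List[str]]:
    """Shift each time-aligned stream right by its delay (in steps).

    All streams are padded to a common length so that step t of every
    stream can be fed to a single model jointly.
    """
    if any(d < 0 for d in delays.values()):
        raise ValueError("delays must be non-negative")
    longest = max(len(s) for s in streams.values())
    total = longest + max(delays.get(name, 0) for name in streams)
    delayed = {}
    for name, tokens in streams.items():
        shifted = [pad] * delays.get(name, 0) + list(tokens)
        shifted += [pad] * (total - len(shifted))
        delayed[name] = shifted
    return delayed


# Toy streams on a shared frame grid: one audio token per frame, and a
# text token placed on the frame where the word starts (padding elsewhere).
audio = ["a0", "a1", "a2", "a3"]
text = ["hi", PAD, "yo", PAD]

# ASR-style layout: the text stream is delayed, so each word is predicted
# only after some of its audio has been observed.
asr = apply_delays({"audio": audio, "text": text}, {"text": 2})

# TTS-style layout: the audio stream is delayed instead.
tts = apply_delays({"audio": audio, "text": text}, {"audio": 2})

for step, (a, t) in enumerate(zip(asr["audio"], asr["text"])):
    print(step, a, t)
```

In this sketch the same pair of streams yields either task layout simply by choosing which stream to delay, which mirrors the ASR/TTS symmetry described in the abstract. How the delay is chosen and how the streams are tokenized are left open here.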