StreamFlow: Streaming Flow Matching with Block-wise Guided Attention Mask for Speech Token Decoding

📅 2025-06-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Discrete-token speech generation faces a fundamental trade-off between streaming capability and audio quality: flow-matching decoders depend on a global receptive field, which blocks streaming, while naive token-by-token generation degrades fidelity. Method: This paper proposes an efficient streaming decoding framework based on flow matching with diffusion transformers (DiT). Its core innovation is a block-wise guided attention masking mechanism that partitions the sequence into blocks and enforces local attention constraints, combined hierarchically across DiT blocks to regulate the overall receptive field and model historical dependencies without the overhead of full-sequence attention. Contribution/Results: Experiments demonstrate a first-packet latency of only 180 ms, well-controlled inference time on long sequences, and speech quality (MOS) on par with non-streaming baselines. The framework enables real-time, interactive speech synthesis without compromising perceptual fidelity.

📝 Abstract
Recent advancements in discrete token-based speech generation have highlighted the importance of token-to-waveform generation for audio quality, particularly in real-time interactions. Traditional frameworks integrating semantic tokens with flow matching (FM) struggle with streaming capabilities due to their reliance on a global receptive field. Additionally, directly implementing token-by-token streaming speech generation often results in degraded audio quality. To address these challenges, we propose StreamFlow, a novel neural architecture that facilitates streaming flow matching with diffusion transformers (DiT). To mitigate the long-sequence extrapolation issues arising from lengthy historical dependencies, we design a local block-wise receptive field strategy. Specifically, the sequence is first segmented into blocks, and we introduce block-wise attention masks that enable the current block to receive information from the previous or subsequent block. These attention masks are combined hierarchically across different DiT-blocks to regulate the receptive field of DiTs. Both subjective and objective experimental results demonstrate that our approach achieves performance comparable to non-streaming methods while surpassing other streaming methods in terms of speech quality, all the while effectively managing inference time during long-sequence generation. Furthermore, our method achieves a notable first-packet latency of only 180 ms. (Speech samples: https://dukguo.github.io/StreamFlow/)
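The abstract's block-wise masking idea can be illustrated with a small sketch: positions are grouped into fixed-size blocks, and each position may attend only to positions whose block lies within a bounded window around its own. The function below is a hypothetical reconstruction for illustration; the parameter names (`block_size`, `lookback`, `lookahead`) and the exact window shape are assumptions, not the paper's implementation.

```python
def block_attention_mask(seq_len, block_size, lookback=1, lookahead=0):
    """Boolean mask where entry [i][j] is True iff position i may attend to j.

    Position i (in block b_i) attends to position j (in block b_j) only when
    b_i - lookback <= b_j <= b_i + lookahead. Setting lookahead=0 keeps the
    mask streaming-safe (no future blocks); lookahead=1 would also admit the
    subsequent block, as the abstract mentions.
    """
    block_of = [i // block_size for i in range(seq_len)]
    return [
        [-lookback <= block_of[j] - block_of[i] <= lookahead
         for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Example: 6 frames, blocks of 2, each block sees itself and one previous block.
mask = block_attention_mask(6, 2, lookback=1, lookahead=0)
```

Stacking several DiT layers that each use such a local mask grows the effective receptive field linearly with depth (roughly `depth * lookback` blocks of history), which is presumably how the hierarchical combination bounds latency while still propagating long-range context.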
Problem

Research questions and friction points this paper is trying to address.

Enables real-time token-to-waveform streaming speech generation
Solves audio degradation in token-by-token streaming synthesis
Reduces long-sequence extrapolation issues with block-wise attention
Innovation

Methods, ideas, or system contributions that make the work stand out.

Streaming flow matching with diffusion transformers
Block-wise guided attention mask strategy
Hierarchical receptive field regulation
Dake Guo
Northwestern Polytechnical University
Speech Processing · Speech Synthesis
Jixun Yao
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China
Linhan Ma
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China
Wang He
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China
Lei Xie
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China