StreamFlow: Streaming Flow Matching with Block-wise Guided Attention Mask for Speech Token Decoding

📅 2025-06-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Discrete-token speech generation faces a fundamental trade-off between streaming capability and audio quality: flow-matching decoders depend on a global receptive field, which blocks streaming, while naive token-by-token generation degrades fidelity. Method: This paper proposes an efficient streaming decoding framework based on flow matching with diffusion transformers (DiT). Its core innovation is a block-wise guided attention masking mechanism that partitions the sequence into blocks and enforces local attention constraints, combined hierarchically across DiT blocks to regulate the overall receptive field and model historical dependencies without the overhead of full-sequence attention. Contribution/Results: Experiments demonstrate a first-packet latency of only 180 ms, well-controlled inference time on long sequences, and speech quality (MOS) on par with non-streaming baselines. The framework enables real-time, interactive speech synthesis without compromising perceptual fidelity.

📝 Abstract
Recent advancements in discrete token-based speech generation have highlighted the importance of token-to-waveform generation for audio quality, particularly in real-time interactions. Traditional frameworks integrating semantic tokens with flow matching (FM) struggle with streaming capabilities due to their reliance on a global receptive field. Additionally, directly implementing token-by-token streaming speech generation often results in degraded audio quality. To address these challenges, we propose StreamFlow, a novel neural architecture that facilitates streaming flow matching with diffusion transformers (DiT). To mitigate the long-sequence extrapolation issues arising from lengthy historical dependencies, we design a local block-wise receptive field strategy. Specifically, the sequence is first segmented into blocks, and we introduce block-wise attention masks that enable the current block to receive information from the previous or subsequent block. These attention masks are combined hierarchically across different DiT-blocks to regulate the receptive field of DiTs. Both subjective and objective experimental results demonstrate that our approach achieves performance comparable to non-streaming methods while surpassing other streaming methods in terms of speech quality, all the while effectively managing inference time during long-sequence generation. Furthermore, our method achieves a notable first-packet latency of only 180 ms. (Speech samples: https://dukguo.github.io/StreamFlow/)
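The abstract's block-wise masking idea can be illustrated with a small sketch: positions are grouped into fixed-size blocks, and each position may attend only to positions whose block lies within a bounded window around its own. The function below is a hypothetical reconstruction for illustration; the parameter names (`block_size`, `lookback`, `lookahead`) and the exact window shape are assumptions, not the paper's implementation.

```python
def block_attention_mask(seq_len, block_size, lookback=1, lookahead=0):
    """Boolean mask where entry [i][j] is True iff position i may attend to j.

    Position i (in block b_i) attends to position j (in block b_j) only when
    b_i - lookback <= b_j <= b_i + lookahead. Setting lookahead=0 keeps the
    mask streaming-safe (no future blocks); lookahead=1 would also admit the
    subsequent block, as the abstract mentions.
    """
    block_of = [i // block_size for i in range(seq_len)]
    return [
        [-lookback <= block_of[j] - block_of[i] <= lookahead
         for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Example: 6 frames, blocks of 2, each block sees itself and one previous block.
mask = block_attention_mask(6, 2, lookback=1, lookahead=0)
```

Stacking several DiT layers that each use such a local mask grows the effective receptive field linearly with depth (roughly `depth * lookback` blocks of history), which is presumably how the hierarchical combination bounds latency while still propagating long-range context.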
Problem

Research questions and friction points this paper is trying to address.

Enables real-time token-to-waveform streaming speech generation
Solves audio degradation in token-by-token streaming synthesis
Reduces long-sequence extrapolation issues with block-wise attention
Innovation

Methods, ideas, or system contributions that make the work stand out.

Streaming flow matching with diffusion transformers
Block-wise guided attention mask strategy
Hierarchical receptive field regulation
Dake Guo
Northwestern Polytechnical University
Speech Processing · Speech Synthesis
Jixun Yao
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China
Linhan Ma
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China
Wang He
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China
Lei Xie
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China