Real-Time Streaming Mel Vocoding with Generative Flow Matching

📅 2025-09-18

🤖 AI Summary

This work addresses real-time streaming Mel vocoding for speech sampled at 16 kHz and proposes MelFlow, a low-latency generative vocoder that inverts Mel spectrograms to waveforms in a streaming fashion. Methodologically, it combines flow matching, a generative technique that enables efficient sample generation, with the pseudoinverse of the Mel filterbank operator, and builds on DiffPhase, the authors' prior work on generative STFT phase retrieval. This design achieves an algorithmic latency of only 32 ms and a total latency of 48 ms. Running on a consumer laptop GPU, MelFlow operates at >16× real-time speed while achieving substantially better PESQ and SI-SDR than non-streaming baselines such as HiFi-GAN. The result is a deployable streaming Mel vocoder suited to low-latency TTS systems.
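As a rough illustration of how flow-matching models generate samples in general, the sketch below integrates a velocity field from noise (t = 0) to data (t = 1) with a few Euler steps. It is not MelFlow's network or sampler: the straight-line path, the step count, and the names `euler_sample` and `toy_velocity` are assumptions, and a closed-form toy field stands in for a trained model.

```python
import numpy as np

def euler_sample(velocity, x0, n_steps=8):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        x = x + dt * velocity(x, t)
    return x

# Toy stand-in for a trained network (assumption): with a single target point
# and the straight path x_t = (1 - t) * x0 + t * x1, the velocity along the
# path is (x1 - x_t) / (1 - t).
target = np.array([0.5, -1.0, 2.0])

def toy_velocity(x, t):
    return (target - x) / (1.0 - t)

rng = np.random.default_rng(0)
noise = rng.standard_normal(target.shape)   # starting point x0
sample = euler_sample(toy_velocity, noise)  # lands on `target`
```

In a real vocoder the velocity would come from a neural network conditioned on the Mel spectrogram, and the number of integration steps trades quality against compute.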

📝 Abstract

The task of Mel vocoding, i.e., the inversion of a Mel magnitude spectrogram to an audio waveform, is still a key component in many text-to-speech (TTS) systems today. Based on generative flow matching, our prior work on generative STFT phase retrieval (DiffPhase), and the pseudoinverse operator of the Mel filterbank, we develop MelFlow, a streaming-capable generative Mel vocoder for speech sampled at 16 kHz with an algorithmic latency of only 32 ms and a total latency of 48 ms. We show real-time streaming capability at this latency not only in theory, but in practice on a consumer laptop GPU. Furthermore, we show that our model achieves substantially better PESQ and SI-SDR values than well-established baselines for Mel vocoding that are not streaming-capable, including HiFi-GAN.
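To make the role of the Mel filterbank pseudoinverse concrete, here is a minimal NumPy sketch that maps a Mel magnitude spectrogram back to an approximate STFT magnitude. The filterbank construction (HTK-style Mel scale, triangular filters), the sizes `n_fft=512` and `n_mels=80`, and all function names are assumptions for illustration; the abstract does not state MelFlow's actual configuration. Phase is not recovered here, which is the part the paper handles with its generative model.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(sr=16000, n_fft=512, n_mels=80):
    """Triangular Mel filterbank of shape (n_mels, n_fft // 2 + 1)."""
    n_bins = n_fft // 2 + 1
    bin_freqs = np.linspace(0.0, sr / 2.0, n_bins)
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, n_bins))
    for i in range(n_mels):
        lo, ctr, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (bin_freqs - lo) / (ctr - lo)
        falling = (hi - bin_freqs) / (hi - ctr)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

M = mel_filterbank()            # (80, 257): STFT magnitude -> Mel
M_pinv = np.linalg.pinv(M)      # (257, 80): Mel -> approximate STFT magnitude

rng = np.random.default_rng(0)
stft_mag = rng.random((257, 10))             # stand-in for |STFT|, 10 frames
mel = M @ stft_mag                           # forward Mel projection
stft_mag_est = np.maximum(M_pinv @ mel, 0.0) # clip: magnitudes are nonnegative
```

Because the filterbank has fewer rows than columns, the pseudoinverse gives only a smoothed estimate of the magnitude; a generative model is then needed to restore the missing detail and the phase.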

Problem

Research questions and friction points this paper is trying to address.

Real-time streaming Mel vocoding for low-latency TTS

Inverting Mel spectrograms to high-quality audio waveforms

Achieving superior audio quality over non-streaming baselines

Innovation

Methods, ideas, or system contributions that make the work stand out.

Generative flow matching for Mel vocoding

Streaming-capable with 48 ms total latency

Real-time performance on consumer GPU
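For a sense of scale, the stated latencies can be converted to sample counts at the 16 kHz sampling rate. The helper name `ms_to_samples` is hypothetical, and the source does not break down what accounts for the gap between algorithmic and total latency, so the sketch only computes its size.

```python
SAMPLE_RATE_HZ = 16_000

def ms_to_samples(latency_ms, sample_rate_hz=SAMPLE_RATE_HZ):
    """Convert a latency in milliseconds to a whole number of samples."""
    return latency_ms * sample_rate_hz // 1000

algorithmic = ms_to_samples(32)   # 512 samples
total = ms_to_samples(48)         # 768 samples
extra = total - algorithmic       # 256 samples, i.e. 16 ms
```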

🔎 Similar Papers

FlowMAC: Conditional Flow Matching for Audio Coding at Low Bit Rates

2024-09-26, IEEE International Conference on Acoustics, Speech, and Signal Processing, Citations: 0