🤖 AI Summary
This work addresses high-fidelity, streaming-capable binaural speech synthesis. We propose the first generative (rather than regression-based) approach to binaural rendering. Our method introduces: (1) a causal U-Net architecture that enforces strict temporal causality; and (2) a continuous streaming STFT/ISTFT inference pipeline incorporating a buffer bank, a midpoint solver, and an early-skip scheduling strategy, achieving low-latency, acoustically coherent real-time synthesis. Built on a conditional flow matching framework, the model takes monaural speech and speaker/listener spatial coordinates as input and synthesizes high-fidelity binaural audio end to end. Quantitative evaluations and subjective listening tests demonstrate state-of-the-art performance: objective metrics surpass prior methods, and perceptual experiments yield a 42% confusion rate with authentic recordings, approaching the fidelity of real binaural recordings.
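The causality constraint behind the causal U-Net can be illustrated with a one-dimensional convolution that is left-padded so each output frame depends only on past samples. This is a minimal sketch of the idea, not the paper's architecture; the function name and kernel are illustrative.

```python
import numpy as np

def causal_conv1d(x, kernel):
    """1-D convolution whose output at frame t depends only on
    x[t-K+1 .. t], never on future frames. Causality is enforced by
    zero-padding on the left (past) side only."""
    k = len(kernel)
    padded = np.concatenate([np.zeros(k - 1), x])  # past-only padding
    # flip the kernel so this is a true convolution, then slide it
    return np.array([padded[t:t + k] @ kernel[::-1] for t in range(len(x))])

y = causal_conv1d(np.array([1.0, 2.0, 3.0]), np.array([0.5, 0.5]))
# y[0] is computed from x[0] alone (plus zero padding), so a streaming
# renderer can emit it before x[1] arrives.
```

Stacking such layers (with downsampling and upsampling, as in a U-Net) keeps the overall receptive field strictly in the past, which is what makes frame-by-frame streaming inference possible.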
📝 Abstract
Binaural rendering aims to synthesize binaural audio that mimics natural hearing from mono audio and the locations of the speaker and listener. Although many methods have been proposed for this problem, they struggle with rendering quality and streamable inference. Synthesizing high-quality binaural audio that is indistinguishable from real-world recordings requires precise modeling of binaural cues, room reverberation, and ambient sounds. Additionally, real-world applications demand streaming inference. To address these challenges, we propose a flow-matching-based streaming binaural speech synthesis framework called BinauralFlow. We treat binaural rendering as a generation problem rather than a regression problem and design a conditional flow matching model to render high-quality audio. Moreover, to tailor generative models for streaming inference, we design a causal U-Net architecture that estimates the current audio frame solely from past information. Finally, we introduce a continuous inference pipeline incorporating streaming STFT/ISTFT operations, a buffer bank, a midpoint solver, and an early-skip schedule to improve rendering continuity and speed. Quantitative and qualitative evaluations demonstrate the superiority of our method over SOTA approaches. A perceptual study further reveals that our model is nearly indistinguishable from real-world recordings, with a 42% confusion rate.
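At inference time, a flow matching model is sampled by integrating an ODE driven by the learned velocity field from noise toward data. The midpoint solver mentioned above is the second-order Runge-Kutta method; the early-skip schedule starts the integration at a later timestep to save steps. Below is a hedged sketch of that integration loop with a toy velocity field standing in for the learned network; the function names, step counts, and the linear field are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def midpoint_solve(velocity, x0, cond, n_steps=8, early_skip=0):
    """Integrate dx/dt = velocity(x, t, cond) from t=0 to t=1 with the
    midpoint (RK2) method. `early_skip` drops the first few steps, an
    early-skip schedule that trades a little accuracy for speed."""
    ts = np.linspace(0.0, 1.0, n_steps + 1)
    x = x0
    for i in range(early_skip, n_steps):
        t, dt = ts[i], ts[i + 1] - ts[i]
        x_mid = x + 0.5 * dt * velocity(x, t, cond)          # half step
        x = x + dt * velocity(x_mid, t + 0.5 * dt, cond)     # full step
    return x

# Toy linear velocity field pulling x toward the conditioning target;
# in the actual system this would be the causal U-Net conditioned on
# the mono input and spatial coordinates.
v = lambda x, t, c: c - x
out = midpoint_solve(v, x0=np.zeros(3), cond=np.ones(3))
```

For this toy field the exact solution at t=1 is 1 - e^{-1} ≈ 0.632 per coordinate, and eight midpoint steps land within about 1e-3 of it, which is why a second-order solver can keep the step count (and thus latency) low.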