BinauralFlow: A Causal and Streamable Approach for High-Quality Binaural Speech Synthesis with Flow Matching Models

📅 2025-05-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses high-fidelity, streaming-capable binaural speech synthesis. We propose the first generative—rather than regression-based—approach to binaural rendering. Our method introduces: (1) a causal U-Net architecture enforcing strict temporal causality; and (2) a continuous streaming STFT/ISTFT inference pipeline incorporating a buffer bank, midpoint solver, and early-skip scheduling strategy to achieve low-latency, acoustically coherent real-time synthesis. Built upon a conditional flow matching framework, the model takes monaural speech and speaker/listener spatial coordinates as input and synthesizes high-fidelity binaural audio end-to-end. Quantitative evaluations and subjective listening tests demonstrate state-of-the-art performance: objective metrics surpass prior methods, and perceptual experiments yield a 42% confusion rate with authentic recordings—approaching the fidelity of real binaural recordings.

Technology Category

Application Category

📝 Abstract
Binaural rendering aims to synthesize binaural audio that mimics natural hearing based on a mono audio and the locations of the speaker and listener. Although many methods have been proposed to solve this problem, they struggle with rendering quality and streamable inference. Synthesizing high-quality binaural audio that is indistinguishable from real-world recordings requires precise modeling of binaural cues, room reverb, and ambient sounds. Additionally, real-world applications demand streaming inference. To address these challenges, we propose a flow matching based streaming binaural speech synthesis framework called BinauralFlow. We consider binaural rendering to be a generation problem rather than a regression problem and design a conditional flow matching model to render high-quality audio. Moreover, we design a causal U-Net architecture that estimates the current audio frame solely based on past information to tailor generative models for streaming inference. Finally, we introduce a continuous inference pipeline incorporating streaming STFT/ISTFT operations, a buffer bank, a midpoint solver, and an early skip schedule to improve rendering continuity and speed. Quantitative and qualitative evaluations demonstrate the superiority of our method over SOTA approaches. A perceptual study further reveals that our model is nearly indistinguishable from real-world recordings, with a $42%$ confusion rate.
Problem

Research questions and friction points this paper is trying to address.

Improves binaural speech synthesis quality using flow matching models
Enables real-time streaming inference for binaural audio rendering
Models binaural cues, room reverb, and ambient sounds accurately
Innovation

Methods, ideas, or system contributions that make the work stand out.

Flow matching model for binaural synthesis
Causal U-Net for streaming inference
Continuous pipeline with STFT and buffer
🔎 Similar Papers
No similar papers found.
Susan Liang
Susan Liang
University of Rochester
Computer Vision
D
Dejan Markovic
Codec Avatars Lab, Meta, PA, USA
I
I. D. Gebru
Codec Avatars Lab, Meta, PA, USA
S
Steven Krenn
Codec Avatars Lab, Meta, PA, USA
T
Todd Keebler
Codec Avatars Lab, Meta, PA, USA
J
Jacob Sandakly
Codec Avatars Lab, Meta, PA, USA
F
Frank Yu
Codec Avatars Lab, Meta, PA, USA
S
Samuel Hassel
Codec Avatars Lab, Meta, PA, USA
Chenliang Xu
Chenliang Xu
Associate Professor of Computer Science, University of Rochester
Computer VisionMultimodal LearningVideo UnderstandingVision and Language
Alexander Richard
Alexander Richard
Research Scientist, Facebook Reality Labs
Audio processingNeural NetworksAction RecognitionDeep LearningOptimization