🤖 AI Summary
This work addresses high-fidelity, streaming-capable binaural speech synthesis. We propose the first generative (rather than regression-based) approach to binaural rendering. Our method introduces: (1) a causal U-Net architecture that enforces strict temporal causality; and (2) a continuous streaming STFT/ISTFT inference pipeline incorporating a buffer bank, a midpoint solver, and an early-skip scheduling strategy, achieving low-latency, acoustically coherent real-time synthesis. Built on a conditional flow matching framework, the model takes monaural speech and speaker/listener spatial coordinates as input and synthesizes high-fidelity binaural audio end to end. Quantitative evaluations and subjective listening tests demonstrate state-of-the-art performance: objective metrics surpass prior methods, and perceptual experiments yield a 42% confusion rate with authentic recordings, approaching the fidelity of real binaural recordings.
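The causality constraint behind the causal U-Net can be illustrated with a one-dimensional convolution that is left-padded so each output frame depends only on past samples. This is a minimal sketch of the idea, not the paper's architecture; the function name and kernel are illustrative.

```python
import numpy as np

def causal_conv1d(x, kernel):
    """1-D convolution whose output at frame t depends only on
    x[t-K+1 .. t], never on future frames. Causality is enforced by
    zero-padding on the left (past) side only."""
    k = len(kernel)
    padded = np.concatenate([np.zeros(k - 1), x])  # past-only padding
    # flip the kernel so this is a true convolution, then slide it
    return np.array([padded[t:t + k] @ kernel[::-1] for t in range(len(x))])

y = causal_conv1d(np.array([1.0, 2.0, 3.0]), np.array([0.5, 0.5]))
# y[0] is computed from x[0] alone (plus zero padding), so a streaming
# renderer can emit it before x[1] arrives.
```

Stacking such layers (with downsampling and upsampling, as in a U-Net) keeps the overall receptive field strictly in the past, which is what makes frame-by-frame streaming inference possible.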
📝 Abstract
Binaural rendering aims to synthesize binaural audio that mimics natural hearing from mono audio and the locations of the speaker and listener. Although many methods have been proposed for this problem, they struggle with rendering quality and streamable inference. Synthesizing high-quality binaural audio that is indistinguishable from real-world recordings requires precise modeling of binaural cues, room reverberation, and ambient sounds. Additionally, real-world applications demand streaming inference. To address these challenges, we propose a flow-matching-based streaming binaural speech synthesis framework called BinauralFlow. We treat binaural rendering as a generation problem rather than a regression problem and design a conditional flow matching model to render high-quality audio. Moreover, to tailor generative models for streaming inference, we design a causal U-Net architecture that estimates the current audio frame solely from past information. Finally, we introduce a continuous inference pipeline incorporating streaming STFT/ISTFT operations, a buffer bank, a midpoint solver, and an early-skip schedule to improve rendering continuity and speed. Quantitative and qualitative evaluations demonstrate the superiority of our method over SOTA approaches. A perceptual study further reveals that our model is nearly indistinguishable from real-world recordings, with a 42% confusion rate.
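At inference time, a flow matching model is sampled by integrating an ODE driven by the learned velocity field from noise toward data. The midpoint solver mentioned above is the second-order Runge-Kutta method; the early-skip schedule starts the integration at a later timestep to save steps. Below is a hedged sketch of that integration loop with a toy velocity field standing in for the learned network; the function names, step counts, and the linear field are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def midpoint_solve(velocity, x0, cond, n_steps=8, early_skip=0):
    """Integrate dx/dt = velocity(x, t, cond) from t=0 to t=1 with the
    midpoint (RK2) method. `early_skip` drops the first few steps, an
    early-skip schedule that trades a little accuracy for speed."""
    ts = np.linspace(0.0, 1.0, n_steps + 1)
    x = x0
    for i in range(early_skip, n_steps):
        t, dt = ts[i], ts[i + 1] - ts[i]
        x_mid = x + 0.5 * dt * velocity(x, t, cond)          # half step
        x = x + dt * velocity(x_mid, t + 0.5 * dt, cond)     # full step
    return x

# Toy linear velocity field pulling x toward the conditioning target;
# in the actual system this would be the causal U-Net conditioned on
# the mono input and spatial coordinates.
v = lambda x, t, c: c - x
out = midpoint_solve(v, x0=np.zeros(3), cond=np.ones(3))
```

For this toy field the exact solution at t=1 is 1 - e^{-1} ≈ 0.632 per coordinate, and eight midpoint steps land within about 1e-3 of it, which is why a second-order solver can keep the step count (and thus latency) low.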