AI Summary
This work introduces the first zero-shot monaural-to-binaural speech synthesis method: it requires no binaural audio for training, only a monaural recording and the source's position. Methodologically, it initializes the binaural estimate with parameter-free geometric time-domain warping and source-driven amplitude scaling, then refines it by iteratively applying a pretrained denoising vocoder, which yields generalization across room conditions. The core contribution is eliminating the need for binaural supervision by combining geometric priors with zero-shot transfer from a pretrained generative audio model. In subjective listening tests on the standard mono-to-binaural benchmark the method is perceptually on par with supervised approaches, and on the newly constructed out-of-distribution TUT Mono-to-Binaural dataset it surpasses them on both objective metrics and human ratings.
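To make the initialization stage concrete, here is a minimal sketch in Python of geometric time warping plus amplitude scaling, assuming a simple two-ear geometry with hypothetical constants (head radius, speed of sound) and linear-interpolation fractional delays; the paper's exact formulation may differ.

```python
import numpy as np

SPEED_OF_SOUND = 343.0   # m/s (assumed, ~room temperature)
HEAD_RADIUS = 0.0875     # m; half a typical ~17.5 cm interaural distance (assumed)

def geometric_init(mono, sr, azimuth_rad, distance_m):
    """Parameter-free initial binauralization: geometric time warping
    plus source-driven amplitude scaling (a sketch, not the paper's code)."""
    # Source position in head-centred coordinates: x forward, y to the left.
    src = distance_m * np.array([np.cos(azimuth_rad), np.sin(azimuth_rad)])
    ears = np.array([[0.0, +HEAD_RADIUS],   # left ear
                     [0.0, -HEAD_RADIUS]])  # right ear
    t = np.arange(len(mono)) / sr
    out = np.zeros((2, len(mono)))
    for ch, ear in enumerate(ears):
        d = np.linalg.norm(src - ear)        # ear-to-source distance (m)
        delay = d / SPEED_OF_SOUND           # propagation delay (s)
        # Time warping: evaluate the mono signal at t - delay
        # (linear interpolation gives a simple fractional delay per ear).
        out[ch] = np.interp(t - delay, t, mono, left=0.0, right=0.0)
        # Amplitude scaling: free-field inverse-distance attenuation.
        out[ch] /= max(d, 1e-3)
    return out
```

Every quantity here is derived from the source position and listener geometry alone, which is why no binaural training data enters this stage.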
Abstract
We present ZeroBAS, a neural method to synthesize binaural audio from monaural audio recordings and positional information without training on any binaural data. To our knowledge, this is the first published zero-shot neural approach to mono-to-binaural audio synthesis. Specifically, we show that a parameter-free geometric time warping and amplitude scaling based on source location suffices to obtain an initial binaural synthesis that can be refined by iteratively applying a pretrained denoising vocoder. Furthermore, we find this leads to generalization across room conditions, which we measure by introducing a new dataset, TUT Mono-to-Binaural, to evaluate state-of-the-art monaural-to-binaural synthesis methods on unseen conditions. Our zero-shot method is perceptually on par with supervised methods on the standard mono-to-binaural dataset, and even surpasses them on our out-of-distribution TUT Mono-to-Binaural dataset. Our results highlight the potential of pretrained generative audio models and zero-shot learning to unlock robust binaural audio synthesis.
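One plausible reading of "iteratively applying a pretrained denoising vocoder" is sketched below; `vocoder_step` and `mel_fn` are hypothetical stand-ins for one denoising pass of the pretrained vocoder and its mel-spectrogram conditioning (neither name is from the paper).

```python
import numpy as np

def refine(binaural_init, vocoder_step, mel_fn, n_iters=3):
    """Iteratively apply a pretrained denoising vocoder to the geometric
    binaural estimate, channel by channel (a sketch under assumed interfaces:
    vocoder_step(wave, mel) -> wave, mel_fn(wave) -> mel)."""
    wave = np.copy(binaural_init)
    for _ in range(n_iters):
        for ch in range(wave.shape[0]):
            # Condition on the current estimate's own mel features and let the
            # vocoder pull the warped waveform toward natural-sounding speech.
            wave[ch] = vocoder_step(wave[ch], mel_fn(wave[ch]))
    return wave
```

In this reading, the vocoder acts as a zero-shot prior over natural speech: each pass removes warping and scaling artifacts while the geometric initialization preserves the interaural time and level cues.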