Zero-Shot Mono-to-Binaural Speech Synthesis

πŸ“… 2024-12-11
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work introduces the first zero-shot monaural-to-binaural speech synthesis method: it requires no binaural audio for training, only monaural input and source position information. Methodologically, it applies parameter-free geometric time-domain warping and source-driven amplitude scaling to produce an initial binaural estimate, then iteratively refines it with a pretrained denoising (diffusion) vocoder to achieve cross-room generalization. The core contribution is eliminating reliance on binaural supervision by combining geometric priors with zero-shot transfer to model spatial auditory cues. In subjective listening tests on a standard benchmark, the method matches supervised baselines, and on the newly constructed out-of-distribution TUT Mono-to-Binaural dataset it surpasses existing supervised models on both objective metrics and human assessments.

πŸ“ Abstract
We present ZeroBAS, a neural method to synthesize binaural audio from monaural audio recordings and positional information without training on any binaural data. To our knowledge, this is the first published zero-shot neural approach to mono-to-binaural audio synthesis. Specifically, we show that a parameter-free geometric time warping and amplitude scaling based on source location suffices to get an initial binaural synthesis that can be refined by iteratively applying a pretrained denoising vocoder. Furthermore, we find this leads to generalization across room conditions, which we measure by introducing a new dataset, TUT Mono-to-Binaural, to evaluate state-of-the-art monaural-to-binaural synthesis methods on unseen conditions. Our zero-shot method is perceptually on-par with the performance of supervised methods on the standard mono-to-binaural dataset, and even surpasses them on our out-of-distribution TUT Mono-to-Binaural dataset. Our results highlight the potential of pretrained generative audio models and zero-shot learning to unlock robust binaural audio synthesis.
Problem

Research questions and friction points this paper is trying to address.

Synthesizing binaural audio from monaural recordings without binaural training data
Using geometric time warping and amplitude scaling for initial binaural synthesis
Generalizing across room conditions with a zero-shot neural approach
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-shot neural binaural synthesis without binaural training data
Parameter-free geometric time warping and amplitude scaling
Iterative refinement with a pretrained denoising vocoder
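To make the geometric initialization concrete, below is a minimal sketch of per-ear time warping and amplitude scaling. This is not the authors' implementation: the function name, the listener/source coordinates, the inverse-distance gain model, and the integer-sample delay are all illustrative assumptions; the only inputs the method needs are the mono waveform and the source position relative to the two ears.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, at room temperature

def geometric_warp_and_scale(mono, sr, source_pos, left_ear, right_ear):
    """Hypothetical sketch of geometric time warping + amplitude scaling.

    For each ear, the mono signal is delayed by the source-to-ear
    propagation time and attenuated by inverse distance, approximating
    interaural time and level differences (ITD/ILD).
    """
    channels = []
    for ear in (left_ear, right_ear):
        dist = np.linalg.norm(np.asarray(source_pos, float) - np.asarray(ear, float))
        delay_samples = int(round(dist / SPEED_OF_SOUND * sr))  # propagation delay
        gain = 1.0 / max(dist, 1e-3)  # assumed inverse-distance attenuation
        chan = np.zeros_like(mono)
        if delay_samples < len(mono):
            chan[delay_samples:] = mono[: len(mono) - delay_samples] * gain
        channels.append(chan)
    return np.stack(channels)  # shape (2, n_samples): left, right
```

In the paper's pipeline, a rough two-channel estimate like this would then be passed through a pretrained denoising vocoder for iterative refinement; the sketch covers only the parameter-free initialization step.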
Alon Levkovitch (Google Research)
Julian Salazar (Google DeepMind)
Soroosh Mariooryad (Google DeepMind)
R. Skerry-Ryan (Google DeepMind)
Nadav Bar (Google Research)
Bastiaan Kleijn (Google Research)
Eliya Nachmani (Ben-Gurion University; Google Research)