TTMBA: Towards Text To Multiple Sources Binaural Audio Generation

📅 2025-07-22
🤖 AI Summary
Existing text-to-audio (TTA) methods predominantly generate monaural audio, which lacks spatial cues and therefore cannot support immersive listening. This work proposes a cascaded framework for text-to-multisource binaural audio generation (TTMBA) with both temporal and spatial control. The pipeline has four steps: (1) a pretrained large language model (LLM) parses the text into a structured list of sound events, each with timing and spatial details; (2) a pretrained monaural audio generator produces a waveform of the required duration for each event; (3) a neural binaural rendering module converts each mono waveform into binaural audio using the spatial details from the LLM; and (4) the binaural clips are arranged by start time to form the multisource output. Experiments demonstrate significant improvements over state-of-the-art TTA methods in both audio fidelity (STOI, PESQ) and spatial perception accuracy (azimuth estimation error). To our knowledge, this is the first approach to unify fine-grained temporal control and 3D spatial control for text-to-multisource binaural audio synthesis.

📝 Abstract
Most existing text-to-audio (TTA) generation methods produce mono outputs, neglecting essential spatial information for immersive auditory experiences. To address this issue, we propose a cascaded method for text-to-multisource binaural audio generation (TTMBA) with both temporal and spatial control. First, a pretrained large language model (LLM) segments the text into a structured format with time and spatial details for each sound event. Next, a pretrained mono audio generation network creates multiple mono audios with varying durations for each event. These mono audios are transformed into binaural audios using a binaural rendering neural network based on spatial data from the LLM. Finally, the binaural audios are arranged by their start times, resulting in multisource binaural audio. Experimental results demonstrate the superiority of the proposed method in terms of both audio generation quality and spatial perceptual accuracy.
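The final step of the abstract, arranging binaural clips by their start times, can be illustrated with a minimal sketch. It assumes each event has already been rendered to a list of (left, right) sample pairs; the `RenderedEvent` type and `arrange` function are hypothetical names introduced here for illustration and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple

Stereo = List[Tuple[float, float]]  # one (left, right) pair per sample


@dataclass
class RenderedEvent:
    """Hypothetical container for one already-rendered binaural clip."""
    start_s: float   # onset within the final scene, in seconds
    samples: Stereo  # binaural samples for this event


def arrange(events: List[RenderedEvent], sample_rate: int) -> Stereo:
    """Overlay each clip onto a silent timeline at its start time."""
    # The timeline must be long enough to hold the clip that ends last.
    total = max(
        (round(e.start_s * sample_rate) + len(e.samples) for e in events),
        default=0,
    )
    left = [0.0] * total
    right = [0.0] * total
    for e in events:
        offset = round(e.start_s * sample_rate)
        for i, (l, r) in enumerate(e.samples):
            # Overlapping events are summed sample by sample.
            left[offset + i] += l
            right[offset + i] += r
    return list(zip(left, right))
```

Summing overlapping clips is the simplest possible choice; a practical system would probably also normalise or limit the mix to avoid clipping, which this sketch does not attempt.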
Problem

Research questions and friction points this paper is trying to address.

How to generate binaural audio from text with explicit spatial control
How to convert generated mono audio into binaural audio using spatial data
How to keep multiple sound events temporally accurate for an immersive listening experience
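To make the mono-to-binaural step concrete, here is a deliberately simple stand-in. The paper uses a binaural rendering neural network; the sketch below is not that network. It swaps in two classical cues instead: an interaural time difference from the Woodworth spherical-head formula and a constant-power level difference. The function name `pan_mono` and the default head radius are assumptions made for illustration.

```python
import math
from typing import List, Tuple


def pan_mono(
    mono: List[float],
    azimuth_deg: float,
    sample_rate: int,
    head_radius_m: float = 0.0875,   # assumed average head radius
    speed_of_sound: float = 343.0,   # metres per second in air
) -> List[Tuple[float, float]]:
    """Crude binaural stand-in. Azimuth: -90 (left), 0 (front), +90 (right)."""
    theta = math.radians(max(-90.0, min(90.0, azimuth_deg)))

    # Woodworth model: ITD = (r / c) * (theta + sin(theta)) for a spherical head.
    itd_s = head_radius_m / speed_of_sound * (abs(theta) + math.sin(abs(theta)))
    delay = round(itd_s * sample_rate)

    # Constant-power panning maps azimuth to an angle in [0, pi/2].
    pan = (theta + math.pi / 2) / 2
    gain_l, gain_r = math.cos(pan), math.sin(pan)

    # The ear facing the source hears it first; the far ear is delayed.
    near = mono + [0.0] * delay
    far = [0.0] * delay + mono
    left_src, right_src = (far, near) if theta > 0 else (near, far)
    return [(gain_l * l, gain_r * r) for l, r in zip(left_src, right_src)]
```

A learned renderer can capture spectral and room effects that these two cues miss, which is presumably why the authors use a neural module for this stage.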
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM segments text with spatial details
Mono audio network generates event audios
Binaural rendering creates spatial audio
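The structured format produced by the LLM is not specified in this summary. As a sketch only, assume the LLM returns a JSON array with one object per sound event; the field names (`description`, `start_s`, `duration_s`, `azimuth_deg`) are hypothetical. Parsing and validating that output before audio generation might look like this:

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class SoundEvent:
    """Hypothetical schema for one event extracted from the text prompt."""
    description: str    # text passed on to the mono audio generator
    start_s: float      # onset time in seconds
    duration_s: float   # clip length in seconds
    azimuth_deg: float  # direction passed on to the binaural renderer


def parse_events(llm_output: str) -> List[SoundEvent]:
    """Parse the assumed JSON array and return events sorted by start time."""
    events = []
    for item in json.loads(llm_output):
        event = SoundEvent(
            description=str(item["description"]),
            start_s=float(item["start_s"]),
            duration_s=float(item["duration_s"]),
            azimuth_deg=float(item["azimuth_deg"]),
        )
        # Reject timings that no later stage could render.
        if event.start_s < 0 or event.duration_s <= 0:
            raise ValueError(f"invalid timing for event: {event.description!r}")
        events.append(event)
    return sorted(events, key=lambda e: e.start_s)
```

Validating at this boundary keeps malformed LLM output from reaching the generation and rendering stages, where errors would be harder to trace.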