🤖 AI Summary
Existing text-to-audio (TTA) methods predominantly generate monaural audio, lacking spatial cues and thus failing to support immersive auditory experiences. This work proposes the first end-to-end cascaded framework for binaural audio generation, enabling text-driven synthesis of multi-source audio with spatiotemporal controllability. The method comprises three stages: (1) structured text parsing to extract event-level spatiotemporal semantics; (2) a pre-trained monaural audio generator producing temporally aligned, source-specific waveforms; and (3) joint modeling via large language model (LLM)-guided prompting and a neural binaural rendering module for precise sound-source localization and binaural signal synthesis. Experiments demonstrate significant improvements over state-of-the-art TTA methods in both audio fidelity (STOI, PESQ) and spatial perception accuracy (azimuth estimation error). To our knowledge, this is the first approach to unify fine-grained temporal evolution and 3D spatial control for text-to-multi-source binaural audio synthesis.
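To make the parsing stage concrete, here is a minimal sketch of the kind of event-level structure stage (1) could emit. The JSON schema and field names (`caption`, `start_s`, `duration_s`, `azimuth_deg`, `elevation_deg`) are illustrative assumptions, not the paper's actual interface.

```python
import json
from dataclasses import dataclass

@dataclass
class SoundEvent:
    """One parsed sound event; all field names are assumptions for illustration."""
    caption: str          # event description passed to the mono generator
    start_s: float        # onset within the full scene, in seconds
    duration_s: float     # event duration, in seconds
    azimuth_deg: float    # horizontal direction of the source (0 = front)
    elevation_deg: float  # vertical direction of the source (0 = ear level)

def parse_llm_output(reply: str) -> list[SoundEvent]:
    """Turn a structured (here: JSON) LLM reply into typed events."""
    return [SoundEvent(**item) for item in json.loads(reply)]

# Example of a structured reply the LLM stage might return:
reply = '''[
  {"caption": "a dog barking", "start_s": 0.0, "duration_s": 3.0,
   "azimuth_deg": -45.0, "elevation_deg": 0.0},
  {"caption": "footsteps on gravel", "start_s": 1.5, "duration_s": 4.0,
   "azimuth_deg": 60.0, "elevation_deg": 0.0}
]'''
events = parse_llm_output(reply)
```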
📄 Abstract
Most existing text-to-audio (TTA) generation methods produce mono outputs, neglecting the spatial information essential for immersive auditory experiences. To address this issue, we propose a cascaded method for text-to-multisource binaural audio generation (TTMBA) with both temporal and spatial control. First, a pretrained large language model (LLM) segments the input text into a structured format specifying the timing and spatial details of each sound event. Next, a pretrained mono audio generation network creates a mono audio clip of the corresponding duration for each event. Each mono clip is then transformed into binaural audio by a neural binaural rendering network, conditioned on the spatial details extracted by the LLM. Finally, the binaural clips are arranged according to their start times, yielding the multisource binaural audio. Experimental results demonstrate the superiority of the proposed method in terms of both audio generation quality and spatial perceptual accuracy.
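The sketch below illustrates the rendering-and-assembly half of the cascade: each mono clip is spatialized and then overlap-added into the output buffer at its start time. The paper uses a trained neural binaural renderer; the simple interaural time/level difference (ITD/ILD) panner here is only a stand-in to show the data flow, and the sample rate is an assumed constant.

```python
import numpy as np

SR = 16000  # sample rate in Hz; an assumption, not specified in the abstract

def render_binaural(mono: np.ndarray, azimuth_deg: float) -> np.ndarray:
    """Crude ITD/ILD spatialization standing in for the neural renderer."""
    az = np.deg2rad(azimuth_deg)
    itd_s = 0.00066 * np.sin(az)                # rough max ~0.66 ms interaural delay
    shift = int(round(abs(itd_s) * SR))
    gain_l = np.sqrt(0.5 * (1.0 - np.sin(az)))  # constant-power level panning
    gain_r = np.sqrt(0.5 * (1.0 + np.sin(az)))
    left, right = gain_l * mono, gain_r * mono
    if itd_s > 0:    # source to the right: delay the left ear
        left = np.concatenate([np.zeros(shift), left])[: mono.shape[0]]
    elif itd_s < 0:  # source to the left: delay the right ear
        right = np.concatenate([np.zeros(shift), right])[: mono.shape[0]]
    return np.stack([left, right])              # shape (2, T)

def assemble_scene(clips: list[tuple[float, np.ndarray]], total_s: float) -> np.ndarray:
    """Overlap-add (start_time_s, binaural_clip) pairs into one binaural scene."""
    out = np.zeros((2, int(total_s * SR)))
    for start_s, clip in clips:
        i = int(start_s * SR)
        j = min(i + clip.shape[1], out.shape[1])
        out[:, i:j] += clip[:, : j - i]
    return out

# Usage: spatialize each event's generated mono waveform, then place it on the timeline.
# clips = [(ev.start_s, render_binaural(mono_wave, ev.azimuth_deg)) for ev, mono_wave in ...]
```

In the actual system, `render_binaural` would be replaced by the learned binaural rendering network, which would also condition on elevation rather than azimuth alone.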