🤖 AI Summary
This work addresses the limitation of existing text-to-audio models in modeling the 3D motion of sound sources. We propose the first end-to-end framework for controllable generation of dynamic spatial audio. Methodologically: (1) we construct a synthetic dataset pairing natural language descriptions with physically plausible 3D sound-source trajectories; (2) we design a text-to-trajectory prediction module that explicitly models spatial motion semantics; and (3) we fine-tune a pre-trained text-to-audio model, integrating binaural rendering and temporal alignment to synthesize dynamic binaural audio. Experiments show that the text-to-trajectory model achieves reasonable understanding of spatial linguistic cues and produces physically plausible trajectories. The approach integrates readily into existing text-to-audio pipelines and can be extended to other spatial audio formats, including ambisonics and object-based representations, enhancing the spatiotemporal expressiveness of auditory content generation.
📝 Abstract
Human auditory perception is shaped by moving sound sources in 3D space, yet prior work in generative sound modelling has largely been restricted to mono signals or static spatial audio. In this work, we introduce a framework for controllable generation of moving sounds from text prompts. To enable training, we construct a synthetic dataset that records moving sounds in binaural format, their spatial trajectories, and text captions describing both the sound event and its spatial motion. Using this dataset, we train a text-to-trajectory prediction model that outputs the three-dimensional trajectory of a moving sound source given a text prompt. To generate spatial audio, we first fine-tune a pre-trained text-to-audio generative model to output mono sound temporally aligned with the trajectory. The spatial audio is then simulated from the predicted, temporally aligned trajectory. Experimental evaluation demonstrates that the text-to-trajectory model achieves reasonable spatial understanding. The approach can be easily integrated into existing text-to-audio generative workflows and extended to moving-sound generation in other spatial audio formats.
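The last stage of the pipeline, simulating binaural audio from a predicted trajectory, can be sketched in a few lines. This is a minimal illustration rather than the paper's renderer: it applies a per-sample interaural time difference (Woodworth's spherical-head approximation) and a simple sine-law interaural level difference as a function of azimuth, with no HRTF, elevation, or distance modelling. The function name `binaural_from_trajectory` and all parameter defaults are assumptions for this sketch.

```python
import numpy as np

def binaural_from_trajectory(mono, azimuths, sr=16000,
                             head_radius=0.0875, c=343.0):
    """Render a mono signal to binaural stereo given a per-sample azimuth
    (radians; 0 = front, positive = source to the listener's right).

    Crude free-field sketch: per-sample ITD via Woodworth's approximation
    plus a sine-law ILD. Not HRTF-based.
    """
    mono = np.asarray(mono, dtype=np.float64)
    azimuths = np.asarray(azimuths, dtype=np.float64)
    n = len(mono)
    t = np.arange(n) / sr
    # Woodworth ITD approximation: itd = (r / c) * (az + sin(az)).
    itd = (head_radius / c) * (azimuths + np.sin(azimuths))
    # Positive azimuth: sound reaches the right ear first, so delay the
    # left ear by +itd/2 and advance the right ear by -itd/2.
    left = np.interp(t - itd / 2, t, mono, left=0.0, right=0.0)
    right = np.interp(t + itd / 2, t, mono, left=0.0, right=0.0)
    # Simple energy-preserving ILD: attenuate the far ear as the source
    # moves laterally.
    pan = np.clip(np.sin(azimuths), -1.0, 1.0)
    left *= np.sqrt(0.5 * (1.0 - pan))
    right *= np.sqrt(0.5 * (1.0 + pan))
    return np.stack([left, right], axis=-1)
```

A richer renderer would convolve with measured HRTFs and add distance attenuation, but the same interface (mono signal in, per-sample trajectory in, two channels out) carries over.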