MOSPA: Human Motion Generation Driven by Spatial Audio

📅 2025-07-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a critical limitation of audio-driven human motion generation: the neglect of spatial source characteristics. We propose the first dedicated framework for spatial-audio-driven motion generation, accompanied by a high-quality, purpose-built dataset. Methodologically, we design a diffusion-based multimodal generative architecture that introduces novel spatial audio feature extraction and spatiotemporal-aware cross-modal fusion, explicitly modeling the mapping from sound source azimuth, distance, and dynamics to full-body motion. Our dataset is the first benchmark to jointly capture multi-source spatial configurations, trajectories, and high-fidelity motion capture. Experiments on this dataset demonstrate state-of-the-art performance: generated motions are natural and diverse, directional responses are highly accurate, and the model generalizes well to unseen source configurations and environments, advancing the integration of auditory perception and embodied motion synthesis.

📝 Abstract
Enabling virtual humans to dynamically and realistically respond to diverse auditory stimuli remains a key challenge in character animation, demanding the integration of perceptual modeling and motion synthesis. Despite its significance, this task remains largely unexplored. Most previous works have primarily focused on mapping modalities like speech, audio, and music to generate human motion. To date, these models have typically overlooked the impact of the spatial features encoded in spatial audio signals on human motion. To bridge this gap and enable high-quality modeling of human movements in response to spatial audio, we introduce the first comprehensive Spatial Audio-Driven Human Motion (SAM) dataset, which contains diverse and high-quality spatial audio and motion data. For benchmarking, we develop a simple yet effective diffusion-based generative framework for human MOtion generation driven by SPatial Audio, termed MOSPA, which faithfully captures the relationship between body motion and spatial audio through an effective fusion mechanism. Once trained, MOSPA can generate diverse, realistic human motions conditioned on varying spatial audio inputs. We perform a thorough investigation of the proposed dataset and conduct extensive experiments for benchmarking, where our method achieves state-of-the-art performance on this task. Our model and dataset will be open-sourced upon acceptance. Please refer to our supplementary video for more details.
Problem

Research questions and friction points this paper is trying to address.

Enabling virtual humans to respond realistically to spatial audio
Lack of models utilizing spatial audio features for motion generation
Creating a dataset and model for spatial audio-driven human motion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Develops the first spatial audio-motion dataset
Uses a diffusion-based generative framework
Effectively fuses spatial audio and motion features
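The paper itself only describes the pipeline at a high level; as a rough intuition for how a spatial-audio-conditioned reverse-diffusion step might be wired, here is a minimal NumPy sketch. All names (`spatial_audio_feature`, `fuse`, `denoise_step`), dimensions, and the linear/tanh denoiser are illustrative assumptions, not the authors' MOSPA implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def spatial_audio_feature(azimuth_rad, distance_m, energy):
    """Encode a source's azimuth, distance, and energy as a feature vector
    (a stand-in for the paper's spatial audio feature extraction)."""
    return np.array([np.cos(azimuth_rad), np.sin(azimuth_rad),
                     1.0 / (1.0 + distance_m), energy])

def fuse(motion_latent, audio_feat, W_fuse):
    """Cross-modal fusion: project the concatenated motion + audio features."""
    return W_fuse @ np.concatenate([motion_latent, audio_feat])

def denoise_step(x_t, audio_feat, W_fuse, W_out, alpha=0.99):
    """One reverse-diffusion step: predict noise from the fused condition
    and remove a fraction of it (toy DDPM-style update)."""
    h = np.tanh(fuse(x_t, audio_feat, W_fuse))
    eps_pred = W_out @ h
    return (x_t - (1.0 - alpha) * eps_pred) / np.sqrt(alpha)

# Illustrative dimensions: 16-D motion latent, 4-D audio feature, 32-D hidden.
D_MOTION, D_AUDIO, D_HIDDEN = 16, 4, 32
W_fuse = rng.normal(scale=0.1, size=(D_HIDDEN, D_MOTION + D_AUDIO))
W_out = rng.normal(scale=0.1, size=(D_MOTION, D_HIDDEN))

x = rng.normal(size=D_MOTION)                  # noisy motion latent
feat = spatial_audio_feature(np.pi / 4, 2.0, 0.8)  # source at 45°, 2 m away
for _ in range(10):                            # iterate reverse steps
    x = denoise_step(x, feat, W_fuse, W_out)
print(x.shape)  # (16,)
```

In the actual model, the denoiser would be a learned transformer-style network and the audio features far richer, but the core idea shown here matches the summary: source azimuth and distance enter the denoiser as a conditioning signal fused with the motion representation at every step.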