Unison: Harmonizing Motion, Speech, and Sound for Human-Centric Audio-Video Generation

📅 2026-05-09
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the cross-modal inconsistency that arises because human motion, speech, and environmental sound in video have heterogeneous temporal characteristics. The authors propose Unison, a unified framework for generating these modalities coherently. Within the audio stream, Unison decouples the generation of speech and sound effects under semantic guidance, then recombines them through bidirectional audio cross-attention and semantic-conditioned gating, which mitigates speech dominance and improves acoustic clarity. For audio-motion synchronization, it applies bidirectional cross-modal forcing: the two modalities follow decoupled denoising schedules so that the cleaner one guides the noisier one, reinforced by a progressive stabilization strategy. The authors report state-of-the-art results in both audio perceptual quality and cross-modal synchronization.
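To make the idea of semantic-conditioned gating concrete, here is a minimal sketch and not the authors' implementation. It assumes the gate is a single scalar produced by a sigmoid over a linear projection of a semantic embedding, and that recomposition is a convex blend of speech and sound-effect features. The names `gated_recompose`, `w`, and `b` are hypothetical and do not come from the paper.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_recompose(speech, sfx, semantic, w, b=0.0):
    """Blend speech and sound-effect features with a semantic-conditioned gate.

    speech, sfx: equal-length lists of floats (one feature frame each).
    semantic:    list of floats, a semantic embedding (assumed input).
    w, b:        hypothetical projection weights and bias for the gate.
    Returns (mixed_features, gate_value).
    """
    # Gate in (0, 1): how much weight the speech branch receives.
    g = sigmoid(sum(wi * si for wi, si in zip(w, semantic)) + b)
    # Convex combination, so neither branch can fully drown out the other
    # unless the gate saturates.
    mixed = [g * s + (1.0 - g) * e for s, e in zip(speech, sfx)]
    return mixed, g
```

With a zero semantic embedding the gate sits at 0.5 and the two branches are averaged; a gate that shifts toward 0 for sound-heavy prompts would be one way to counter the speech dominance the summary mentions.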
📝 Abstract
Motion, speech, and sound effects are fundamental elements of human-centric videos, yet their heterogeneous temporal characteristics make joint generation highly challenging. Existing audio-video generation models often fail to maintain consistent alignment across these modalities, leading to noticeable mismatches between motion, speech, and environmental sounds. We present Unison, a unified framework that explicitly promotes coherence across the motion, speech, and sound modalities. Within the audio stream, Unison employs a semantic-guided harmonization strategy that decouples the generation of speech and sound-effect components. Leveraging bidirectional audio cross-attention and semantic-conditioned gating for semantic-driven adaptive recomposition, this approach effectively mitigates speech dominance and enhances acoustic clarity. For audio-motion synchronization, we propose a bidirectional cross-modal forcing strategy where the cleaner modality guides the noisier one through decoupled denoising schedules, reinforced by a progressive stabilization strategy. Extensive experiments demonstrate that Unison achieves state-of-the-art performance in both audio perceptual quality and cross-modal synchronization, highlighting the importance of explicit multimodal harmonization in human-centric video generation.
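The decoupled denoising schedules described in the abstract can be illustrated with a small sketch. It rests on assumptions that are not in the paper: noise levels are normalized to [0, 1] with 1 meaning pure noise, the base schedule is linear, and the leading modality runs ahead by a fixed offset `lag`. The function names `decoupled_timesteps` and `guide_direction` are hypothetical.

```python
def decoupled_timesteps(num_steps, lead, lag=0.25):
    """Return a list of (t_audio, t_video) noise levels, one pair per step.

    lead: "audio" or "video", the modality that is denoised ahead of the other.
    lag:  assumed fixed gap between the two schedules.
    """
    if lead not in ("audio", "video"):
        raise ValueError("lead must be 'audio' or 'video'")
    schedule = []
    for i in range(num_steps + 1):
        t_noisy = 1.0 - i / num_steps       # base linear schedule, 1 -> 0
        t_clean = max(0.0, t_noisy - lag)   # leading modality is cleaner
        if lead == "audio":
            schedule.append((t_clean, t_noisy))
        else:
            schedule.append((t_noisy, t_clean))
    return schedule


def guide_direction(t_audio, t_video):
    """Report which modality guides the other: the cleaner guides the noisier."""
    if t_audio < t_video:
        return "audio->video"
    if t_video < t_audio:
        return "video->audio"
    return "none"
```

Calling `decoupled_timesteps` once with `lead="audio"` and once with `lead="video"` yields both guidance directions, which is one plausible reading of "bidirectional" forcing. Both schedules reach zero noise at the final step, where neither modality leads.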
Problem

Research questions and friction points this paper is trying to address.

audio-video generation
cross-modal synchronization
motion-speech-sound alignment
multimodal harmonization
human-centric video
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal harmonization
semantic-guided audio generation
cross-modal synchronization
bidirectional cross-attention
decoupled denoising