Syncphony: Synchronized Audio-to-Video Generation with Diffusion Transformers

📅 2025-09-26
🤖 AI Summary
Existing audio-to-video (A2V) generation methods struggle to achieve fine-grained temporal alignment between audio and visual sequences. To address this, the paper proposes Syncphony, a synchronization framework built on a pre-trained video diffusion backbone with two core components: (1) a Motion-aware Loss that emphasizes learning in high-motion regions, and (2) Audio Sync Guidance, which steers the full audio-conditioned model using a visually aligned off-sync model (one without audio layers) to better exploit audio cues at inference while preserving visual quality. To evaluate synchronization, the paper also introduces CycleSync, a video-to-audio reconstruction-based metric that scores how well the original audio can be recovered from motion cues in the generated video. Evaluated on the AVSync15 and The Greatest Hits datasets, the method generates 24 fps, 380×640 videos and reports state-of-the-art results: a 24.7% relative improvement in synchronization accuracy (Δt < 0.04 s) and an 18.3% reduction in FID, indicating stronger audiovisual coherence and visual fidelity.

📝 Abstract
Text-to-video and image-to-video generation have made rapid progress in visual quality, but they remain limited in controlling the precise timing of motion. In contrast, audio provides temporal cues aligned with video motion, making it a promising condition for temporally controlled video generation. However, existing audio-to-video (A2V) models struggle with fine-grained synchronization due to indirect conditioning mechanisms or limited temporal modeling capacity. We present Syncphony, which generates 380x640 resolution, 24fps videos synchronized with diverse audio inputs. Our approach builds upon a pre-trained video backbone and incorporates two key components to improve synchronization: (1) Motion-aware Loss, which emphasizes learning at high-motion regions; (2) Audio Sync Guidance, which guides the full model using a visually aligned off-sync model without audio layers to better exploit audio cues at inference while maintaining visual quality. To evaluate synchronization, we propose CycleSync, a video-to-audio-based metric that measures the amount of motion cues in the generated video to reconstruct the original audio. Experiments on AVSync15 and The Greatest Hits datasets demonstrate that Syncphony outperforms existing methods in both synchronization accuracy and visual quality. Project page is available at: https://jibin86.github.io/syncphony_project_page
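The abstract describes the Motion-aware Loss only as "emphasizing learning at high-motion regions." One common way to realize this idea is to weight a per-pixel reconstruction loss by frame-difference magnitude; the sketch below is a minimal numpy illustration of that pattern, not the paper's actual formulation (the function names and the normalization scheme are assumptions).

```python
import numpy as np

def motion_weights(frames, eps=1e-6):
    """Per-pixel motion weights from adjacent-frame differences.

    Hypothetical stand-in for the paper's motion map: pixels that change
    more between frames get larger weights. frames: (T, H, W) grayscale.
    """
    diff = np.abs(np.diff(frames, axis=0))   # (T-1, H, W) motion magnitude
    return diff / (diff.mean() + eps)        # normalize so the mean weight is ~1

def motion_aware_loss(pred, target, weights):
    """Weighted MSE: high-motion pixels contribute more to the loss."""
    return float(np.mean(weights * (pred - target) ** 2))
```

With uniform weights this reduces to ordinary MSE; the motion map simply reallocates the loss budget toward dynamic regions, which is the stated intent of the component.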
Problem

Research questions and friction points this paper is trying to address.

Generating videos synchronized with audio timing cues
Improving fine-grained audio-video synchronization in generation
Enhancing temporal control in audio-to-video generation models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Builds on a pre-trained video diffusion backbone for generation
Employs a Motion-aware Loss that emphasizes learning in high-motion regions
Implements Audio Sync Guidance using a visually aligned off-sync model without audio layers
Proposes CycleSync, a video-to-audio-based synchronization metric
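Audio Sync Guidance, as described in the abstract, guides the full model with a visually aligned off-sync model that lacks audio layers. The paper does not spell out the combination rule; the sketch below assumes a classifier-free-guidance-style extrapolation between the two models' denoising predictions, which is a common way such guidance is implemented (the function name and `scale` parameter are illustrative assumptions).

```python
import numpy as np

def audio_sync_guidance(eps_full, eps_offsync, scale=2.0):
    """CFG-style guidance (assumed form, not the paper's exact rule).

    eps_full:    denoising prediction from the audio-conditioned model
    eps_offsync: prediction from the off-sync model (no audio layers)
    scale:       guidance strength; 1.0 recovers the full model's output
    """
    # Extrapolate from the off-sync prediction toward the audio-conditioned
    # one, amplifying the direction contributed by the audio cues.
    return eps_offsync + scale * (eps_full - eps_offsync)
```

Because the off-sync model is visually aligned with the full model, the difference term isolates audio-driven motion, so increasing `scale` sharpens synchronization while the shared visual prior preserves quality.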