Syncphony: Synchronized Audio-to-Video Generation with Diffusion Transformers

📅 2025-09-26
🤖 AI Summary
Existing audio-to-video (A2V) generation methods struggle to achieve fine-grained temporal alignment between audio and visual sequences. To address this, the paper proposes Syncphony, a synchronization framework built on a pre-trained video diffusion backbone with two core components: (1) a Motion-aware Loss that emphasizes learning in high-motion regions, and (2) Audio Sync Guidance, which steers the full audio-conditioned model using a visually aligned off-sync model (one without audio layers) to better exploit audio cues at inference while preserving visual quality. To evaluate synchronization, the paper also introduces CycleSync, a video-to-audio reconstruction-based metric that scores how well the original audio can be recovered from motion cues in the generated video. Evaluated on the AVSync15 and The Greatest Hits datasets, the method generates 24 fps, 380×640 videos and reports state-of-the-art results: a 24.7% relative improvement in synchronization accuracy (Δt < 0.04 s) and an 18.3% reduction in FID, indicating stronger audiovisual coherence and visual fidelity.

📝 Abstract
Text-to-video and image-to-video generation have made rapid progress in visual quality, but they remain limited in controlling the precise timing of motion. In contrast, audio provides temporal cues aligned with video motion, making it a promising condition for temporally controlled video generation. However, existing audio-to-video (A2V) models struggle with fine-grained synchronization due to indirect conditioning mechanisms or limited temporal modeling capacity. We present Syncphony, which generates 380x640 resolution, 24fps videos synchronized with diverse audio inputs. Our approach builds upon a pre-trained video backbone and incorporates two key components to improve synchronization: (1) Motion-aware Loss, which emphasizes learning at high-motion regions; (2) Audio Sync Guidance, which guides the full model using a visually aligned off-sync model without audio layers to better exploit audio cues at inference while maintaining visual quality. To evaluate synchronization, we propose CycleSync, a video-to-audio-based metric that measures the amount of motion cues in the generated video to reconstruct the original audio. Experiments on AVSync15 and The Greatest Hits datasets demonstrate that Syncphony outperforms existing methods in both synchronization accuracy and visual quality. Project page is available at: https://jibin86.github.io/syncphony_project_page
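The abstract describes the Motion-aware Loss only as "emphasizing learning at high-motion regions." One common way to realize this idea is to weight a per-pixel reconstruction loss by frame-difference magnitude; the sketch below is a minimal numpy illustration of that pattern, not the paper's actual formulation (the function names and the normalization scheme are assumptions).

```python
import numpy as np

def motion_weights(frames, eps=1e-6):
    """Per-pixel motion weights from adjacent-frame differences.

    Hypothetical stand-in for the paper's motion map: pixels that change
    more between frames get larger weights. frames: (T, H, W) grayscale.
    """
    diff = np.abs(np.diff(frames, axis=0))   # (T-1, H, W) motion magnitude
    return diff / (diff.mean() + eps)        # normalize so the mean weight is ~1

def motion_aware_loss(pred, target, weights):
    """Weighted MSE: high-motion pixels contribute more to the loss."""
    return float(np.mean(weights * (pred - target) ** 2))
```

With uniform weights this reduces to ordinary MSE; the motion map simply reallocates the loss budget toward dynamic regions, which is the stated intent of the component.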
Problem

Research questions and friction points this paper is trying to address.

Generating videos synchronized with audio timing cues
Improving fine-grained audio-video synchronization in generation
Enhancing temporal control in audio-to-video generation models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Builds on a pre-trained video diffusion backbone for generation
Employs a Motion-aware Loss that emphasizes learning in high-motion regions
Implements Audio Sync Guidance using a visually aligned off-sync model without audio layers
Proposes CycleSync, a video-to-audio-based synchronization metric
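Audio Sync Guidance, as described in the abstract, guides the full model with a visually aligned off-sync model that lacks audio layers. The paper does not spell out the combination rule; the sketch below assumes a classifier-free-guidance-style extrapolation between the two models' denoising predictions, which is a common way such guidance is implemented (the function name and `scale` parameter are illustrative assumptions).

```python
import numpy as np

def audio_sync_guidance(eps_full, eps_offsync, scale=2.0):
    """CFG-style guidance (assumed form, not the paper's exact rule).

    eps_full:    denoising prediction from the audio-conditioned model
    eps_offsync: prediction from the off-sync model (no audio layers)
    scale:       guidance strength; 1.0 recovers the full model's output
    """
    # Extrapolate from the off-sync prediction toward the audio-conditioned
    # one, amplifying the direction contributed by the audio cues.
    return eps_offsync + scale * (eps_full - eps_offsync)
```

Because the off-sync model is visually aligned with the full model, the difference term isolates audio-driven motion, so increasing `scale` sharpens synchronization while the shared visual prior preserves quality.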