AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation

📅 2024-12-19

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the dual challenges of cross-modal temporal alignment and high-fidelity reconstruction in bidirectional audio-video generation. We propose a unified framework built upon frozen video and audio diffusion models, eliminating the need for external feature extractors. Our approach introduces a Fusion Block, a temporally aligned bidirectional self-attention module that directly uses the internal features of the complementary pre-trained model as cross-modal conditioning signals. Additionally, we incorporate a cross-modal diffusion distillation strategy to enhance generation consistency across modalities. The method supports both video-to-audio and audio-to-video synthesis while preserving modality-specific fidelity and improving temporal synchronization and semantic consistency. Experiments report substantial improvements over state-of-the-art methods across multiple benchmarks in synchronization accuracy, perceptual quality, and reconstruction fidelity. The work points to applications in immersive multimodal content generation.

📝 Abstract

We propose AV-Link, a unified framework for Video-to-Audio and Audio-to-Video generation that leverages the activations of frozen video and audio diffusion models for temporally-aligned cross-modal conditioning. The key to our framework is a Fusion Block that enables bidirectional information exchange between our backbone video and audio diffusion models through a temporally-aligned self-attention operation. Unlike prior work that uses feature extractors pretrained for other tasks for the conditioning signal, AV-Link can directly leverage features obtained by the complementary modality in a single framework, i.e., video features to generate audio, or audio features to generate video. We extensively evaluate our design choices and demonstrate the ability of our method to achieve synchronized and high-quality audiovisual content, showcasing its potential for applications in immersive media generation. Project Page: snap-research.github.io/AVLink/

Problem

Research questions and friction points this paper is trying to address.

Unified framework for cross-modal audio-video generation.

Bidirectional information exchange via temporally-aligned self-attention.

Improves audio-video synchronization without dedicated models.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified framework for cross-modal audio-video generation.

Fusion Block enables bidirectional information exchange.

Leverages temporally-aligned diffusion features directly.
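
The idea behind the Fusion Block can be illustrated with a minimal NumPy sketch: video and audio tokens are placed on one shared time axis, concatenated, and passed through a single self-attention operation so that each modality can attend to the other. This is only an assumption-laden illustration, not the authors' implementation. The function names (`time_embedding`, `fusion_block`), the sinusoidal time encoding, the residual connection, and the random projection weights standing in for learned parameters are all hypothetical choices made here for clarity.

```python
import numpy as np

def time_embedding(timestamps, dim):
    """Sinusoidal embedding of timestamps in seconds (illustrative choice, dim must be even)."""
    freqs = 2.0 ** np.arange(dim // 2)  # hypothetical frequency bands
    angles = timestamps[:, None] * freqs[None, :] * np.pi
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fusion_block(video_feats, audio_feats, duration, rng):
    """Joint self-attention over video and audio tokens that share one time axis."""
    n_v, dim = video_feats.shape
    n_a, _ = audio_feats.shape
    # Place each token at the centre of the time span it covers, so tokens
    # from both modalities are indexed by the same clock despite different rates.
    t_v = (np.arange(n_v) + 0.5) * duration / n_v
    t_a = (np.arange(n_a) + 0.5) * duration / n_a
    tokens = np.concatenate(
        [video_feats + time_embedding(t_v, dim),
         audio_feats + time_embedding(t_a, dim)],
        axis=0,
    )
    # Random projections stand in for learned query/key/value weights.
    w_q, w_k, w_v = (rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3))
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(dim))  # every token attends to both modalities
    fused = tokens + attn @ v               # residual connection (assumed)
    return fused[:n_v], fused[n_v:], attn

rng = np.random.default_rng(0)
video = rng.standard_normal((8, 16))   # 8 video tokens, 16 channels
audio = rng.standard_normal((32, 16))  # 32 audio tokens, 16 channels
v_out, a_out, attn = fusion_block(video, audio, duration=2.0, rng=rng)
print(v_out.shape, a_out.shape)  # (8, 16) (32, 16)
```

Because the attention matrix spans the concatenated sequence, its off-diagonal blocks carry the video-to-audio and audio-to-video exchange, which is the bidirectional behaviour the highlights describe. In a real system the input features would come from the frozen backbone activations rather than random arrays.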

🔎 Similar Papers

A Simple but Strong Baseline for Sounding Video Generation: Effective Adaptation of Audio and Video Diffusion Models for Joint Generation

2024-09-26 · arXiv.org · Citations: 4

MMDisCo: Multi-Modal Discriminator-Guided Cooperative Diffusion for Joint Audio and Video Generation

2024-05-28 · Citations: 3
