🤖 AI Summary
Existing audio-visual video generation methods rely on high-quality, category-limited annotated videos, hindering generalization to open-world audio-visual categories. Method: We propose an efficient two-stage training paradigm: (1) self-supervised pretraining on large-scale noisy web videos, incorporating video quality filtering and multi-modal conditional modeling (audio + visual features); (2) fine-tuning on a small set of precisely annotated videos, enhanced by a lightweight windowed attention mechanism that improves audio-video temporal alignment. Contribution/Results: Building on a pretrained text-to-video model and audio encoder, our approach adds audio controllability with only a 1.9% increase in trainable parameters. We also introduce AVSync48, a new benchmark comprising 48 diverse audio-visual classes. Experiments demonstrate over a 10× reduction in manual annotation effort and significantly improved zero-shot generalization to unseen categories.
📝 Abstract
Recent advances in audio-synchronized visual animation enable control of video content using audio from specific classes. However, existing methods rely heavily on expensive manual curation of high-quality, class-specific training videos, posing challenges to scaling up to diverse audio-video classes in the open world. In this work, we propose an efficient two-stage training paradigm to scale up audio-synchronized visual animation using abundant but noisy videos. In stage one, we automatically curate large-scale videos for pretraining, allowing the model to learn diverse but imperfect audio-video alignments. In stage two, we finetune the model on manually curated high-quality examples, but only at a small scale, significantly reducing the required human effort. We further enhance synchronization by allowing each frame to access rich audio context via multi-feature conditioning and window attention. To train the model efficiently, we leverage a pretrained text-to-video generator and audio encoders, introducing only 1.9% additional trainable parameters to learn audio conditioning without compromising the generator's prior knowledge. For evaluation, we introduce AVSync48, a benchmark with videos from 48 classes, which is 3× more diverse than previous benchmarks. Extensive experiments show that our method reduces reliance on manual curation by over 10×, while generalizing to many open classes.
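The window-attention idea from the abstract can be sketched as follows: each video frame attends only to audio features within a local temporal window around its own timestamp. This is a minimal NumPy illustration; the function name, window size, and the frame-to-audio index mapping are our assumptions, not the paper's actual layer design.

```python
import numpy as np

def window_audio_attention(frame_q, audio_kv, window=4):
    """Hypothetical sketch: each video-frame query attends to audio
    features within +/- `window` steps of its aligned audio position.

    frame_q:  (T, d) per-frame query vectors
    audio_kv: (A, d) audio feature sequence (used as both keys and values)
    """
    T, d = frame_q.shape
    A, _ = audio_kv.shape
    out = np.zeros_like(frame_q)
    for t in range(T):
        # Map frame index t to the nearest audio step, then slice a window.
        c = int(round(t * (A - 1) / max(T - 1, 1)))
        lo, hi = max(0, c - window), min(A, c + window + 1)
        k = audio_kv[lo:hi]                        # (W, d) local audio context
        scores = frame_q[t] @ k.T / np.sqrt(d)     # scaled dot-product scores
        w = np.exp(scores - scores.max())          # stable softmax
        w /= w.sum()
        out[t] = w @ k                             # convex mix of audio features
    return out
```

The point of the window is that a frame sees a rich slice of nearby audio context (not just one feature), while distant audio cannot leak in and blur synchronization.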