ALIVE: Animate Your World with Lifelike Audio-Video Generation

📅 2026-02-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing text-to-video models, which struggle to generate synchronized, high-quality audio-visual content and lack reference-driven animation capabilities. The authors propose a unified framework built upon an extended MMDiT architecture that supports both text-to-audiovisual generation and reference-guided animation. Key innovations are TA-CrossAttn, a cross-modal attention mechanism, and UniTemp-RoPE, a temporal positional encoding scheme, which together enable precise audio-visual temporal alignment and fusion. The study also constructs a high-quality audiovisual fine-tuning data pipeline and establishes a new evaluation benchmark. After pretraining and fine-tuning on million-scale data, the model surpasses existing open-source approaches and matches or exceeds leading commercial systems across multiple metrics.
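The summary describes TA-CrossAttn and UniTemp-RoPE only at a high level, and the paper's exact formulations are not reproduced here. As a rough intuition for how cross-modal attention with a shared temporal positional encoding can align two streams, consider the hedged sketch below: audio and video tokens are placed on a common time axis (seconds), and the same rotary positional embedding is applied to queries and keys from both modalities before cross-attention, so co-occurring tokens receive matching phases. The function names (`rope`, `cross_attention`) and the single-head, NumPy-only setup are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotary positional embedding over the last (even-sized) dimension.

    `positions` are real-valued timestamps, so two token streams sampled
    at different rates can share one temporal axis -- the rough idea
    behind a unified temporal RoPE (an assumption, not the paper's spec).
    """
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) / half)       # (half,)
    angles = positions[:, None] * freqs[None, :]    # (T, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1_i, x2_i) pair by its position-dependent angle.
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

def cross_attention(q_tokens, kv_tokens, q_times, kv_times):
    """Queries from one modality attend to keys/values from the other.

    Applying the same time-based RoPE to both streams biases attention
    toward temporally co-located tokens.
    """
    d = q_tokens.shape[-1]
    q = rope(q_tokens, q_times)
    k = rope(kv_tokens, kv_times)
    scores = q @ k.T / np.sqrt(d)                   # (Tq, Tkv)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ kv_tokens

# Example: 4 video tokens and 6 audio tokens covering the same second.
video = np.random.randn(4, 8)
audio = np.random.randn(6, 8)
fused = cross_attention(video, audio,
                        np.linspace(0.0, 1.0, 4),   # video timestamps (s)
                        np.linspace(0.0, 1.0, 6))   # audio timestamps (s)
```

Because the rotation is a pure phase shift, RoPE preserves token norms while encoding relative time offsets directly in the query-key dot products.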

📝 Abstract
Video generation is rapidly evolving towards unified audio-video generation. In this paper, we present ALIVE, a generation model that adapts a pretrained Text-to-Video (T2V) model to Sora-style audio-video generation and animation. In particular, the model unlocks Text-to-Video&Audio (T2VA) and Reference-to-Video&Audio (animation) capabilities beyond those of T2V foundation models. To support audio-visual synchronization and reference animation, we augment the popular MMDiT architecture with a joint audio-video branch that includes TA-CrossAttn for temporally-aligned cross-modal fusion and UniTemp-RoPE for precise audio-visual alignment. Meanwhile, a comprehensive data pipeline consisting of audio-video captioning, quality control, and related stages is carefully designed to collect high-quality finetuning data. Additionally, we introduce a new benchmark for comprehensive model testing and comparison. After continued pretraining and finetuning on million-scale high-quality data, ALIVE demonstrates outstanding performance, consistently outperforming open-source models and matching or surpassing state-of-the-art commercial solutions. With detailed recipes and benchmarks, we hope ALIVE helps the community develop audio-video generation models more efficiently. Official page: https://github.com/FoundationVision/Alive.
Problem

Research questions and friction points this paper is trying to address.

audio-video generation
text-to-video
reference animation
audio-visual synchronization
multimodal generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

audio-video generation
temporal alignment
cross-modal fusion
reference animation
MMDiT architecture