🤖 AI Summary
This work introduces the first high-fidelity talking video generation framework to support arbitrarily long durations, multiple subject types (realistic portraits, full-body figures, stylized anime), and multi-view synthesis, including back-facing views. To address key challenges (temporal inconsistency in long sequences, weak cross-style generalization, and coarse-grained speaker control), it proposes three innovations: (1) a 3D sliding-window denoising mechanism that ensures long-range temporal coherence; (2) a two-stage multimodal curriculum learning strategy with a region-adaptive masked loss that improves lip-sync accuracy and identity preservation; and (3) a diffusion Transformer (DiT)-based, audio-text-image tri-modal joint driving architecture incorporating 3D full attention and unified-step classifier-free guidance (CFG) distillation. Evaluated on a new benchmark, the method significantly outperforms state-of-the-art approaches while generating a 10-second, 540x540 video in just 10 seconds on 8 H100 GPUs, a 20x inference speedup without quality degradation.
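As a rough illustration of how a sliding-window denoising mechanism can keep an arbitrarily long sequence coherent, the sketch below denoises overlapping temporal chunks and cross-fades their predictions in the overlap region. This is a minimal PyTorch sketch under our own assumptions: the `(B, C, T, H, W)` latent layout, the hypothetical `denoise_window` callable standing in for one DiT forward pass, and the window/overlap sizes are illustrative, not MagicInfinite's actual implementation.

```python
import torch

def sliding_window_denoise_step(latents, denoise_window, window=32, overlap=8):
    """One denoising step over an arbitrarily long latent video (sketch).

    latents:        (B, C, T, H, W) noisy latents at the current timestep.
    denoise_window: callable that processes a (B, C, t, H, W) chunk
                    (stand-in for one DiT forward pass).
    Overlapping windows are blended with linear cross-fade weights so frames
    near window boundaries receive consistent predictions from both sides.
    Assumes window > 2 * overlap.
    """
    B, C, T, H, W = latents.shape
    if T <= window:                      # short clip: no windowing needed
        return denoise_window(latents)

    out = torch.zeros_like(latents)
    weight_sum = torch.zeros(1, 1, T, 1, 1, device=latents.device, dtype=latents.dtype)
    stride = window - overlap
    starts = list(range(0, T - window + 1, stride))
    if starts[-1] + window < T:          # make sure the tail frames are covered
        starts.append(T - window)

    for s in starts:
        e = s + window
        chunk = denoise_window(latents[:, :, s:e])
        w = torch.ones(window, device=latents.device, dtype=latents.dtype)
        if s > 0:                        # fade in over the leading overlap
            w[:overlap] = torch.linspace(0.0, 1.0, overlap,
                                         device=latents.device, dtype=latents.dtype)
        if e < T:                        # fade out over the trailing overlap
            w[-overlap:] = torch.linspace(1.0, 0.0, overlap,
                                          device=latents.device, dtype=latents.dtype)
        w = w.view(1, 1, window, 1, 1)
        out[:, :, s:e] += chunk * w
        weight_sum[:, :, s:e] += w

    return out / weight_sum.clamp(min=1e-6)
```

One reason a scheme like this scales to effectively unbounded length is that each forward pass only ever sees a fixed-length chunk, so compute and memory per step stay bounded regardless of how long the output video is.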
📝 Abstract
We present MagicInfinite, a novel diffusion Transformer (DiT) framework that overcomes traditional portrait animation limitations, delivering high-fidelity results across diverse character types: realistic humans, full-body figures, and stylized anime characters. It supports varied facial poses, including back-facing views, and animates single or multiple characters, with input masks for precise speaker designation in multi-character scenes. Our approach tackles key challenges with three innovations: (1) 3D full-attention mechanisms with a sliding-window denoising strategy, enabling infinite video generation with temporal coherence and visual quality across diverse character styles; (2) a two-stage curriculum learning scheme that integrates audio for lip sync, text for expressive dynamics, and reference images for identity preservation, enabling flexible multi-modal control over long sequences; and (3) region-specific masks with adaptive loss functions that balance global textual control and local audio guidance, supporting speaker-specific animations. Efficiency is further enhanced by our unified step and classifier-free guidance (CFG) distillation techniques, which yield a 20x inference speedup over the base model: a 10-second 540x540 video is generated in 10 seconds, or 720x720 in 30 seconds, on 8 H100 GPUs without quality loss. Evaluations on our new benchmark demonstrate MagicInfinite's superiority in audio-lip synchronization, identity preservation, and motion naturalness across diverse scenarios. It is publicly available at https://www.hedra.com/, with examples at https://magicinfinite.github.io/.
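To make the region-specific masking concrete, below is a minimal sketch of a region-adaptive weighted training loss. It assumes a per-pixel diffusion loss and a binary speaker-region mask; the tensor shapes, the hypothetical `face_weight` hyperparameter, and the simple linear weighting are illustrative assumptions rather than MagicInfinite's exact formulation.

```python
import torch
import torch.nn.functional as F

def region_adaptive_loss(pred, target, speaker_mask, face_weight=2.0):
    """Region-weighted diffusion training loss (illustrative sketch).

    pred, target:  (B, C, T, H, W) model prediction vs. denoising target.
    speaker_mask:  (B, 1, T, H, W) binary mask over the designated speaker's
                   face/mouth region, so audio-driven lip motion is supervised
                   more strongly there while text conditioning governs the
                   rest of the frame.
    face_weight:   assumed hyperparameter controlling how much extra weight
                   the speaker region receives.
    """
    per_pixel = F.mse_loss(pred, target, reduction="none")
    weights = 1.0 + (face_weight - 1.0) * speaker_mask  # 1 outside, face_weight inside
    return (per_pixel * weights).mean()
```

In a multi-character scene, the same mask can double as the speaker-designation signal: only the masked character's mouth region is tied tightly to the audio, while the remaining content follows the global text prompt.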