🤖 AI Summary
This work introduces the first high-fidelity talking video generation framework to support arbitrarily long durations, multiple subject types (realistic portraits, full-body figures, stylized anime), and multi-view synthesis, including back-facing views. To address key challenges (temporal inconsistency in long sequences, weak cross-style generalization, and coarse-grained speaker control), it proposes three innovations: (1) a 3D sliding-window denoising mechanism that ensures long-range temporal coherence; (2) a two-stage multimodal curriculum learning strategy with a region-adaptive masked loss that improves lip-sync accuracy and identity preservation; and (3) a diffusion Transformer (DiT)-based, audio-text-image tri-modal joint driving architecture incorporating 3D full attention and unified-step classifier-free guidance (CFG) distillation. Evaluated on a new benchmark, the method significantly outperforms state-of-the-art approaches while generating a 10-second, 540x540 video in just 10 seconds on 8 H100 GPUs, a 20x inference speedup without quality degradation.
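As a rough illustration of how a sliding-window denoising mechanism can keep an arbitrarily long sequence coherent, the sketch below denoises overlapping temporal chunks and cross-fades their predictions in the overlap region. This is a minimal PyTorch sketch under our own assumptions: the `(B, C, T, H, W)` latent layout, the hypothetical `denoise_window` callable standing in for one DiT forward pass, and the window/overlap sizes are illustrative, not MagicInfinite's actual implementation.

```python
import torch

def sliding_window_denoise_step(latents, denoise_window, window=32, overlap=8):
    """One denoising step over an arbitrarily long latent video (sketch).

    latents:        (B, C, T, H, W) noisy latents at the current timestep.
    denoise_window: callable that processes a (B, C, t, H, W) chunk
                    (stand-in for one DiT forward pass).
    Overlapping windows are blended with linear cross-fade weights so frames
    near window boundaries receive consistent predictions from both sides.
    Assumes window > 2 * overlap.
    """
    B, C, T, H, W = latents.shape
    if T <= window:                      # short clip: no windowing needed
        return denoise_window(latents)

    out = torch.zeros_like(latents)
    weight_sum = torch.zeros(1, 1, T, 1, 1, device=latents.device, dtype=latents.dtype)
    stride = window - overlap
    starts = list(range(0, T - window + 1, stride))
    if starts[-1] + window < T:          # make sure the tail frames are covered
        starts.append(T - window)

    for s in starts:
        e = s + window
        chunk = denoise_window(latents[:, :, s:e])
        w = torch.ones(window, device=latents.device, dtype=latents.dtype)
        if s > 0:                        # fade in over the leading overlap
            w[:overlap] = torch.linspace(0.0, 1.0, overlap,
                                         device=latents.device, dtype=latents.dtype)
        if e < T:                        # fade out over the trailing overlap
            w[-overlap:] = torch.linspace(1.0, 0.0, overlap,
                                          device=latents.device, dtype=latents.dtype)
        w = w.view(1, 1, window, 1, 1)
        out[:, :, s:e] += chunk * w
        weight_sum[:, :, s:e] += w

    return out / weight_sum.clamp(min=1e-6)
```

One reason a scheme like this scales to effectively unbounded length is that each forward pass only ever sees a fixed-length chunk, so compute and memory per step stay bounded regardless of how long the output video is.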
📝 Abstract
We present MagicInfinite, a novel diffusion Transformer (DiT) framework that overcomes traditional portrait animation limitations, delivering high-fidelity results across diverse character types: realistic humans, full-body figures, and stylized anime characters. It supports varied facial poses, including back-facing views, and animates single or multiple characters, with input masks for precise speaker designation in multi-character scenes. Our approach tackles key challenges with three innovations: (1) 3D full-attention mechanisms with a sliding-window denoising strategy, enabling infinite video generation with temporal coherence and visual quality across diverse character styles; (2) a two-stage curriculum learning scheme that integrates audio for lip sync, text for expressive dynamics, and reference images for identity preservation, enabling flexible multi-modal control over long sequences; and (3) region-specific masks with adaptive loss functions that balance global textual control and local audio guidance, supporting speaker-specific animations. Efficiency is further enhanced by our unified step and classifier-free guidance (CFG) distillation techniques, which yield a 20x inference speedup over the base model: a 10-second 540x540 video is generated in 10 seconds, or 720x720 in 30 seconds, on 8 H100 GPUs without quality loss. Evaluations on our new benchmark demonstrate MagicInfinite's superiority in audio-lip synchronization, identity preservation, and motion naturalness across diverse scenarios. It is publicly available at https://www.hedra.com/, with examples at https://magicinfinite.github.io/.
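To make the region-specific masking concrete, below is a minimal sketch of a region-adaptive weighted training loss. It assumes a per-pixel diffusion loss and a binary speaker-region mask; the tensor shapes, the hypothetical `face_weight` hyperparameter, and the simple linear weighting are illustrative assumptions rather than MagicInfinite's exact formulation.

```python
import torch
import torch.nn.functional as F

def region_adaptive_loss(pred, target, speaker_mask, face_weight=2.0):
    """Region-weighted diffusion training loss (illustrative sketch).

    pred, target:  (B, C, T, H, W) model prediction vs. denoising target.
    speaker_mask:  (B, 1, T, H, W) binary mask over the designated speaker's
                   face/mouth region, so audio-driven lip motion is supervised
                   more strongly there while text conditioning governs the
                   rest of the frame.
    face_weight:   assumed hyperparameter controlling how much extra weight
                   the speaker region receives.
    """
    per_pixel = F.mse_loss(pred, target, reduction="none")
    weights = 1.0 + (face_weight - 1.0) * speaker_mask  # 1 outside, face_weight inside
    return (per_pixel * weights).mean()
```

In a multi-character scene, the same mask can double as the speaker-designation signal: only the masked character's mouth region is tied tightly to the audio, while the remaining content follows the global text prompt.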