🤖 AI Summary
Current text-to-video (T2V) models lack explicit modeling of physical laws, resulting in monotonous and physically implausible motion generation. To address this, we propose Metamorphic Video Generation, a framework that uses time-lapse videos as supervisory signals, introducing natural evolutionary processes as explicit supervision for learning physical commonsense. Methodologically, we design MagicAdapter for spatiotemporally decoupled modeling; introduce a dynamic frame-sampling strategy to accommodate drastic deformations; develop a Magic text encoder to enhance semantic understanding of metamorphic concepts; and release ChronoMagic, the first dedicated temporal-metamorphosis dataset. We perform physics-aware fine-tuning on pretrained T2V models, achieving significant improvements in physical plausibility and dynamic diversity across melting, growth, corrosion, and other metamorphic scenarios. Our approach establishes new state-of-the-art performance across multiple metrics, empirically supporting time-lapse videos as an effective source of supervision for simulating physical evolution.
📝 Abstract
Recent advances in Text-to-Video generation (T2V) have achieved remarkable success in synthesizing high-quality general videos from textual descriptions. A largely overlooked problem in T2V is that existing models have not adequately encoded physical knowledge of the real world, so generated videos tend to exhibit limited motion and poor variation. In this paper, we propose **MagicTime**, a metamorphic time-lapse video generation model, which learns real-world physics knowledge from time-lapse videos and implements metamorphic generation. First, we design a MagicAdapter scheme to decouple spatial and temporal training, encode more physical knowledge from metamorphic videos, and transform pre-trained T2V models to generate metamorphic videos. Second, we introduce a Dynamic Frames Extraction strategy to adapt to metamorphic time-lapse videos, which have a wider variation range and cover dramatic object metamorphic processes, thus embodying more physical knowledge than general videos. Third, we introduce a Magic Text-Encoder to improve the understanding of metamorphic video prompts. Furthermore, we create a time-lapse video-text dataset called **ChronoMagic**, specifically curated to unlock the metamorphic video generation ability. Extensive experiments demonstrate the superiority and effectiveness of MagicTime for generating high-quality and dynamic metamorphic videos, suggesting that time-lapse video generation is a promising path toward building metamorphic simulators of the physical world. Code: https://github.com/PKU-YuanGroup/MagicTime
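The core idea behind strategies like Dynamic Frames Extraction is to sample training frames that span the entire clip, so a full metamorphic process (e.g. a flower blooming end to end) is captured rather than a narrow temporal window. A minimal sketch of such uniform-coverage sampling is shown below; the function name, signature, and clamping behavior are illustrative assumptions, not the paper's actual implementation:

```python
def sample_dynamic_frames(total_frames: int, num_samples: int = 16) -> list[int]:
    """Pick evenly spaced frame indices covering the whole clip.

    For a long time-lapse video this guarantees that the sampled frames
    span the complete transformation (first frame through last frame),
    instead of a contiguous segment that misses most of the change.
    """
    if total_frames <= num_samples:
        # Short clip: just take every frame.
        return list(range(total_frames))
    # Evenly spaced positions from index 0 to index total_frames - 1.
    step = (total_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]
```

For example, a 100-frame time-lapse sampled at 5 frames yields indices spread from the first to the last frame, so the dramatic start and end states of the metamorphosis are always included.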