🤖 AI Summary
To address slow inference, high GPU memory consumption, and degraded quality in long-video generation for text-to-video synthesis, this paper proposes Magic141 (Magic 1-For-1), an efficient video generation framework. Magic141 decouples the task into two sequential stages, text-to-image and image-to-video, and targets a real-time paradigm: generating one second of video per second of compute. Its core contributions are: (1) a multimodal prior injection mechanism that jointly encodes textual semantics and motion cues; (2) adversarial diffusion step distillation, which drastically reduces the number of sampling steps; and (3) memory and latency optimization via parameter sparsification and sliding-window inference. Experiments demonstrate that Magic141 generates 5-second video clips in just 3 seconds and produces a minute-long video end-to-end in under 60 seconds, while achieving significantly improved motion coherence and visual fidelity compared to state-of-the-art baselines.
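The two-stage factorization above can be sketched as a toy pipeline: a text-to-image call produces the conditioning frame, and a few-step image-to-video call unrolls it into frames. This is a minimal illustrative sketch only; `text_to_image`, `image_to_video`, and the step count are hypothetical stand-ins, not the actual Magic141 API.

```python
# Hypothetical sketch of the text-to-image / image-to-video factorization.
# All names and numbers here are illustrative, not the real interface.

def text_to_image(prompt: str) -> list[float]:
    """Stage 1 (T2I): toy 'image' as a flat pixel list derived from the prompt."""
    return [float(ord(c) % 7) for c in prompt][:16]

def image_to_video(image, seconds: int, fps: int = 8, steps: int = 4):
    """Stage 2 (I2V): toy few-step sampler that unrolls the image into frames.
    `steps=4` stands in for the distilled sampler's reduced step count."""
    frames, frame = [], list(image)
    for _ in range(seconds * fps):
        for _ in range(steps):              # few distilled steps per frame
            frame = [0.9 * v + 0.1 for v in frame]
        frames.append(list(frame))
    return frames

def generate(prompt: str, seconds: int = 5):
    image = text_to_image(prompt)           # easier sub-task 1: T2I
    return image_to_video(image, seconds)   # easier sub-task 2: I2V

clip = generate("a red fox running", seconds=5)
print(len(clip))  # 40 frames at 8 fps
```

The point of the split is that each stage is an easier distillation target than direct text-to-video, so each can run with far fewer sampling steps.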
📝 Abstract
In this technical report, we present Magic 1-For-1 (Magic141), an efficient video generation model with optimized memory consumption and inference latency. The key idea is simple: factorize the text-to-video generation task into two separate, easier tasks for diffusion step distillation, namely text-to-image generation and image-to-video generation. We verify that, under the same optimization algorithm, the image-to-video task indeed converges more easily than the text-to-video task. We also explore a bag of optimization tricks to reduce the computational cost of training image-to-video (I2V) models from three aspects: 1) faster model convergence via multimodal prior condition injection; 2) lower inference latency via adversarial step distillation; and 3) lower inference memory cost via parameter sparsification. With these techniques, we are able to generate 5-second video clips within 3 seconds. By applying a test-time sliding window, we are able to generate a minute-long video within one minute with significantly improved visual quality and motion dynamics, spending less than one second on average to generate each one-second clip. We conduct a series of preliminary explorations to identify the optimal trade-off between computational cost and video quality during diffusion step distillation, and we hope this can serve as a good foundation model for open-source exploration. The code and the model weights are available at https://github.com/DA-Group-PKU/Magic-1-For-1.
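The test-time sliding window mentioned above can be sketched as chaining short image-to-video windows, where the last frame of each window becomes the image condition for the next. This is a hypothetical sketch of that chaining structure only; `i2v_window` and its parameters are illustrative stand-ins, not the real Magic141 interface.

```python
# Hypothetical sketch of test-time sliding-window generation for a
# minute-long video: each 5-second window is an I2V call conditioned
# on the final frame of the previous window. Names are illustrative.

def i2v_window(first_frame, seconds: int, fps: int = 8):
    """Toy I2V stand-in: drift the conditioning frame forward in time."""
    frames, frame = [], list(first_frame)
    for _ in range(seconds * fps):
        frame = [v + 0.01 for v in frame]   # placeholder 'motion'
        frames.append(list(frame))
    return frames

def sliding_window_generate(first_frame, total_seconds=60, window_seconds=5):
    frames, cond = [], first_frame
    for _ in range(total_seconds // window_seconds):
        window = i2v_window(cond, window_seconds)
        frames.extend(window)
        cond = window[-1]                   # last frame conditions the next window
    return frames

video = sliding_window_generate([0.0] * 4, total_seconds=60, window_seconds=5)
print(len(video))  # 480 frames = 60 s at 8 fps
```

Reusing the final frame as the next window's condition is what keeps per-window cost constant, so average latency stays below one second per second of generated video regardless of total length.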