🤖 AI Summary
To address slow inference, high GPU memory consumption, and degraded quality in long-video generation for text-to-video synthesis, this paper proposes Magic141 (Magic 1-For-1), an efficient video generation framework. Magic141 decouples the task into two sequential stages, text-to-image and image-to-video, and targets a real-time paradigm: generating one second of video per second of compute. Its core contributions are: (1) a multimodal prior injection mechanism that jointly encodes textual semantics and motion cues; (2) adversarial diffusion step distillation, which drastically reduces the number of sampling steps; and (3) memory and latency optimization via parameter sparsification and sliding-window inference. Experiments demonstrate that Magic141 generates 5-second video clips in just 3 seconds and produces a minute-long video end-to-end in under 60 seconds, while achieving significantly improved motion coherence and visual fidelity compared to state-of-the-art baselines.
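The two-stage factorization above can be sketched as a toy pipeline: a text-to-image call produces the conditioning frame, and a few-step image-to-video call unrolls it into frames. This is a minimal illustrative sketch only; `text_to_image`, `image_to_video`, and the step count are hypothetical stand-ins, not the actual Magic141 API.

```python
# Hypothetical sketch of the text-to-image / image-to-video factorization.
# All names and numbers here are illustrative, not the real interface.

def text_to_image(prompt: str) -> list[float]:
    """Stage 1 (T2I): toy 'image' as a flat pixel list derived from the prompt."""
    return [float(ord(c) % 7) for c in prompt][:16]

def image_to_video(image, seconds: int, fps: int = 8, steps: int = 4):
    """Stage 2 (I2V): toy few-step sampler that unrolls the image into frames.
    `steps=4` stands in for the distilled sampler's reduced step count."""
    frames, frame = [], list(image)
    for _ in range(seconds * fps):
        for _ in range(steps):              # few distilled steps per frame
            frame = [0.9 * v + 0.1 for v in frame]
        frames.append(list(frame))
    return frames

def generate(prompt: str, seconds: int = 5):
    image = text_to_image(prompt)           # easier sub-task 1: T2I
    return image_to_video(image, seconds)   # easier sub-task 2: I2V

clip = generate("a red fox running", seconds=5)
print(len(clip))  # 40 frames at 8 fps
```

The point of the split is that each stage is an easier distillation target than direct text-to-video, so each can run with far fewer sampling steps.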
📝 Abstract
In this technical report, we present Magic 1-For-1 (Magic141), an efficient video generation model with optimized memory consumption and inference latency. The key idea is simple: factorize the text-to-video generation task into two separate, easier tasks for diffusion step distillation, namely text-to-image generation and image-to-video generation. We verify that, under the same optimization algorithm, the image-to-video task indeed converges more easily than the text-to-video task. We also explore a bag of optimization tricks to reduce the computational cost of training image-to-video (I2V) models from three aspects: 1) faster model convergence via multimodal prior condition injection; 2) lower inference latency via adversarial step distillation; and 3) lower inference memory cost via parameter sparsification. With these techniques, we are able to generate 5-second video clips within 3 seconds. By applying a test-time sliding window, we are able to generate a minute-long video within one minute with significantly improved visual quality and motion dynamics, spending less than one second on average to generate each one-second clip. We conduct a series of preliminary explorations to identify the optimal trade-off between computational cost and video quality during diffusion step distillation, and we hope this can serve as a good foundation model for open-source exploration. The code and the model weights are available at https://github.com/DA-Group-PKU/Magic-1-For-1.
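The test-time sliding window mentioned above can be sketched as chaining short image-to-video windows, where the last frame of each window becomes the image condition for the next. This is a hypothetical sketch of that chaining structure only; `i2v_window` and its parameters are illustrative stand-ins, not the real Magic141 interface.

```python
# Hypothetical sketch of test-time sliding-window generation for a
# minute-long video: each 5-second window is an I2V call conditioned
# on the final frame of the previous window. Names are illustrative.

def i2v_window(first_frame, seconds: int, fps: int = 8):
    """Toy I2V stand-in: drift the conditioning frame forward in time."""
    frames, frame = [], list(first_frame)
    for _ in range(seconds * fps):
        frame = [v + 0.01 for v in frame]   # placeholder 'motion'
        frames.append(list(frame))
    return frames

def sliding_window_generate(first_frame, total_seconds=60, window_seconds=5):
    frames, cond = [], first_frame
    for _ in range(total_seconds // window_seconds):
        window = i2v_window(cond, window_seconds)
        frames.extend(window)
        cond = window[-1]                   # last frame conditions the next window
    return frames

video = sliding_window_generate([0.0] * 4, total_seconds=60, window_seconds=5)
print(len(video))  # 480 frames = 60 s at 8 fps
```

Reusing the final frame as the next window's condition is what keeps per-window cost constant, so average latency stays below one second per second of generated video regardless of total length.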