🤖 AI Summary
This work addresses two limitations of video diffusion models: the high computational cost of attention mechanisms and the substantial overhead of multi-step sampling. To overcome these challenges, the authors propose EFlow, a framework that enables few-step video generation through an efficient solution-flow training objective. The method introduces a gated local-global attention module that supports random token dropping to improve computational efficiency. In addition, path-drop guided training and a mean-velocity additivity regularizer are designed to preserve generation quality even at extremely low inference step counts. Experiments demonstrate that EFlow achieves competitive performance on Kinetics and large-scale text-to-video datasets, while offering a 2.5× improvement in training throughput and a 45.3× reduction in inference latency.
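The summary's gated local-global attention can be illustrated with a toy sketch. This is not the paper's module: the window size, the subset-based token dropping on the global branch, and the sigmoid gate are all illustrative assumptions; in the actual model the gate would be learned.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Plain scaled dot-product attention.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def gated_local_global(x, window=4, keep_ratio=0.75, rng=None):
    """Toy hybrid of windowed (local) and full (global) attention, mixed by a
    per-token gate, with random token dropping on the global branch."""
    n, d = x.shape
    # Local branch: attention restricted to non-overlapping windows.
    local = np.vstack([attention(x[i:i+window], x[i:i+window], x[i:i+window])
                       for i in range(0, n, window)])
    # Global branch: every query attends over a random *subset* of keys/values
    # (token dropping), which is what cuts per-step compute.
    if rng is None:
        rng = np.random.default_rng(0)
    keep = np.sort(rng.choice(n, size=max(1, int(n * keep_ratio)), replace=False))
    global_out = attention(x, x[keep], x[keep])
    # Gate: a fixed sigmoid of a per-token score here; learned in practice.
    gate = 1.0 / (1.0 + np.exp(-x.mean(axis=-1, keepdims=True)))
    return gate * local + (1.0 - gate) * global_out

# Usage: output keeps the full token count even though the global branch
# only attended over a dropped subset of keys.
x = np.random.default_rng(1).standard_normal((8, 16))
y = gated_local_global(x)
print(y.shape)  # (8, 16)
```

The point of the sketch is the stability argument: since the local branch always sees every token, randomly dropping keys from the global branch degrades the output gracefully rather than removing tokens outright.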
📝 Abstract
Scaling video diffusion transformers is fundamentally bottlenecked by two compounding costs: the quadratic complexity of attention at each step, and the large number of iterative sampling steps. In this work, we propose EFlow, an efficient few-step training framework that tackles both bottlenecks simultaneously. To reduce sampling steps, we build on a solution-flow objective that learns a function mapping a noised state at time t directly to time s. Making this formulation computationally feasible and high-quality at video scale, however, demands two complementary innovations. First, we propose Gated Local-Global Attention, a token-droppable hybrid block that is efficient and expressive, and remains highly stable under aggressive random token dropping, substantially reducing per-step compute. Second, we develop an efficient few-step training recipe: Path-Drop Guided training replaces the expensive guidance target with a computationally cheap, weak path, and a Mean-Velocity Additivity regularizer preserves fidelity at extremely low step counts. Together, these components make EFlow a practical from-scratch training pipeline, achieving up to 2.5× higher training throughput than standard solution-flow training, and 45.3× lower inference latency than standard iterative models, with competitive performance on Kinetics and large-scale text-to-video datasets.
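The solution-flow view and the additivity regularizer can be sketched concretely. Under the standard average-velocity formulation (as in MeanFlow-style models), one step gives x_s = x_t + (s − t)·u(x_t, t, s), and the displacement from t to s must equal the two-hop displacement through any intermediate time r. The function names below are illustrative, not the paper's API; the exact form of EFlow's regularizer is an assumption here.

```python
import numpy as np

def additivity_residual(u, x_t, t, r, s):
    """Squared residual of the mean-velocity additivity identity
        (s - t) u(x_t, t, s) = (r - t) u(x_t, t, r) + (s - r) u(x_r, r, s),
    usable as a regularization loss on the model u."""
    x_r = x_t + (r - t) * u(x_t, t, r)           # hop t -> r
    direct = (s - t) * u(x_t, t, s)              # one-hop displacement t -> s
    two_hop = (r - t) * u(x_t, t, r) + (s - r) * u(x_r, r, s)
    return float(np.mean((direct - two_hop) ** 2))

# Toy "model": on a straight-line (rectified-flow-style) path the true average
# velocity is constant in (t, s), so the identity holds exactly.
v_true = np.array([1.5, -0.5])
u_exact = lambda x, t, s: v_true * np.ones_like(x)
print(additivity_residual(u_exact, np.zeros(2), 0.0, 0.4, 1.0))  # ≈ 0.0

# A model whose prediction depends inconsistently on (t, s) violates the
# identity, so the residual becomes a nonzero training signal.
u_bad = lambda x, t, s: v_true * (1.0 + s - t)
print(additivity_residual(u_bad, np.zeros(2), 0.0, 0.4, 1.0))    # > 0
```

Enforcing this residual over random (t, r, s) triples is what lets a few large steps stay consistent with many small ones, which is the property that matters at extremely low step counts.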