EFlow: Fast Few-Step Video Generator Training from Scratch via Efficient Solution Flow

📅 2026-03-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two limitations of video diffusion models: the high computational cost of attention and the substantial overhead of multi-step sampling. To overcome these challenges, the authors propose EFlow, a framework that trains few-step video generators from scratch via an efficient solution-flow objective. The method introduces a gated local-global attention module that supports random token dropping to reduce per-step compute. In addition, path-drop guided training and a mean-velocity additivity regularizer preserve generation quality even at extremely low inference step counts. Experiments show that EFlow achieves competitive performance on Kinetics and large-scale text-to-video datasets while delivering up to 2.5× higher training throughput and 45.3× lower inference latency.
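The gated local-global attention module is described only at a high level above; the following is a minimal, hypothetical PyTorch sketch of how such a token-droppable hybrid block could look. The window size, sigmoid gating, and drop mechanics are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedLocalGlobalAttention(nn.Module):
    """Hypothetical hybrid block: windowed local attention plus full global
    attention, fused by a learned per-token gate, with optional random
    token dropping during training to cut per-step compute."""

    def __init__(self, dim: int, heads: int = 8, window: int = 64):
        super().__init__()
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Linear(dim, 1)   # per-token mixing weight
        self.window = window

    def forward(self, x: torch.Tensor, drop_ratio: float = 0.0) -> torch.Tensor:
        b, n, d = x.shape
        if self.training and drop_ratio > 0:
            # Keep a random subset of tokens; attention cost shrinks with n.
            keep = max(1, int(n * (1 - drop_ratio)))
            idx = torch.rand(b, n, device=x.device).argsort(dim=1)[:, :keep]
            x = x.gather(1, idx.unsqueeze(-1).expand(-1, -1, d))
            n = keep
        # Local branch: self-attention inside non-overlapping windows.
        pad = (-n) % self.window
        xp = F.pad(x, (0, 0, 0, pad))
        w = xp.shape[1] // self.window
        xl = xp.reshape(b * w, self.window, d)
        loc, _ = self.local_attn(xl, xl, xl)
        loc = loc.reshape(b, w * self.window, d)[:, :n]
        # Global branch: full attention over the (possibly shortened) sequence.
        glo, _ = self.global_attn(x, x, x)
        # Gated fusion of local and global context.
        g = torch.sigmoid(self.gate(x))
        return g * loc + (1 - g) * glo
```

In a full model the dropped tokens would typically be scattered back into the sequence (or handled by later layers) before decoding; the sketch only shows how dropping shrinks the per-step attention cost.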
📝 Abstract
Scaling video diffusion transformers is fundamentally bottlenecked by two compounding costs: the quadratic complexity of attention at every step, and the many iterative sampling steps. In this work, we propose EFlow, an efficient few-step training framework that tackles both bottlenecks simultaneously. To reduce sampling steps, we build on a solution-flow objective that learns a function mapping a noised state at time t to the state at time s. Making this formulation computationally feasible and high-quality at video scale, however, demands two complementary innovations. First, we propose Gated Local-Global Attention, a token-droppable hybrid block that is efficient and expressive, and remains highly stable under aggressive random token dropping, substantially reducing per-step compute. Second, we develop an efficient few-step training recipe: Path-Drop Guided training replaces the expensive guidance target with a computationally cheap, weak path, and a Mean-Velocity Additivity regularizer ensures high fidelity at extremely low step counts. Together, these components make EFlow a practical from-scratch training pipeline, achieving up to 2.5x higher training throughput than standard solution flow and 45.3x lower inference latency than standard iterative models, with competitive performance on Kinetics and large-scale text-to-video datasets.
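To make the solution-flow objective and the mean-velocity additivity idea concrete, below is a minimal sketch in PyTorch, assuming a linear noising path x_t = (1 - t)·x0 + t·eps and a network u(x, t, s) that predicts the mean velocity of the jump from time t to time s (so x_s ≈ x_t + (s - t)·u). Both losses, and all names, are illustrative assumptions, not EFlow's exact formulation, and the path-drop guidance component is not sketched here.

```python
import torch

def _bcast(a: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
    # Reshape per-sample scalars (b,) for broadcasting against (b, ...) tensors.
    return a.view(-1, *([1] * (ref.dim() - 1)))

def jump_loss(u, x0: torch.Tensor) -> torch.Tensor:
    """Train u(x_t, t, s) so that x_t + (s - t) * u(x_t, t, s) lands on x_s."""
    b = x0.shape[0]
    eps = torch.randn_like(x0)                     # Gaussian endpoint of the path
    t = torch.rand(b, device=x0.device)            # current noise level
    s = torch.rand(b, device=x0.device) * t        # jump target, s < t
    x_t = (1 - _bcast(t, x0)) * x0 + _bcast(t, x0) * eps
    x_s = (1 - _bcast(s, x0)) * x0 + _bcast(s, x0) * eps
    pred = x_t + _bcast(s - t, x0) * u(x_t, t, s)  # one-jump prediction of x_s
    return torch.mean((pred - x_s) ** 2)

def additivity_reg(u, x_t: torch.Tensor, t, r, s) -> torch.Tensor:
    """Displacement additivity for times s < r < t: the direct jump
    (s - t) * u(t->s) should equal (r - t) * u(t->r) + (s - r) * u(r->s)."""
    d_direct = _bcast(s - t, x_t) * u(x_t, t, s)     # direct jump t -> s
    d1 = _bcast(r - t, x_t) * u(x_t, t, r)           # first hop t -> r
    x_r = x_t + d1
    d2 = _bcast(s - r, x_t) * u(x_r.detach(), r, s)  # second hop (stop-grad)
    return torch.mean((d_direct - (d1 + d2)) ** 2)
```

The stop-gradient on the intermediate state is a common stabilization choice in consistency-style objectives; whether EFlow uses one is an assumption here.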
Problem

Research questions and friction points this paper is trying to address.

video diffusion
attention complexity
iterative sampling
training efficiency
inference latency
Innovation

Methods, ideas, or system contributions that make the work stand out.

EFlow
Gated Local-Global Attention
Solution Flow
Few-Step Training
Video Diffusion Transformer