🤖 AI Summary
This work addresses the high computational cost and inference latency of full attention mechanisms in video diffusion Transformers caused by long input sequences. To mitigate this, the authors propose an efficient attention architecture that integrates lightweight linear attention and sparse attention in parallel, coupled with an input-dependent dynamic gating mechanism to adaptively fuse the two. Remarkably, with only 2,000 video samples and 1,600 fine-tuning steps, the method achieves a 90% attention sparsity rate and a 1.72× speedup in inference while maintaining generation quality on par with the full-attention baseline. The core innovations lie in the dynamic gating fusion strategy and an efficient fine-tuning paradigm, which together significantly reduce computational overhead without compromising generative performance.
📝 Abstract
Diffusion Transformers have recently demonstrated remarkable performance in video generation. However, long input sequences result in high computational latency due to the quadratic complexity of full attention. Various sparse attention mechanisms have been proposed to alleviate this cost. Training-free sparse attention is constrained by limited sparsity and thus offers only modest acceleration, whereas training-based methods can reach much higher sparsity but demand substantial data and computation for training. In this work, we propose SALAD, which introduces a lightweight linear attention branch in parallel with sparse attention. By incorporating an input-dependent gating mechanism to finely balance the two branches, our method attains 90% sparsity and a 1.72× inference speedup, while maintaining generation quality comparable to the full-attention baseline. Moreover, our fine-tuning process is highly efficient, requiring only 2,000 video samples and 1,600 training steps with a batch size of 8.
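The dual-branch design described above can be sketched in simplified form: a sparse attention branch and a linear attention branch are computed in parallel, and a per-token sigmoid gate mixes their outputs. This is a minimal NumPy sketch under stated assumptions, not the paper's implementation: the kernel feature map (here ELU+1), the sparsity pattern (here a toy local window), and the gate parameterization (here a single learned projection `w_gate`) are all placeholders the abstract does not specify.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def linear_attention(q, k, v):
    """O(N*d^2) attention via a kernel feature map (ELU+1 is a common
    choice; the paper's exact map is not given in the abstract)."""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))
    qf, kf = phi(q), phi(k)
    kv = kf.T @ v                                  # (d, d) key-value summary
    z = qf @ kf.sum(axis=0, keepdims=True).T       # (N, 1) normalizer
    return (qf @ kv) / (z + 1e-6)

def sparse_attention(q, k, v, window=4):
    """Toy local-window sparsity; SALAD's actual sparsity pattern
    is not described in the abstract."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    idx = np.arange(n)
    scores[np.abs(idx[:, None] - idx[None, :]) > window] = -1e9
    return softmax(scores) @ v

def gated_fusion(x, q, k, v, w_gate):
    """Input-dependent gate: per-token sigmoid weight mixes the
    sparse and linear branches (w_gate is a hypothetical parameter)."""
    g = 1.0 / (1.0 + np.exp(-(x @ w_gate)))        # (N, 1) in (0, 1)
    return g * sparse_attention(q, k, v) + (1.0 - g) * linear_attention(q, k, v)
```

With a zero gate projection the sigmoid outputs 0.5 everywhere, so the fused result is simply the average of the two branches; in training, the gate would instead learn where the cheap linear branch suffices and where sparse attention is needed.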