🤖 AI Summary
This work addresses the high computational cost of Diffusion Transformers in video generation, which hinders real-time streaming on mobile devices. To overcome this limitation, the authors propose S2DiT, an efficient streaming video generation model tailored for mobile deployment. Key innovations include a sandwich architecture, LinConv Hybrid Attention (LCHA), Strided Self-Attention (SSA), budget-aware dynamic programming search, and a 2-in-1 knowledge distillation framework. S2DiT achieves real-time performance exceeding 10 FPS on an iPhone while maintaining visual quality comparable to state-of-the-art server-grade models, substantially advancing the feasibility of high-quality on-device video generation.
📝 Abstract
Diffusion Transformers (DiTs) have recently improved video generation quality. However, their heavy computational cost makes real-time or on-device generation infeasible. In this work, we introduce S2DiT, a Streaming Sandwich Diffusion Transformer designed for efficient, high-fidelity, streaming video generation on mobile hardware. S2DiT generates more tokens yet remains efficient through novel attention mechanisms: a mixture of LinConv Hybrid Attention (LCHA) and Strided Self-Attention (SSA). Building on these, we discover the sandwich design via a budget-aware dynamic programming search, achieving superior quality and efficiency. We further propose a 2-in-1 distillation framework that transfers the capacity of large teacher models (e.g., Wan 2.2-14B) to the compact few-step sandwich model. Together, S2DiT achieves quality on par with state-of-the-art server video models while streaming at over 10 FPS on an iPhone.
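The abstract does not spell out how Strided Self-Attention reduces cost. As a rough, hedged illustration of the generic strided-attention idea only (not the paper's actual SSA design; the function name, identity projections, and shapes below are my own assumptions), each query can attend to only every `stride`-th key/value, shrinking the attention matrix from T×T to roughly T×(T/stride):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def strided_self_attention(x, stride):
    """Toy strided attention: every token queries only every
    `stride`-th position, so the score matrix is (T, ceil(T/stride))
    instead of (T, T). Projections are identity for brevity
    (a real model would use learned Q/K/V projections)."""
    T, d = x.shape
    q = x                      # queries: all T tokens
    k = x[::stride]            # keys: strided subset of tokens
    v = x[::stride]            # values: same strided subset
    scores = q @ k.T / np.sqrt(d)   # (T, ceil(T/stride))
    return softmax(scores) @ v      # (T, d), same shape as input

# Toy usage: 16 tokens of dimension 8, attending over every 4th token
x = np.random.randn(16, 8)
out = strided_self_attention(x, stride=4)
print(out.shape)  # (16, 8)
```

With stride 4 the score matrix has 16×4 entries instead of 16×16, a 4× reduction in attention compute; the actual savings and token layout in S2DiT may differ.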