S2DiT: Sandwich Diffusion Transformer for Mobile Streaming Video Generation

📅 2026-01-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high computational cost of Diffusion Transformers (DiTs) in video generation, which prevents real-time streaming on mobile devices. The authors propose S2DiT, an efficient streaming video generation model designed for mobile deployment. Its key components are a sandwich architecture found through a budget-aware dynamic programming search, two efficient attention mechanisms, LinConv Hybrid Attention (LCHA) and Stride Self-Attention (SSA), and a 2-in-1 knowledge distillation framework. S2DiT streams at over 10 FPS on an iPhone while maintaining visual quality comparable to state-of-the-art server-grade video models.

📝 Abstract
Diffusion Transformers (DiTs) have recently improved video generation quality. However, their heavy computational cost makes real-time or on-device generation infeasible. In this work, we introduce S2DiT, a Streaming Sandwich Diffusion Transformer designed for efficient, high-fidelity, and streaming video generation on mobile hardware. S2DiT generates more tokens but maintains efficiency with novel efficient attentions: a mixture of LinConv Hybrid Attention (LCHA) and Stride Self-Attention (SSA). Based on this, we uncover the sandwich design via a budget-aware dynamic programming search, achieving superior quality and efficiency. We further propose a 2-in-1 distillation framework that transfers the capacity of large teacher models (e.g., Wan 2.2-14B) to the compact few-step sandwich model. Together, S2DiT achieves quality on par with state-of-the-art server video models, while streaming at over 10 FPS on an iPhone.
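
The abstract does not spell out how the budget-aware dynamic programming search works, so the following is only a minimal sketch of the general idea, not the authors' procedure. It assumes a multiple-choice knapsack formulation: each layer picks exactly one block type, each choice has a hypothetical integer cost and quality score, and the goal is the highest total score within a fixed cost budget. The function name `search_layout` and all numbers are illustrative assumptions.

```python
from typing import List, Optional, Tuple


def search_layout(
    costs: List[List[int]],
    scores: List[List[float]],
    budget: int,
) -> Optional[Tuple[List[int], float]]:
    """Pick one block type per layer to maximize total score under a cost budget.

    costs[i][k] and scores[i][k] are hypothetical values for using block
    type k at layer i (assumed inputs, not measurements from the paper).
    Returns (layout, total_score), or None if no layout fits the budget.
    """
    neg_inf = float("-inf")
    # best[b]: best score for the layers processed so far with total cost <= b.
    best = [0.0] * (budget + 1)
    # choice[i][b]: block type chosen at layer i when b budget is available.
    choice: List[List[int]] = []

    for layer_costs, layer_scores in zip(costs, scores):
        new_best = [neg_inf] * (budget + 1)
        pick = [-1] * (budget + 1)
        for b in range(budget + 1):
            for k, (c, s) in enumerate(zip(layer_costs, layer_scores)):
                if c <= b and best[b - c] != neg_inf:
                    candidate = best[b - c] + s
                    if candidate > new_best[b]:
                        new_best[b] = candidate
                        pick[b] = k
        best = new_best
        choice.append(pick)

    if best[budget] == neg_inf:
        return None

    # Walk backwards through the recorded choices to recover the layout.
    layout: List[int] = []
    b = budget
    for i in range(len(costs) - 1, -1, -1):
        k = choice[i][b]
        layout.append(k)
        b -= costs[i][k]
    layout.reverse()
    return layout, best[budget]
```

With made-up inputs where type 0 stands in for a cheaper block and type 1 for a costlier one, `search_layout([[1, 3], [1, 3], [1, 3]], [[1.0, 3.0], [1.0, 1.5], [1.0, 3.0]], 7)` returns `([1, 0, 1], 7.0)`: the costlier block is spent on the outer layers, where the assumed scores reward it most, and the cheaper block fills the middle.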

Problem

Research questions and friction points this paper is trying to address.

Diffusion Transformers
video generation
mobile streaming
computational efficiency
on-device inference

Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion Transformer
Efficient Attention
Sandwich Architecture
Mobile Video Generation
Model Distillation