🤖 AI Summary
This work addresses the high computational cost of Diffusion Transformers in video generation, which hinders real-time streaming on mobile devices. To overcome this limitation, the authors propose S2DiT, an efficient streaming video generation model tailored for mobile deployment. Key innovations include a sandwich architecture, LinConv Hybrid Attention (LCHA), Strided Self-Attention (SSA), budget-aware dynamic programming search, and a 2-in-1 knowledge distillation framework. S2DiT achieves real-time performance exceeding 10 FPS on an iPhone while maintaining visual quality comparable to state-of-the-art server-grade models, substantially advancing the feasibility of high-quality on-device video generation.
📝 Abstract
Diffusion Transformers (DiTs) have recently improved video generation quality. However, their heavy computational cost makes real-time or on-device generation infeasible. In this work, we introduce S2DiT, a Streaming Sandwich Diffusion Transformer designed for efficient, high-fidelity, streaming video generation on mobile hardware. S2DiT generates more tokens yet remains efficient through novel attention mechanisms: a mixture of LinConv Hybrid Attention (LCHA) and Strided Self-Attention (SSA). Building on these, we discover the sandwich design via a budget-aware dynamic programming search, achieving superior quality and efficiency. We further propose a 2-in-1 distillation framework that transfers the capacity of large teacher models (e.g., Wan 2.2-14B) to the compact few-step sandwich model. Together, S2DiT achieves quality on par with state-of-the-art server video models while streaming at over 10 FPS on an iPhone.
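The abstract does not spell out how Strided Self-Attention reduces cost. As a rough, hedged illustration of the generic strided-attention idea only (not the paper's actual SSA design; the function name, identity projections, and shapes below are my own assumptions), each query can attend to only every `stride`-th key/value, shrinking the attention matrix from T×T to roughly T×(T/stride):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def strided_self_attention(x, stride):
    """Toy strided attention: every token queries only every
    `stride`-th position, so the score matrix is (T, ceil(T/stride))
    instead of (T, T). Projections are identity for brevity
    (a real model would use learned Q/K/V projections)."""
    T, d = x.shape
    q = x                      # queries: all T tokens
    k = x[::stride]            # keys: strided subset of tokens
    v = x[::stride]            # values: same strided subset
    scores = q @ k.T / np.sqrt(d)   # (T, ceil(T/stride))
    return softmax(scores) @ v      # (T, d), same shape as input

# Toy usage: 16 tokens of dimension 8, attending over every 4th token
x = np.random.randn(16, 8)
out = strided_self_attention(x, stride=4)
print(out.shape)  # (16, 8)
```

With stride 4 the score matrix has 16×4 entries instead of 16×16, a 4× reduction in attention compute; the actual savings and token layout in S2DiT may differ.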