Taming Diffusion Transformer for Real-Time Mobile Video Generation

📅 2025-07-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Diffusion Transformers (DiTs) achieve state-of-the-art performance in video generation but incur prohibitive computational overhead, hindering real-time inference on mobile devices. To address this, we propose an efficient DiT-based video generation framework tailored for mobile deployment. First, we design a lightweight variational autoencoder (VAE) to reduce latent-space dimensionality. Second, we introduce a knowledge-distillation-guided, sensitivity-aware, tri-level structured pruning strategy to jointly compress model parameters and FLOPs. Third, we propose an adversarial sampling-step distillation method tailored for DiT that reduces the number of denoising steps to four while preserving visual fidelity. Our framework achieves >10 FPS high-quality video generation on an iPhone 16 Pro Max, demonstrating the feasibility of real-time generative video synthesis under severe resource constraints.
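The latent-space compression from the lightweight VAE can be illustrated with a toy encoder. This is a minimal sketch only: the downsampling factors and latent channel count below are assumptions for illustration, not the paper's actual VAE configuration, and average pooling stands in for the learned encoder.

```python
import numpy as np

def encode(frames: np.ndarray, spatial: int = 8, temporal: int = 4,
           latent_channels: int = 16) -> np.ndarray:
    """Toy stand-in for the lightweight VAE encoder: average-pool space and
    time to show how latent dimensionality shrinks (factors are assumed)."""
    t, h, w, c = frames.shape
    latent = frames.reshape(t // temporal, temporal,
                            h // spatial, spatial,
                            w // spatial, spatial, c).mean(axis=(1, 3, 5))
    # Pretend the encoder expands the channel dim to `latent_channels`.
    return np.repeat(latent, latent_channels // c, axis=-1)

video = np.zeros((16, 256, 256, 4), dtype=np.float32)  # (T, H, W, C)
z = encode(video)
print(video.size / z.size)  # overall compression factor
```

The diffusion backbone then operates on `z` rather than raw pixels, which is where most of the FLOP savings from the compressed latent space come from.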

📝 Abstract
Diffusion Transformers (DiT) have shown strong performance in video generation tasks, but their high computational cost makes them impractical for resource-constrained devices such as smartphones, and real-time generation is even more challenging. In this work, we propose a series of novel optimizations that significantly accelerate video generation and enable real-time performance on mobile platforms. First, we employ a highly compressed variational autoencoder (VAE) to reduce the dimensionality of the input data without sacrificing visual quality. Second, we introduce a knowledge-distillation-guided (KD-guided), sensitivity-aware tri-level pruning strategy that shrinks the model to suit mobile platforms while preserving critical performance characteristics. Third, we develop an adversarial step distillation technique tailored for DiT, which allows us to reduce the number of inference steps to four. Combined, these optimizations enable our model to achieve over 10 frames per second (FPS) generation on an iPhone 16 Pro Max, demonstrating the feasibility of real-time, high-quality video generation on mobile devices.
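The core mechanic of sensitivity-aware structured pruning can be sketched as follows. This is an illustrative sketch, not the paper's method: channel L2 norm stands in for the (unspecified) sensitivity criterion, and the tri-level schedule and KD guidance are omitted.

```python
import numpy as np

def prune_channels(weight: np.ndarray, keep_ratio: float):
    """Structured pruning: drop entire output channels whose 'sensitivity'
    score is lowest. L2 norm is an assumed stand-in for the real score."""
    scores = np.linalg.norm(weight.reshape(weight.shape[0], -1), axis=1)
    n_keep = max(1, int(round(keep_ratio * weight.shape[0])))
    keep = np.sort(np.argsort(scores)[-n_keep:])  # surviving channel indices
    return weight[keep], keep

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 32))           # a layer with 64 output channels
pruned, kept = prune_channels(w, 0.5)   # keep the 50% most sensitive channels
print(pruned.shape)                     # (32, 32)
```

Because whole channels are removed (rather than individual weights being zeroed), both the parameter count and the FLOPs of the layer shrink, which is what makes structured pruning attractive for mobile hardware.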
Problem

Research questions and friction points this paper is trying to address.

Reduce computational cost for mobile video generation
Optimize model size for resource-constrained devices
Achieve real-time performance on smartphones
Innovation

Methods, ideas, or system contributions that make the work stand out.

Compressed VAE reduces input dimensionality efficiently
Tri-level pruning shrinks model for mobile platforms
Adversarial step distillation cuts inference steps to four
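The few-step inference that step distillation enables can be sketched with a toy sampler. Everything here is a hypothetical stand-in: `student_denoise` is a placeholder for the adversarially distilled DiT (here it just shrinks the sample toward zero), and the schedule is illustrative, not the paper's.

```python
import numpy as np

def student_denoise(x: np.ndarray, t: float) -> np.ndarray:
    """Hypothetical distilled denoiser: pulls the sample toward the data
    mean (zero here). In the paper this would be the distilled DiT."""
    return x * (1.0 - 0.5 * t)

def sample(shape=(4, 8), num_steps=4, seed=0) -> np.ndarray:
    """Few-step sampling: 4 denoising steps instead of the usual 25-50."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=shape)  # start from pure noise
    for t in np.linspace(1.0, 1.0 / num_steps, num_steps):
        x = student_denoise(x, t)  # one distilled denoising step
    return x

frames = sample()
print(frames.shape)  # (4, 8)
```

The latency win is roughly linear in the step count: cutting 25+ denoising passes through the DiT down to 4 is what makes the >10 FPS on-device figure plausible, provided the distilled student preserves the teacher's output quality.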