🤖 AI Summary
Video generation models heavily rely on cloud-based computational resources, making real-time deployment on mobile devices infeasible. Method: We propose the first edge-oriented lightweight video diffusion model, featuring a compact image backbone network, a learnable temporal layer architecture search mechanism, an adversarial fine-tuning algorithm, and aggressive denoising step compression to only four steps. The model contains just 0.6 billion parameters and is further optimized for mobile hardware via TensorRT. Contribution/Results: On an iPhone 16 Pro Max, it generates 5-second 1080p videos in under 5 seconds, matching the visual quality of state-of-the-art cloud-based large models while achieving over 100× faster inference than GPU servers. This work marks the first practical, high-fidelity transition of video generation from cloud to edge, enabling real-time, low-latency, privacy-preserving mobile content creation.
Abstract
We have witnessed the unprecedented success of diffusion-based video generation over the past year. Recently proposed models from the community have wielded the power to generate cinematic, high-resolution videos with smooth motion from arbitrary input prompts. However, as a task subsuming image generation, video generation requires far more computation, so these models are hosted mostly on cloud servers, limiting broader adoption among content creators. In this work, we propose a comprehensive acceleration framework to bring the power of large-scale video diffusion models to the hands of edge users. On the network architecture side, we initialize from a compact image backbone and search for the design and arrangement of temporal layers that maximize hardware efficiency. In addition, we propose a dedicated adversarial fine-tuning algorithm for our efficient model and reduce the number of denoising steps to 4. Our model, with only 0.6B parameters, can generate a 5-second video on an iPhone 16 Pro Max within 5 seconds. Compared to server-side models that take minutes on powerful GPUs to generate a single video, we accelerate generation by orders of magnitude while delivering on-par quality.
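The step compression described above can be illustrated with a minimal sketch of few-step diffusion sampling. This is a hypothetical toy, not the paper's code: `toy_denoiser` is a stand-in for the 0.6B-parameter video model, the four-entry noise schedule is an assumption, and the "signal" is a plain list of floats rather than video frames. The point is only the control flow: starting from pure noise and running just 4 denoising steps instead of the tens of steps typical for diffusion samplers.

```python
import random

def toy_denoiser(x, sigma):
    # Hypothetical stand-in for the learned denoiser: it pulls each value
    # toward a fixed "clean" target (1.0), more strongly at high noise levels.
    return [v + (1.0 - v) * (1.0 - sigma) for v in x]

def sample(num_steps=4, dim=8, seed=0):
    rng = random.Random(seed)
    # Start from pure Gaussian noise, as in standard diffusion sampling.
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    # An assumed decreasing noise schedule compressed to num_steps entries.
    sigmas = [0.8, 0.6, 0.4, 0.2][:num_steps]
    for sigma in sigmas:
        x = toy_denoiser(x, sigma)
    return x

frames = sample()
```

After the four steps, every value has contracted close to the clean target, since each step shrinks the residual noise by a factor of `sigma`; a real sampler would instead call the network once per step, which is why cutting the step count from tens to 4 translates almost directly into wall-clock speedup.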