SnapGen-V: Generating a Five-Second Video within Five Seconds on a Mobile Device

📅 2024-12-13
🏛️ arXiv.org
📈 Citations: 4
✨ Influential: 0
🤖 AI Summary
Video generation models rely heavily on cloud-based computational resources, making real-time deployment on mobile devices infeasible. Method: The authors propose an edge-oriented lightweight video diffusion model featuring a compact image backbone, a search over the design and placement of temporal layers to maximize hardware efficiency, a dedicated adversarial fine-tuning algorithm, and aggressive compression of the denoising schedule to only four steps. The resulting model contains just 0.6 billion parameters and is further optimized for mobile hardware. Contribution/Results: On an iPhone 16 Pro Max, it generates a 5-second video in under 5 seconds, matching the visual quality of state-of-the-art cloud-based large models while accelerating generation by orders of magnitude relative to server-side GPUs. This work marks a practical, high-fidelity transition of video generation from cloud to edge, enabling real-time, low-latency, privacy-preserving mobile content creation.
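To make the "four denoising steps" concrete, here is a toy Euler-style sampling loop over a short noise schedule. This is a minimal sketch: the `denoiser` stand-in and the sigma values are illustrative assumptions, not the paper's 0.6B-parameter network or its actual scheduler.

```python
import numpy as np

def denoiser(x, sigma):
    # Placeholder: a real model would predict the clean sample from the
    # noisy input x at noise level sigma. Here we just shrink toward zero.
    return x / (1.0 + sigma ** 2)

def sample(shape, sigmas, seed=0):
    """Euler-style sampling over a short sigma schedule."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape) * sigmas[0]   # start from pure noise
    for i in range(len(sigmas) - 1):
        denoised = denoiser(x, sigmas[i])
        d = (x - denoised) / sigmas[i]           # direction toward the denoised estimate
        x = x + d * (sigmas[i + 1] - sigmas[i])  # Euler step to the next noise level
    return x

# Only 4 denoising steps (5 sigma values, ending at 0), mirroring the
# paper's aggressive step compression; the schedule values are made up.
sigmas = [14.6, 3.0, 1.0, 0.3, 0.0]
frames = sample((4, 8, 8), sigmas)  # tiny toy "video": 4 frames of 8x8
print(frames.shape)
```

The point of the sketch is that with so few steps, each step must cover a large jump in noise level, which is why few-step models typically need distillation or adversarial fine-tuning to stay sharp.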

๐Ÿ“ Abstract
We have witnessed the unprecedented success of diffusion-based video generation over the past year. Recently proposed models from the community have wielded the power to generate cinematic and high-resolution videos with smooth motions from arbitrary input prompts. However, as a supertask of image generation, video generation models require more computation and are thus hosted mostly on cloud servers, limiting broader adoption among content creators. In this work, we propose a comprehensive acceleration framework to bring the power of the large-scale video diffusion model to the hands of edge users. From the network architecture scope, we initialize from a compact image backbone and search out the design and arrangement of temporal layers to maximize hardware efficiency. In addition, we propose a dedicated adversarial fine-tuning algorithm for our efficient model and reduce the denoising steps to 4. Our model, with only 0.6B parameters, can generate a 5-second video on an iPhone 16 PM within 5 seconds. Compared to server-side models that take minutes on powerful GPUs to generate a single video, we accelerate the generation by magnitudes while delivering on-par quality.
Problem

Research questions and friction points this paper is trying to address.

Accelerate video generation on mobile devices
Reduce computational cost of diffusion models
Maintain quality while minimizing denoising steps
Innovation

Methods, ideas, or system contributions that make the work stand out.

Compact image backbone with optimized temporal layers
Adversarial fine-tuning for efficient model performance
Reduced denoising steps to only 4 steps
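The hardware-aware temporal-layer search listed above can be illustrated with a toy latency-driven subset search. Everything here is a hypothetical stand-in: the block names, latency costs, quality proxies, and budget are invented for illustration and are not the authors' search space or cost model.

```python
import itertools

# Hypothetical U-Net-style blocks where temporal layers could be inserted,
# with made-up per-block latency costs (ms) and quality-gain proxies.
BLOCKS = ["down1", "down2", "mid", "up1", "up2"]
LATENCY_MS = {"down1": 3.0, "down2": 5.0, "mid": 9.0, "up1": 5.0, "up2": 3.0}
QUALITY_GAIN = {"down1": 0.2, "down2": 0.5, "mid": 1.0, "up1": 0.5, "up2": 0.2}
BUDGET_MS = 15.0  # assumed latency budget for all temporal layers combined

def best_placement():
    """Exhaustively pick the block subset maximizing quality within budget."""
    best, best_gain = (), -1.0
    for r in range(len(BLOCKS) + 1):
        for subset in itertools.combinations(BLOCKS, r):
            cost = sum(LATENCY_MS[b] for b in subset)
            gain = sum(QUALITY_GAIN[b] for b in subset)
            if cost <= BUDGET_MS and gain > best_gain:
                best, best_gain = subset, gain
    return best, best_gain

placement, gain = best_placement()
print(placement, gain)
```

A real architecture search would use learned importance scores and on-device latency measurements rather than exhaustive enumeration, but the objective has the same shape: maximize quality subject to a hardware latency budget.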