SnapGen++: Unleashing Diffusion Transformers for Efficient High-Fidelity Image Generation on Edge Devices

📅 2026-01-13
📈 Citations: 1
Influential: 0
🤖 AI Summary
Although Diffusion Transformers (DiTs) achieve impressive performance in image generation, their high computational and memory demands hinder deployment on edge devices. To address this challenge, this work proposes an efficient DiT framework featuring three key innovations: an adaptive global-local sparse attention mechanism that reduces computational complexity, an elastic training strategy that jointly trains sub-DiTs of varying capacity within a unified supernetwork for dynamic model scaling, and Knowledge-Guided Distribution Matching Distillation (KG-DMD), a four-step generative approach that integrates distribution matching with knowledge distillation. The resulting framework produces high-fidelity images in just four steps across diverse edge hardware platforms, significantly improving inference efficiency while maintaining visual quality and striking a favorable balance between real-time performance and generation fidelity.
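The summary's exact attention pattern is not detailed on this page; a minimal sketch of one common way to combine a local window with a few global tokens (function name, window size, and the choice of global tokens are all illustrative, not the paper's adaptive mechanism):

```python
def global_local_mask(seq_len, window, global_tokens):
    """Boolean mask: mask[i][j] is True if query i may attend to key j.

    Combines a local band (|i - j| <= window) with a small set of
    global tokens that attend to, and are attended by, every position.
    Illustrative only; the paper's adaptive variant is not shown here.
    """
    g = set(global_tokens)
    return [
        [abs(i - j) <= window or i in g or j in g for j in range(seq_len)]
        for i in range(seq_len)
    ]
```

With a window of size w and a set G of global tokens, such a mask cuts attended pairs from O(L²) to roughly O(L·(w + |G|)), which is the kind of complexity reduction the summary describes.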

📝 Abstract
Recent advances in diffusion transformers (DiTs) have set new standards in image generation, yet remain impractical for on-device deployment due to their high computational and memory costs. In this work, we present an efficient DiT framework tailored for mobile and edge devices that achieves transformer-level generation quality under strict resource constraints. Our design combines three key components. First, we propose a compact DiT architecture with an adaptive global-local sparse attention mechanism that balances global context modeling and local detail preservation. Second, we propose an elastic training framework that jointly optimizes sub-DiTs of varying capacities within a unified supernetwork, allowing a single model to dynamically adjust for efficient inference across different hardware. Finally, we develop Knowledge-Guided Distribution Matching Distillation, a step-distillation pipeline that integrates the DMD objective with knowledge transfer from few-step teacher models, producing high-fidelity and low-latency generation (e.g., 4-step) suitable for real-time on-device use. Together, these contributions enable scalable, efficient, and high-quality diffusion models for deployment on diverse hardware.
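The abstract's KG-DMD pipeline is not specified beyond "4-step generation with a DMD objective"; the shape of such a pipeline can be sketched as follows (the `student`, `real_score`, and `fake_score` callables and the timestep values are placeholders, not the paper's actual schedule or losses):

```python
def four_step_sample(student, noise, timesteps=(1.0, 0.75, 0.5, 0.25)):
    """Run a distilled student generator for a fixed, small number of
    denoising steps. `student(x, t)` stands in for the few-step model."""
    x = noise
    for t in timesteps:
        x = student(x, t)
    return x

def dmd_direction(real_score, fake_score, x):
    """DMD-style update direction: move generated samples toward the
    real-data score and away from the generator-induced (fake) score.
    Scores are per-element lists in this toy version."""
    return [rs - fs for rs, fs in zip(real_score(x), fake_score(x))]
```

In distribution matching distillation, this score difference drives the generator update; the paper additionally transfers knowledge from few-step teacher models, which is not modeled in this sketch.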
Problem

Research questions and friction points this paper is trying to address.

diffusion transformers
edge devices
efficient deployment
high-fidelity image generation
resource constraints
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion Transformers
Sparse Attention
Elastic Training
Knowledge Distillation
Edge Deployment
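The elastic training listed above typically samples sub-networks of varying width from a shared supernetwork at each step; a toy sketch of prefix-slicing shared weights (function names and width ratios are illustrative assumptions):

```python
import random

def slice_linear(W, b, ratio):
    """Take the first `ratio` fraction of output units from a shared
    linear layer (W: out x in, b: out), so smaller sub-DiTs reuse a
    prefix of the full model's weights."""
    k = max(1, int(len(W) * ratio))
    return W[:k], b[:k]

def sample_width(ratios=(0.25, 0.5, 0.75, 1.0), rng=random):
    """Pick a sub-network width ratio for one elastic training step."""
    return rng.choice(ratios)
```

Training sub-DiTs of different widths against the same weights is what lets a single checkpoint adapt to different hardware budgets at inference time, as the abstract describes.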