ElasticDiT: Efficient Diffusion Transformers via Elastic Architecture and Sparse Attention for High-Resolution Image Generation on Mobile Devices

📅 2026-05-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Deploying high-resolution diffusion models on mobile devices often compromises image quality due to substantial computational and memory demands. This work proposes a dynamically tunable single-model architecture that jointly modulates spatial compression ratio and network depth, enabling a broad quality–latency trade-off within a unified model and eliminating the need for multiple deployments. Key innovations include an elastic architecture, Shift Sparse Block Attention (SSBA) achieving an average sparsity of 84.16%, a lightweight Tiny DWT-Distilled VAE (T-DVAE) that matches SD3-level reconstruction quality at one-eighth the computational cost, and Flow-GRPO optimization for semantic alignment. The Flex Lite variant attains an HPS of 32.87, surpassing FLUX, and Flow-GRPO improves the GenEval score from 66.93 to 73.62.

📝 Abstract

The Diffusion Transformer (DiT) architecture is the state-of-the-art paradigm for high-fidelity image generation, underpinning models such as Stable Diffusion 3 and FLUX.1. However, deploying these models on resource-constrained mobile devices entails prohibitive computational and memory overhead. While efficiency-driven approaches such as Linear-DiT and static pruning alleviate these bottlenecks, they often incur quality degradation. Unlike cloud environments, mobile devices require a single-model paradigm that dynamically balances fidelity and latency. We introduce ElasticDiT, which achieves this dynamic trade-off by adjusting spatial compression ratios and DiT block depths. By integrating Shift Sparse Block Attention (SSBA) and a Tiny DWT-Distilled VAE (T-DVAE), ElasticDiT reduces inference latency and memory footprint while maintaining image quality. Experiments confirm that ElasticDiT covers a wide range of fidelity-latency trade-offs within a single set of parameters. By jointly adjusting compression and depth, a single ElasticDiT model can be reconfigured on the fly to outperform task-specific baselines. Specifically, our Flex Lite variant achieves an HPS of 32.87, surpassing FLUX, while maintaining competitive quality at 84.16% average sparsity through SSBA. Furthermore, the plug-and-play T-DVAE provides SD3-level reconstruction at only 1/8 of the computational cost of standard VAEs, and Flow-GRPO boosts semantic alignment (GenEval: 66.93 to 73.62). These results demonstrate that ElasticDiT offers a versatile, hardware-adaptive solution that eliminates the need for multiple specialized models, providing a promising path for future high-resolution image generation on mobile devices.
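
To see why compression ratio, depth, and attention sparsity are useful knobs, the following is a rough back-of-the-envelope sketch, not the paper's method. It assumes that attention cost is proportional to the number of query-key pairs scored per layer, multiplied by the number of blocks run. The function names, the 1024x1024 resolution, and the compression ratios and depths in the usage lines are all hypothetical; the abstract does not state them. Only the 84.16% sparsity figure comes from the text.

```python
def token_count(height: int, width: int, compression: int) -> int:
    """Latent tokens when each image side is reduced by `compression` (hypothetical helper)."""
    if height % compression or width % compression:
        raise ValueError("image sides must be divisible by the compression ratio")
    return (height // compression) * (width // compression)


def attention_pairs(tokens: int, sparsity: float = 0.0) -> float:
    """Query-key pairs scored in one attention layer.

    Dense attention scores tokens**2 pairs; a sparsity of s keeps a (1 - s) fraction.
    """
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    return tokens * tokens * (1.0 - sparsity)


def relative_cost(height: int, width: int, compression: int, depth: int,
                  sparsity: float = 0.0) -> float:
    """Crude attention-only cost proxy: pairs per layer times number of blocks run."""
    return attention_pairs(token_count(height, width, compression), sparsity) * depth


if __name__ == "__main__":
    # Hypothetical settings, chosen only to illustrate the scaling.
    for compression, depth in [(16, 24), (32, 24), (32, 12)]:
        dense = relative_cost(1024, 1024, compression, depth)
        sparse = relative_cost(1024, 1024, compression, depth, sparsity=0.8416)
        print(f"compression={compression} depth={depth} dense={dense:.3g} sparse={sparse:.3g}")
```

Under these assumptions, doubling the spatial compression ratio cuts the token count by 4x and the dense attention pairs by 16x, halving the depth halves the total, and 84.16% sparsity leaves roughly 16% of the pairs. Real latency also depends on MLP layers, the VAE, and memory traffic, which this proxy ignores.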

Problem

Research questions and friction points this paper is trying to address.

Diffusion Transformer

mobile deployment

computational efficiency

memory overhead

fidelity-latency trade-off

Innovation

Methods, ideas, or system contributions that make the work stand out.

Elastic Architecture

Sparse Attention

Mobile Diffusion