🤖 AI Summary
This work addresses the high computational overhead of Diffusion Transformers in real-world image super-resolution and the tendency of existing quantization methods to degrade local textures. The authors propose the first post-training quantization framework tailored to this task, combining a hierarchical SVD structure (H-SVD) with variance-aware spatiotemporal mixed-precision strategies (VaSMP/VaTMP) for data-free bit-width allocation and dynamic precision scheduling. Cross-layer weight bit-widths are optimized via rate-distortion theory, while activation precision across diffusion timesteps is planned by dynamic programming, improving both efficiency and fidelity. The approach achieves state-of-the-art performance under W4A6 and W4A4 configurations, with the W4A4 model yielding a 5.8× reduction in model size and over a 60× decrease in computational cost.
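The hierarchical decomposition described above, a global low-rank branch combined with a local block-wise rank-1 branch, can be sketched as follows. This is a minimal illustration of the general idea only; the function name, rank, and block size are hypothetical and do not reproduce the paper's actual H-SVD construction or its parameter-budget matching.

```python
import numpy as np

def hierarchical_svd_sketch(W, global_rank=8, block=32):
    """Illustrative two-branch approximation (hypothetical parameters):
    a global low-rank branch from truncated SVD, plus a per-tile rank-1
    branch fitted to the residual of each (block x block) tile."""
    # Global branch: keep the top `global_rank` singular directions of W.
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    W_global = (U[:, :global_rank] * S[:global_rank]) @ Vt[:global_rank]

    # Local branch: best rank-1 fit of the residual, tile by tile.
    R = W - W_global
    W_local = np.zeros_like(W)
    for i in range(0, W.shape[0], block):
        for j in range(0, W.shape[1], block):
            tile = R[i:i + block, j:j + block]
            u, s, vt = np.linalg.svd(tile, full_matrices=False)
            W_local[i:i + block, j:j + block] = s[0] * np.outer(u[:, 0], vt[0])
    return W_global, W_local
```

Adding the block-wise rank-1 residual branch strictly tightens the approximation relative to the global branch alone, which is the intuition behind preserving local texture detail.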
📝 Abstract
Recently, Diffusion Transformers (DiTs) have emerged in Real-World Image Super-Resolution (Real-ISR) to generate high-quality textures, yet their heavy inference burden hinders real-world deployment. While Post-Training Quantization (PTQ) is a promising route to acceleration, existing super-resolution methods mostly target U-Net architectures, whereas generic DiT quantization is typically designed for text-to-image tasks. Directly applying these methods to DiT-based super-resolution models severely degrades local textures. We therefore propose Q-DiT4SR, the first PTQ framework specifically tailored for DiT-based Real-ISR. We introduce H-SVD, a hierarchical SVD that integrates a global low-rank branch with a local block-wise rank-1 branch under a matched parameter budget. We further propose Variance-aware Spatio-Temporal Mixed Precision: VaSMP allocates cross-layer weight bit-widths in a data-free manner based on rate-distortion theory, while VaTMP schedules intra-layer activation precision across diffusion timesteps via dynamic programming (DP) with minimal calibration. Experiments on multiple real-world datasets demonstrate that Q-DiT4SR achieves state-of-the-art (SOTA) performance under both W4A6 and W4A4 settings. Notably, the W4A4 configuration reduces model size by 5.8$\times$ and computational operations by over 60$\times$. Our code and models will be available at https://github.com/xunzhang1128/Q-DiT4SR.
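The kind of DP-based precision scheduling that VaTMP performs across timesteps can be illustrated with a minimal knapsack-style sketch: given a per-timestep table of estimated distortion for each candidate bit-width, choose one bit-width per timestep so that the summed bit cost stays within a budget while total distortion is minimized. The function name, distortion tables, and budget below are invented for illustration and do not reproduce the paper's formulation.

```python
def schedule_timestep_bits(distortion, budget):
    """Hypothetical DP sketch: `distortion[t]` maps candidate bit-width ->
    estimated distortion at timestep t. Returns the per-timestep bit-width
    assignment minimizing total distortion subject to sum(bits) <= budget.
    Assumes the budget admits at least one feasible assignment."""
    INF = float("inf")
    # dp maps accumulated bit cost -> (best total distortion, chosen bits).
    dp = {0: (0.0, [])}
    for table in distortion:
        nxt = {}
        for cost, (dist_so_far, path) in dp.items():
            for bits, dist in table.items():
                c = cost + bits
                if c > budget:
                    continue
                cand = (dist_so_far + dist, path + [bits])
                if c not in nxt or cand[0] < nxt[c][0]:
                    nxt[c] = cand
        dp = nxt
    best_dist, best_bits = min(dp.values(), key=lambda x: x[0])
    return best_bits, best_dist
```

For example, with two timesteps offering 4-bit or 8-bit activations and a budget of 12 total bits, the schedule spends the higher precision on whichever timestep gains more from it.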