🤖 AI Summary
Diffusion models suffer from high latency and memory overhead during inference, and existing post-training quantization (PTQ) methods overlook their timestep-sensitive nature, distorting denoising trajectories and degrading quantization accuracy. To address this, we propose the first timestep-aware PTQ framework: (i) a Temporal Information Block (TIB) explicitly models temporal dependencies; (ii) Temporal Information-aware Reconstruction (TIAR) and Finite Set Calibration (FSC) jointly optimize quantization parameters for that block; and (iii) a cache-based temporal feature maintenance mechanism, coupled with a disturbance-driven selection between the two strategies, ensures robust temporal fidelity. Extensive experiments across diverse diffusion architectures (e.g., DDPM, DDIM), datasets (CIFAR-10, ImageNet), and hardware platforms (GPU, edge accelerators) demonstrate substantial improvements in both quantization accuracy (+3.2–5.7 dB PSNR) and inference throughput (2.1–3.8× speedup) while preserving timestep-specific dynamics. End-to-end generation quality closely matches the floating-point baseline, achieving state-of-the-art temporal feature fidelity under 4-bit quantization.
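Finite Set Calibration exploits the fact that diffusion timesteps form a small, finite set, so quantization parameters can be calibrated separately per timestep instead of being shared across all of them. A minimal sketch of this idea, assuming a plain asymmetric uniform quantizer and a simple min/max calibration (the function names and the calibration rule here are illustrative, not the paper's exact procedure):

```python
import numpy as np

def uniform_quantize(x, scale, zero, bits=4):
    # Asymmetric uniform quantizer: snap to the integer grid, then dequantize.
    q = np.clip(np.round(x / scale) + zero, 0, 2**bits - 1)
    return (q - zero) * scale

def calibrate_per_timestep(features_by_t, bits=4):
    # Finite-set calibration (sketch): the timesteps form a finite set, so a
    # separate (scale, zero-point) pair is fitted to each timestep's
    # activation statistics rather than one pair for all timesteps.
    params = {}
    for t, x in features_by_t.items():
        scale = (x.max() - x.min()) / (2**bits - 1)
        zero = -np.round(x.min() / scale)
        params[t] = (scale, zero)
    return params
```

Because the temporal features change sharply across timesteps, a single shared scale would over- or under-cover most of them; per-timestep parameters keep the rounding error bounded by one quantization step at every timestep.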
📝 Abstract
Diffusion models, widely used for image generation, face significant barriers to broad deployment due to prolonged inference times and high memory demands. Efficient Post-Training Quantization (PTQ) is crucial to address these issues. However, unlike traditional models, diffusion models critically rely on the time step for multi-round denoising. Typically, each time step is encoded into a hypersensitive temporal feature by several modules. Existing PTQ methods, however, do not optimize these modules individually; instead, they employ unsuitable reconstruction objectives and complex calibration methods, causing significant disturbance to the temporal features and the denoising trajectory, as well as reduced compression efficiency. To address these challenges, we introduce a novel quantization framework with three strategies: 1) TIB-based Maintenance: based on our Temporal Information Block (TIB) definition, Temporal Information-aware Reconstruction (TIAR) and Finite Set Calibration (FSC) are developed to efficiently align quantized temporal features with the original ones. 2) Cache-based Maintenance: instead of indirect and complex optimization of the related modules, quantized counterparts of the temporal features are pre-computed and cached to minimize errors. 3) Disturbance-aware Selection: temporal feature errors guide a fine-grained, per-timestep choice between the two maintenance strategies for further disturbance reduction. This framework preserves most of the temporal information and ensures high-quality end-to-end generation. Extensive testing on various datasets, diffusion models, and hardware confirms our superior performance and acceleration.
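The cache-based maintenance and disturbance-aware selection strategies above can be sketched as follows. Since the timestep set is finite, the quantized temporal feature for every timestep can be pre-computed once and looked up at inference; per timestep, the strategy whose output deviates least from the full-precision feature is then selected. The embedding, quantizer, and module output below are simplified stand-ins for the paper's actual components:

```python
import numpy as np

def temporal_embedding(t, dim=8):
    # Stand-in for the sinusoidal timestep embedding used by diffusion UNets.
    freqs = np.exp(-np.arange(dim // 2) / (dim // 2))
    return np.concatenate([np.sin(t * freqs), np.cos(t * freqs)])

def quantize(x, bits=4):
    # Simple uniform quantizer (illustrative; the paper's calibration differs).
    scale = (x.max() - x.min()) / (2**bits - 1) or 1.0
    return np.round((x - x.min()) / scale) * scale + x.min()

# Cache-based maintenance: timesteps form a finite set, so quantized
# temporal features are pre-computed once and looked up at inference.
T = 1000
cache = {t: quantize(temporal_embedding(t)) for t in range(T)}

def select(t, quant_module_out):
    # Disturbance-aware selection: per timestep, keep whichever strategy
    # (cached feature vs. quantized-module output) is closer to full precision.
    fp = temporal_embedding(t)
    err_cache = np.abs(cache[t] - fp).mean()
    err_module = np.abs(quant_module_out - fp).mean()
    return cache[t] if err_cache <= err_module else quant_module_out
```

The lookup replaces any runtime computation through the quantized embedding modules, so it removes their quantization error entirely at the cost of a small per-timestep cache.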