AI Summary
Existing acceleration methods for diffusion models often compromise classification capability when compressing computation, struggling to balance generation quality and discriminative performance. This work proposes BiGain, a training-free, plug-and-play framework that, for the first time, jointly enhances both generative and classification performance in accelerated diffusion. At its core lies a spectrum-aware dual-operator compression mechanism: Laplacian gating restricts token merging to spectrally smooth regions, while interpolation-extrapolation-based KV downsampling keeps queries intact; together, the two operators preserve high-frequency detail and low- and mid-frequency semantics. On ImageNet-1K, BiGain achieves a 7.15% absolute improvement in classification accuracy and a 0.34 reduction in FID (a 1.85% relative improvement) under a 70% token compression rate, significantly advancing the speed-accuracy trade-off.
Abstract
Acceleration methods for diffusion models (e.g., token merging or downsampling) typically optimize synthesis quality under reduced compute, yet often ignore discriminative capacity. We revisit token compression with a joint objective and present BiGain, a training-free, plug-and-play framework that preserves generation quality while improving classification in accelerated diffusion models. Our key insight is frequency separation: mapping feature-space signals into a frequency-aware representation disentangles fine detail from global semantics, enabling compression that respects both generative fidelity and discriminative utility. BiGain realizes this principle with two frequency-aware operators: (1) Laplacian-gated token merging, which encourages merges among spectrally smooth tokens while discouraging merges of high-contrast tokens, thereby retaining edges and textures; and (2) Interpolate-Extrapolate KV Downsampling, which downsamples keys/values via a controllable interpolation-extrapolation between nearest and average pooling while keeping queries intact, thereby preserving attention precision. Across DiT- and U-Net-based backbones on ImageNet-1K, ImageNet-100, Oxford-IIIT Pets, and COCO-2017, our operators consistently improve the speed-accuracy trade-off for diffusion-based classification while maintaining or enhancing generation quality under comparable acceleration. For instance, on ImageNet-1K with 70% token merging on Stable Diffusion 2.0, BiGain increases classification accuracy by 7.15% (absolute) while improving FID by 0.34 (1.85% relative). Our analyses indicate that balanced spectral retention (preserving high-frequency detail alongside low- and mid-frequency semantics) is a reliable design rule for token compression in diffusion models. To our knowledge, BiGain is the first framework to jointly study and advance both generation and classification under accelerated diffusion, supporting lower-cost deployment.
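To make the first operator concrete, below is a minimal PyTorch sketch of Laplacian-gated token merging as we read it from the abstract; it is not the authors' released code. The helper names (`laplacian_energy`, `laplacian_gated_merge`) and the specific merge policy (averaging each smooth token into its most similar kept token, in the spirit of ToMe-style merging) are our assumptions; only the gating rule itself, merging spectrally smooth tokens while protecting high-contrast ones, comes from the text.

```python
# Hypothetical sketch of Laplacian-gated token merging (not the authors' code).
# Assumes tokens come from a square latent grid, as in DiT / Stable Diffusion
# attention blocks; `merge_ratio` mirrors the 70% compression rate in the paper.
import torch
import torch.nn.functional as F

def laplacian_energy(x: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """Per-token high-frequency energy via a discrete Laplacian.

    x: (B, N, C) token features with N == h * w.
    Returns: (B, N) mean absolute Laplacian response per token.
    """
    B, N, C = x.shape
    grid = x.transpose(1, 2).reshape(B, C, h, w)            # tokens -> feature map
    kernel = torch.tensor([[0., 1., 0.],
                           [1., -4., 1.],
                           [0., 1., 0.]], device=x.device, dtype=x.dtype)
    kernel = kernel.view(1, 1, 3, 3).repeat(C, 1, 1, 1)     # one filter per channel
    lap = F.conv2d(grid, kernel, padding=1, groups=C)       # channelwise Laplacian
    return lap.abs().mean(dim=1).reshape(B, N)

def laplacian_gated_merge(x: torch.Tensor, h: int, w: int, merge_ratio: float = 0.7):
    """Merge only spectrally smooth tokens; keep high-contrast tokens intact."""
    B, N, C = x.shape
    energy = laplacian_energy(x, h, w)
    n_merge = int(N * merge_ratio)
    # Gate: the smoothest tokens (lowest Laplacian energy) are merge candidates;
    # high-energy tokens (edges, textures) are never merged away.
    order = energy.argsort(dim=1)
    merge_idx, keep_idx = order[:, :n_merge], order[:, n_merge:]
    merge_tok = torch.gather(x, 1, merge_idx.unsqueeze(-1).expand(-1, -1, C))
    keep_tok = torch.gather(x, 1, keep_idx.unsqueeze(-1).expand(-1, -1, C))
    # Assumed policy: pair each smooth token with its most similar kept token
    # by cosine similarity, then average it in.
    sim = F.normalize(merge_tok, dim=-1) @ F.normalize(keep_tok, dim=-1).transpose(1, 2)
    dst = sim.argmax(dim=-1)                                # (B, n_merge)
    out = keep_tok.clone()
    count = torch.ones(B, keep_tok.shape[1], 1, device=x.device, dtype=x.dtype)
    out.scatter_add_(1, dst.unsqueeze(-1).expand(-1, -1, C), merge_tok)
    count.scatter_add_(1, dst.unsqueeze(-1),
                       torch.ones(B, n_merge, 1, device=x.device, dtype=x.dtype))
    return out / count                                      # (B, N - n_merge, C)
```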
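The second operator can be sketched similarly. The following is our reading of Interpolate-Extrapolate KV Downsampling, under the assumption that keys and values live on an h-by-w token grid whose sides divide evenly by the pooling stride; the function name `interextrapolate_kv` and the default `alpha` are illustrative, not from the paper.

```python
# Hypothetical sketch of interpolate-extrapolate KV downsampling (our reading
# of the abstract, not the released implementation). `alpha` blends nearest
# and average pooling: values in [0, 1] interpolate between the two, while
# values outside that range extrapolate beyond either endpoint.
import torch
import torch.nn.functional as F

def interextrapolate_kv(k: torch.Tensor, v: torch.Tensor, h: int, w: int,
                        stride: int = 2, alpha: float = 1.25):
    """Downsample keys/values over the token grid; queries stay at full length.

    k, v: (B, N, C) with N == h * w (h and w divisible by stride).
    Returns downsampled keys and values of shape (B, N / stride**2, C).
    """
    def pool(t: torch.Tensor) -> torch.Tensor:
        B, N, C = t.shape
        grid = t.transpose(1, 2).reshape(B, C, h, w)
        nearest = grid[:, :, ::stride, ::stride]           # one representative per window
        average = F.avg_pool2d(grid, kernel_size=stride)   # window mean
        blended = (1.0 - alpha) * nearest + alpha * average
        return blended.flatten(2).transpose(1, 2)          # back to (B, N', C)

    return pool(k), pool(v)
```

Because queries are left untouched, attention computed as `softmax(q @ k_down.T / sqrt(C)) @ v_down` still produces one output per original token, so spatial resolution is preserved while the attention cost shrinks by a factor of roughly stride squared.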