🤖 AI Summary
To address the high inference overhead and inefficient feature caching of diffusion Transformers (DiTs), this paper proposes a training-inference co-optimized caching framework, HarmoniCa, that overcomes two key bottlenecks: temporal discontinuity and objective misalignment. It introduces Step-Wise Denoising Training (SDT) to enforce temporal consistency throughout the denoising process and, as a second contribution, an Image Error Proxy-Guided Objective (IEPO) that jointly optimizes image fidelity and cache utilization. Combining learnable feature caching with an image-free supervision strategy, the paper evaluates the framework across eight DiT models, four samplers, and resolutions ranging from 256×256 to 2K. Results show that the method reduces PixArt-α inference latency by over 40% (a theoretical 2.07× speedup) and cuts training time by 25%, significantly improving end-to-end efficiency.
Abstract
Diffusion Transformers (DiTs) excel in generative tasks but face practical deployment challenges due to high inference costs. Feature caching, which stores and retrieves redundant computations across timesteps, offers the potential for acceleration. Existing learning-based caching, though adaptive, overlooks the impact of the prior timestep. It also suffers from misaligned objectives between training and inference: training aligns the predicted noise, whereas inference targets high-quality images. These two discrepancies compromise both performance and efficiency. To this end, we harmonize training and inference with a novel learning-based caching framework dubbed HarmoniCa. It first incorporates Step-Wise Denoising Training (SDT) to ensure the continuity of the denoising process, so that prior steps can be leveraged. In addition, an Image Error Proxy-Guided Objective (IEPO) balances image quality against cache utilization through an efficient proxy that approximates the image error. Extensive experiments across $8$ models, $4$ samplers, and resolutions from $256\times256$ to $2K$ demonstrate the superior performance and speedup of our framework. For instance, it achieves over $40\%$ latency reduction (i.e., a $2.07\times$ theoretical speedup) and improved performance on PixArt-$\alpha$. Remarkably, our image-free approach reduces training time by $25\%$ compared with the previous method.
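To make the idea of learnable feature caching concrete, here is a minimal PyTorch sketch. It is not the paper's implementation: the class and attribute names (`CachedBlock`, `router`) are illustrative, and the per-step learnable logit that gates cache reuse is a simplifying assumption standing in for the framework's learned caching policy.

```python
import torch

class CachedBlock(torch.nn.Module):
    """Illustrative sketch: learnable feature caching for one DiT block.

    A learnable per-timestep logit decides whether to reuse the block
    output cached at a previous denoising step or to recompute it.
    Names and the gating scheme are assumptions, not the paper's code.
    """

    def __init__(self, block: torch.nn.Module, num_steps: int):
        super().__init__()
        self.block = block
        # One learnable logit per denoising step; > 0 means "reuse cache".
        self.router = torch.nn.Parameter(torch.zeros(num_steps))
        self.cache = None  # holds the most recent block output

    def forward(self, x: torch.Tensor, step: int) -> torch.Tensor:
        reuse = self.cache is not None and self.router[step] > 0
        if not reuse:
            # Recompute the expensive block and refresh the cache.
            self.cache = self.block(x)
        return self.cache
```

At inference, a step whose logit is positive skips the block entirely and returns the cached features, which is where the latency savings come from; during training, the logits (here a hard threshold for brevity, whereas a trainable policy would need a differentiable relaxation) would be optimized jointly with image quality.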