HarmoniCa: Harmonizing Training and Inference for Better Feature Caching in Diffusion Transformer Acceleration

📅 2024-10-02
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
To address the high inference overhead and inefficient feature caching in diffusion Transformers (DiTs), this paper proposes a training-inference co-optimized caching framework that overcomes two key bottlenecks: temporal discontinuity and objective misalignment. It introduces Step-Wise Denoising Training (SDT) to enforce temporal consistency throughout the denoising process, and proposes an Image Error Proxy-Guided Objective (IEPO) that jointly optimizes image fidelity and cache utilization. Combining learnable feature caching with an image-free supervision strategy, the authors evaluate across eight DiT models, four samplers, and resolutions ranging from 256×256 to 2K. Results show that the method reduces PixArt-α inference latency by over 40% (a theoretical 2.07× speedup) and cuts training time by 25%, significantly improving end-to-end efficiency.

Technology Category

Application Category

๐Ÿ“ Abstract
Diffusion Transformers (DiTs) excel in generative tasks but face practical deployment challenges due to high inference costs. Feature caching, which stores and retrieves redundant computations, offers potential for acceleration. Existing learning-based caching, though adaptive, overlooks the impact of the prior timestep. It also suffers from misaligned objectives between training (aligning predicted noise) and inference (generating high-quality images). These two discrepancies compromise both performance and efficiency. To this end, we harmonize training and inference with a novel learning-based caching framework dubbed HarmoniCa. It first incorporates Step-Wise Denoising Training (SDT) to ensure the continuity of the denoising process, where prior steps can be leveraged. In addition, an Image Error Proxy-Guided Objective (IEPO) is applied to balance image quality against cache utilization through an efficient proxy that approximates the image error. Extensive experiments across $8$ models, $4$ samplers, and resolutions from $256\times256$ to $2K$ demonstrate the superior performance and speedup of our framework. For instance, it achieves over $40\%$ latency reduction (i.e., $2.07\times$ theoretical speedup) and improved performance on PixArt-$\alpha$. Remarkably, our image-free approach reduces training time by $25\%$ compared with the previous method.
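The feature-caching idea in the abstract can be sketched as a denoising loop with a per-step cache schedule: at cached steps, the expensive transformer computation is skipped and the previous step's features are reused. This is a minimal illustrative sketch, not the paper's method; `dit_block`, the update rule, and the fixed boolean schedule are assumptions standing in for HarmoniCa's trained router.

```python
import numpy as np

rng = np.random.default_rng(0)

dim, steps = 16, 10
W = rng.standard_normal((dim, dim)) / np.sqrt(dim)

def dit_block(x):
    """Stand-in for an expensive DiT transformer block (hypothetical)."""
    return np.tanh(x @ W)

# Per-step cache schedule: True = reuse cached features, False = recompute.
# In a learning-based caching framework this schedule would be learned;
# here it is a fixed illustrative pattern (assumption).
use_cache = [False, True, False, True, True, False, True, True, False, True]

x = rng.standard_normal(dim)
cache = None
compute_count = 0
for t in range(steps):
    if use_cache[t] and cache is not None:
        feats = cache              # reuse cached features, skip the block
    else:
        feats = dit_block(x)       # full forward pass
        cache = feats
        compute_count += 1
    x = x + 0.1 * feats            # toy denoising update

# Theoretical speedup = total steps / steps actually computed
speedup = steps / compute_count
print(f"blocks computed: {compute_count}/{steps}, theoretical speedup {speedup:.2f}x")
```

With 4 of 10 steps recomputed, the loop reports a 2.50× theoretical speedup; the trade-off the paper's IEPO objective targets is exactly this cache-utilization ratio versus the image error that accumulates from reusing stale features.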
Problem

Research questions and friction points this paper is trying to address.

Diffusion Transformers
Computational Efficiency
Feature Caching
Innovation

Methods, ideas, or system contributions that make the work stand out.

HarmoniCa
Step-Wise Denoising Training
Image Error Proxy-Guided Objective
🔎 Similar Papers
No similar papers found.