AI Summary
Diffusion models suffer from low inference efficiency due to iterative sampling; existing feature caching approaches rely on temporal extrapolation but struggle to accurately model complex feature dynamics, often compromising generation quality. To address this, we propose HiCache, a training-free inference acceleration framework. Leveraging the empirical observation that feature derivatives approximately follow a Gaussian distribution, HiCache employs theoretically optimal Hermite polynomial expansions to model feature evolution. A dual-scale mechanism is introduced to ensure both numerical stability and prediction accuracy. Furthermore, HiCache is specifically adapted to the Diffusion Transformer architecture for efficient feature prediction. Evaluated on FLUX.1-dev, HiCache achieves a 6.24× speedup while surpassing baseline generation quality. Its effectiveness and generalizability are further validated across text-to-image synthesis, video generation, and super-resolution tasks.
Abstract
Diffusion models have achieved remarkable success in content generation but suffer from prohibitive computational costs due to iterative sampling. While recent feature caching methods accelerate inference through temporal extrapolation, they still suffer from severe quality loss because they fail to model the complex dynamics of feature evolution. To solve this problem, this paper presents HiCache, a training-free acceleration framework that fundamentally improves feature prediction by aligning mathematical tools with empirical properties. Our key insight is that feature derivative approximations in Diffusion Transformers exhibit multivariate Gaussian characteristics, motivating the use of Hermite polynomials, the theoretically optimal basis for Gaussian-correlated processes. We further introduce a dual-scaling mechanism that ensures numerical stability while preserving predictive accuracy. Extensive experiments demonstrate HiCache's superiority: it achieves a 6.24× speedup on FLUX.1-dev while exceeding baseline quality, and maintains strong performance across text-to-image synthesis, video generation, and super-resolution tasks. Core implementation is provided in the appendix, with complete code to be released upon acceptance.
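To make the idea concrete, the sketch below illustrates Hermite-expansion-based feature prediction with a dual-scaling step: cached features from recent timesteps are fitted with probabilists' Hermite polynomials after timesteps are mapped to a well-conditioned interval and features are normalized in magnitude, then the fit is extrapolated to the next timestep. This is a minimal illustration under assumed conventions (function name, degree-2 expansion, the specific normalization), not the paper's exact implementation.

```python
import numpy as np
from numpy.polynomial import hermite_e as He  # probabilists' Hermite basis


def hermite_predict(times, feats, t_next, deg=2):
    """Predict the feature at t_next from features cached at earlier steps.

    times: (k,) cached timesteps; feats: (k, d) cached feature vectors.
    Fits a degree-`deg` Hermite (HermiteE) expansion per feature dimension
    and evaluates it at t_next. Requires k >= deg + 1 cached steps.
    """
    times = np.asarray(times, dtype=np.float64)
    feats = np.asarray(feats, dtype=np.float64)

    # Scale 1 (time axis): map cached timesteps into [-1, 1] so the
    # least-squares Hermite fit is numerically well conditioned.
    t0, t1 = times.min(), times.max()
    x = 2.0 * (times - t0) / (t1 - t0) - 1.0
    x_next = 2.0 * (t_next - t0) / (t1 - t0) - 1.0

    # Scale 2 (feature axis): normalize feature magnitude to unit scale
    # before fitting, then undo the scaling on the prediction.
    scale = np.abs(feats).max() + 1e-8

    # hermefit fits all d columns at once; coeffs has shape (deg + 1, d).
    coeffs = He.hermefit(x, feats / scale, deg)
    return He.hermeval(x_next, coeffs) * scale
```

In an actual caching loop, `feats` would hold the outputs of a Diffusion Transformer block at the last few fully computed timesteps, and the predicted vector would stand in for that block's output on skipped steps.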