🤖 AI Summary
This work addresses the performance degradation of existing time-extrapolation-based feature caching methods for accelerating Diffusion Transformers (DiTs): such methods suffer from large prediction errors because output features change irregularly across timesteps. To overcome this limitation, the authors propose relational feature caching (RFC), the first framework to explicitly model the strong correlation between a module's inputs and outputs. By introducing Relational Feature Estimation (RFE) and Relational Cache Scheduling (RCS), the framework predicts output features more accurately and schedules feature reuse more reliably. This substantially reduces redundant computation and cache-induced errors, yielding significant inference speedups across various DiT architectures while preserving generation quality and outperforming current caching strategies.
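For context, the time-extrapolation baseline that these methods share can be summarized as a first-order extrapolation over cached outputs. The sketch below is an illustrative simplification, not code from the paper or any cited method:

```python
import torch

def extrapolate_output(y_prev: torch.Tensor, y_prev2: torch.Tensor) -> torch.Tensor:
    # First-order temporal extrapolation: y_t ≈ y_{t-1} + (y_{t-1} - y_{t-2}).
    # The extrapolation step is fixed, so the prediction cannot adapt when the
    # output features change irregularly between timesteps -- the error source
    # that the summary above identifies.
    return y_prev + (y_prev - y_prev2)
```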
📝 Abstract
Feature caching approaches accelerate diffusion transformers (DiTs) by storing the output features of computationally expensive modules at certain timesteps and exploiting them in subsequent steps to reduce redundant computation. Recent forecasting-based caching approaches employ temporal extrapolation to approximate the output features from cached ones. Although effective, relying exclusively on temporal extrapolation still incurs significant prediction errors, leading to performance degradation. Through a detailed analysis, we find that 1) these errors stem from the irregular magnitude of changes in the output features, and 2) a module's input feature is strongly correlated with the corresponding output. Based on this, we propose relational feature caching (RFC), a novel framework that leverages the input-output relationship to improve the accuracy of feature prediction. Specifically, we introduce relational feature estimation (RFE), which estimates the magnitude of changes in the output features from the inputs, enabling more accurate feature predictions. We also present relational cache scheduling (RCS), which estimates the prediction errors from the input features and performs full computation only when the errors are expected to be substantial. Extensive experiments across various DiT models demonstrate that RFC consistently and significantly outperforms prior approaches. The project page is available at https://cvlab.yonsei.ac.kr/projects/RFC.
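The abstract's two mechanisms can be made concrete with a short sketch. This is a minimal illustration under strong assumptions: the class name, the relative-input-shift error proxy, and the threshold `tau` are hypothetical stand-ins for RFE and RCS, not the authors' implementation.

```python
import torch
import torch.nn as nn


class RelationalFeatureCache:
    """Illustrative caching wrapper around one expensive module.

    On full steps it runs the module and caches (input, output, delta).
    On cached steps it predicts the output by reusing the cached change
    direction, with the magnitude estimated from the current input shift
    (an RFE-like proxy). If the input shift exceeds a threshold, it falls
    back to a full computation (an RCS-like schedule).
    """

    def __init__(self, module: nn.Module, tau: float = 0.1, eps: float = 1e-8):
        self.module = module
        self.tau = tau          # hypothetical error threshold for scheduling
        self.eps = eps
        self.prev_in = None     # input at the last full computation
        self.prev_out = None    # output at the last full computation
        self.prev_delta = None  # output change between the last two full steps

    def _full_step(self, x: torch.Tensor) -> torch.Tensor:
        y = self.module(x)
        if self.prev_out is not None:
            self.prev_delta = y - self.prev_out
        self.prev_in, self.prev_out = x, y
        return y

    @torch.no_grad()
    def __call__(self, x: torch.Tensor) -> torch.Tensor:
        # Warm-up: nothing cached yet, or no change direction to reuse.
        if self.prev_delta is None:
            return self._full_step(x)

        # RCS-like proxy: estimate the prediction error from the input alone,
        # as the relative input shift since the last full computation.
        shift = ((x - self.prev_in).norm() / (self.prev_in.norm() + self.eps)).item()
        if shift > self.tau:
            return self._full_step(x)  # expected error too large: recompute

        # RFE-like prediction: keep the cached change direction, but set its
        # magnitude from the input shift (output change assumed roughly
        # proportional to input change; a modeling assumption of this sketch).
        direction = self.prev_delta / (self.prev_delta.norm() + self.eps)
        return self.prev_out + shift * self.prev_out.norm() * direction


# Toy usage: wrap an expensive block and drive it over denoising timesteps.
block = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
cached_block = RelationalFeatureCache(block, tau=0.1)
x = torch.randn(1, 64)
for _ in range(50):
    x = x + 0.01 * torch.randn_like(x)  # stand-in for evolving features
    y = cached_block(x)                 # full compute or cached prediction
```

The design point this sketch tries to mirror is the paper's core observation: the cached change direction is cheap to reuse, while both the predicted magnitude and the recompute schedule are driven by the current input rather than by timestep position alone.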