🤖 AI Summary
Controllable generation models suffer from low inference efficiency because control conditions and content synthesis must be processed jointly. To address this, we propose a hybrid-granularity caching mechanism that combines block-level coarse-grained caching (reusing intermediate features) with prompt-level fine-grained caching (reusing cross-attention maps) within encoder-decoder architectures, enabling computation skipping and feature reuse across inference steps. The method requires no architectural modifications or retraining and is compatible with diverse control modalities. Evaluated on four benchmarks including COCO-Stuff, it reduces MACs by 63% (from 18.22T to 6.70T) while incurring ≤1.5% degradation in semantic fidelity, significantly improving the efficiency-quality trade-off. The core contribution is the first application of multi-granularity caching to controllable generation inference, enabling efficient, near-lossless real-time generation.
📝 Abstract
Controllable generative models are widely used to improve the realism of synthetic visual content. However, such models must handle the computational demands of both control conditioning and content generation, which makes inference generally slow. To address this issue, we propose a Hybrid-Grained Cache (HGC) approach that reduces computational overhead by adopting cache strategies of different granularities at different computational stages. Specifically, (1) we use a coarse-grained, block-level cache based on feature reuse to dynamically bypass redundant computations in encoder and decoder blocks between consecutive inference steps; and (2) we design a fine-grained, prompt-level cache that acts within a module: it reuses cross-attention maps across consecutive inference steps and extends them to the corresponding module computations of adjacent steps. These caches of different granularities integrate seamlessly into each stage of the controllable generation pipeline. We verify the effectiveness of HGC on four benchmark datasets, with particular attention to its advantage in balancing generation efficiency and visual quality. For example, on the COCO-Stuff segmentation benchmark, HGC significantly reduces computational cost (MACs) by 63% (from 18.22T to 6.70T) while keeping the loss of semantic fidelity (measured performance degradation) within 1.5%.
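The two cache granularities described above can be sketched in a toy form. This is an illustrative reconstruction under stated assumptions, not the paper's actual implementation: the class, method names, the relative-change threshold for the block-level cache, and the fixed refresh interval for the attention cache are all our own placeholders.

```python
import numpy as np

class HybridGrainedCache:
    """Toy sketch of hybrid-granularity caching (illustrative names, not the paper's API)."""

    def __init__(self, block_tol=0.05, attn_refresh=2):
        self.block_tol = block_tol        # coarse: reuse a block's output if its input barely changed
        self.attn_refresh = attn_refresh  # fine: recompute the cross-attention map every N steps
        self.block_in, self.block_out = {}, {}
        self.attn_map = None
        self.skips = 0                    # counts bypassed block computations

    def run_block(self, name, x, fn):
        """Coarse-grained (block-level) cache: skip fn(x) when x is close to the cached input."""
        prev = self.block_in.get(name)
        if prev is not None:
            rel_change = np.linalg.norm(x - prev) / (np.linalg.norm(prev) + 1e-8)
            if rel_change < self.block_tol:
                self.skips += 1
                return self.block_out[name]   # reuse the cached block output
        y = fn(x)
        self.block_in[name], self.block_out[name] = x.copy(), y
        return y

    def cross_attention(self, step, q, k):
        """Fine-grained (prompt-level) cache: reuse the attention map across adjacent steps."""
        if self.attn_map is None or step % self.attn_refresh == 0:
            logits = q @ k.T / np.sqrt(k.shape[1])
            e = np.exp(logits - logits.max(axis=-1, keepdims=True))
            self.attn_map = e / e.sum(axis=-1, keepdims=True)   # row-wise softmax
        return self.attn_map

# Usage: a fake iterative denoiser whose features drift slowly between steps,
# so the block-level cache bypasses the redundant recomputations.
rng = np.random.default_rng(0)
cache = HybridGrainedCache()
x = rng.normal(size=(8, 16))          # stand-in latent features
prompt_k = rng.normal(size=(4, 16))   # stand-in prompt/condition keys
block = lambda h: np.tanh(h)          # stand-in encoder/decoder block
for step in range(6):
    h = cache.run_block("enc0", x, block)
    attn = cache.cross_attention(step, h, prompt_k)
    x = x + 0.001 * rng.normal(size=x.shape)   # small inter-step drift
```

Because the inter-step drift stays well below the reuse threshold, every block call after the first is served from the cache, while the attention map is only refreshed on the configured schedule, mirroring the coarse/fine split described in the abstract.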