🤖 AI Summary
This work addresses the computational bottleneck in context-aware generation with Diffusion Transformers (DiT), where concatenating reference images leads to excessively long input sequences. Existing token compression methods overlook the inherent asymmetry between reference and target tokens in their spatial, temporal, and functional roles. To resolve this, the authors propose ToPi, a training-free token pruning framework that identifies quality-sensitive attention layers via offline calibration, evaluates the influence of each context token at those layers, and employs a temporally adaptive strategy to adjust pruning decisions as denoising progresses. As the first non-uniform pruning approach tailored to DiT-based context generation, ToPi explicitly models the role disparity between reference and target tokens, achieving over 30% inference speedup while preserving structural fidelity and visual consistency.
📝 Abstract
In-context generation significantly enhances Diffusion Transformers (DiTs) by enabling controllable image-to-image generation through reference examples. However, the resulting input concatenation drastically increases sequence length, creating a substantial computational bottleneck. Existing token reduction techniques, primarily tailored for text-to-image synthesis, fall short in this paradigm because they apply uniform reduction strategies, overlooking the inherent role asymmetry between reference contexts and target latents across spatial, temporal, and functional dimensions. To bridge this gap, we introduce ToPi, a training-free token pruning framework tailored for in-context generation in DiTs. Specifically, ToPi uses offline calibration-driven sensitivity analysis to identify pivotal attention layers, which serve as a robust proxy for redundancy estimation. Leveraging these layers, we derive a novel influence metric that quantifies the contribution of each context token for selective pruning, coupled with a temporal update strategy that adapts to the evolving diffusion trajectory. Empirical evaluations demonstrate that ToPi achieves over 30% inference speedup while maintaining structural fidelity and visual consistency across complex image generation tasks.
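To make the core idea concrete, here is a minimal sketch of attention-based influence scoring and pruning for context (reference) tokens. This is an illustrative reconstruction, not the paper's actual metric: the function names, the tensor layout (heads, target queries, then context keys before target keys), and the choice of averaging attention over heads and target queries are all assumptions for demonstration.

```python
import numpy as np

def context_token_influence(attn, n_ctx):
    """Score each context token by the mean attention it receives from
    target-token queries, averaged over heads.

    attn:  attention weights of shape (heads, n_target, n_ctx + n_target),
           where the first n_ctx key positions are the context tokens
           (layout is an assumption for this sketch).
    Returns an array of shape (n_ctx,) with one influence score per token.
    """
    # Keep only attention flowing from target queries into context keys.
    a = attn[:, :, :n_ctx]            # (heads, n_target, n_ctx)
    return a.mean(axis=(0, 1))        # average over heads and queries

def prune_context_tokens(attn, n_ctx, keep_ratio=0.7):
    """Return the sorted indices of context tokens to keep, retaining the
    highest-influence keep_ratio fraction (hypothetical policy)."""
    scores = context_token_influence(attn, n_ctx)
    n_keep = max(1, int(round(keep_ratio * n_ctx)))
    keep = np.argsort(scores)[::-1][:n_keep]   # top-scoring tokens
    return np.sort(keep)                       # preserve original order
```

In the actual method, scoring would be restricted to the calibration-selected sensitive layers and the kept set would be refreshed across diffusion timesteps by the temporal update strategy; this sketch shows only the per-layer influence-and-prune step.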