CoReDiT: Spatial Coherence-Guided Token Pruning and Reconstruction for Efficient Diffusion Transformers

📅 2026-05-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high computational cost of Diffusion Transformers (DiTs), which hinders deployment on edge devices and limits high-resolution generation. The authors propose CoReDiT, a structured token pruning framework that identifies redundant tokens with a linear-time spatial coherence score and reconstructs their skipped attention outputs by coherence-guided aggregation of spatially neighboring retained tokens. Combined with a progressive, block-adaptive pruning schedule, the method achieves up to a 55% reduction in self-attention FLOPs on backbones including PixArt-α and MagicDrive-V2, with inference speedups of 1.33× on cloud GPUs and 1.72× on mobile NPUs, while maintaining high visual quality and increasing on-device memory headroom.

📝 Abstract

Diffusion Transformers (DiTs) deliver remarkable image and video generation quality but incur high computational cost, limiting scalability and on-device deployment. We introduce CoReDiT, a structured token pruning framework for DiTs across vision tasks. CoReDiT uses a linear-time spatial coherence score to estimate local redundancy in the latent token lattice and skips high-coherence (redundant) tokens in self-attention. To maintain a dense representation and avoid visual discontinuities, we reconstruct skipped attention outputs via coherence-guided aggregation of spatially neighboring retained tokens. We further introduce a progressive, block-adaptive pruning schedule that increases pruning gradually and allocates larger budgets to blocks and denoising steps with higher redundancy. Across state-of-the-art diffusion backbones including PixArt-α and MagicDrive-V2, CoReDiT achieves up to 55% self-attention FLOPs reduction and inference speedups of 1.33× on cloud GPUs and 1.72× on mobile NPUs, while maintaining high visual quality. Notably, CoReDiT also increases on-device memory headroom, enabling higher-resolution generation.
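
The abstract describes the pipeline at a high level without giving formulas, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes the coherence score is the mean cosine similarity between a token and its 4-connected lattice neighbors (linear in the number of tokens for a fixed neighborhood), that pruning is limited to one checkerboard parity so every skipped token keeps retained neighbors, and that skipped outputs are rebuilt as a similarity-weighted average of neighboring attention outputs. All function names (`coherence_scores`, `select_pruned`, `reconstruct`) and these design choices are hypothetical.

```python
import math
from typing import Dict, List, Set, Tuple

Vec = List[float]
Pos = Tuple[int, int]  # (row, col) position in the latent token lattice


def cosine(a: Vec, b: Vec) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def neighbors(pos: Pos, h: int, w: int) -> List[Pos]:
    """In-bounds 4-connected neighbors of a lattice position."""
    r, c = pos
    cand = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [(i, j) for i, j in cand if 0 <= i < h and 0 <= j < w]


def coherence_scores(tokens: List[List[Vec]]) -> Dict[Pos, float]:
    """Assumed score: mean cosine similarity to the 4-neighborhood."""
    h, w = len(tokens), len(tokens[0])
    scores: Dict[Pos, float] = {}
    for r in range(h):
        for c in range(w):
            sims = [cosine(tokens[r][c], tokens[i][j])
                    for i, j in neighbors((r, c), h, w)]
            scores[(r, c)] = sum(sims) / len(sims) if sims else 0.0
    return scores


def select_pruned(scores: Dict[Pos, float], ratio: float) -> Set[Pos]:
    """Pick the most coherent tokens to skip, up to ratio * total tokens.

    Assumption: only odd-parity (checkerboard) positions may be pruned,
    so all neighbors of a pruned token are guaranteed to be retained.
    """
    candidates = [p for p in scores if (p[0] + p[1]) % 2 == 1]
    candidates.sort(key=lambda p: scores[p], reverse=True)
    k = min(int(ratio * len(scores)), len(candidates))
    return set(candidates[:k])


def reconstruct(tokens: List[List[Vec]],
                retained_out: Dict[Pos, Vec],
                pruned: Set[Pos]) -> Dict[Pos, Vec]:
    """Fill skipped attention outputs from retained neighbors.

    retained_out maps every retained position to its attention output;
    weights are clamped cosine similarities of the input tokens.
    """
    h, w = len(tokens), len(tokens[0])
    full = dict(retained_out)
    for p in pruned:
        nbrs = neighbors(p, h, w)
        # Small epsilon keeps the normalizer positive if all sims are <= 0.
        weights = [max(cosine(tokens[p[0]][p[1]], tokens[i][j]), 0.0) + 1e-8
                   for i, j in nbrs]
        total = sum(weights)
        dim = len(retained_out[nbrs[0]])
        full[p] = [sum(wt * retained_out[n][d] for wt, n in zip(weights, nbrs)) / total
                   for d in range(dim)]
    return full
```

In this sketch, self-attention would run only on the retained positions, and `reconstruct` restores a dense output grid for the following layers. The progressive, block-adaptive schedule mentioned in the abstract would correspond to varying `ratio` per block and per denoising step; how the paper sets those budgets is not stated here.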

Problem

Research questions and friction points this paper is trying to address.

Diffusion Transformers

computational cost

on-device deployment

token redundancy

efficient inference

Innovation

Methods, ideas, or system contributions that make the work stand out.

token pruning

spatial coherence

diffusion transformers