🤖 AI Summary
This work addresses the high computational and memory costs of fine-tuning Diffusion Transformers (DiTs), which hinder their deployment on resource-constrained devices. To this end, the authors propose DiT-BlockSkip, a framework that combines timestep-aware dynamic patch sampling with a cross-attention-mask-based mechanism for selecting and fine-tuning only the most critical transformer blocks. The approach further precomputes residual features for the remaining blocks, enabling efficient block skipping during training. This design substantially reduces memory consumption while preserving competitive performance in personalized image generation, offering a practical path toward fine-tuning large-scale diffusion models on edge devices.
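The residual-precompute idea above can be sketched in a few lines. This is a minimal numpy illustration, not the authors' implementation: `block_fn`, `forward_with_skipping`, and `active_ids` are hypothetical names, the "block" is a toy linear map, and caching a skipped block's residual branch is only exact while the inputs reaching that block are unchanged (in real fine-tuning this is an approximation).

```python
import numpy as np

def block_fn(x, w):
    # Toy stand-in for a transformer block's residual branch.
    return x @ w

def forward_with_skipping(x, weights, active_ids, residual_cache=None):
    """Run a stack of residual blocks. On the first call, precompute and
    cache the residual outputs of non-active (frozen) blocks; on later
    calls, reuse the cache so skipped blocks need no recomputation."""
    if residual_cache is None:
        residual_cache = {}
        h = x
        for i, w in enumerate(weights):
            r = block_fn(h, w)
            if i not in active_ids:
                residual_cache[i] = r  # frozen block: cache once, no gradients
            h = h + r
        return h, residual_cache
    # Fine-tuning pass: only active blocks run; skipped ones use the cache.
    h = x
    for i, w in enumerate(weights):
        r = block_fn(h, w) if i in active_ids else residual_cache[i]
        h = h + r
    return h, residual_cache
```

Because skipped blocks neither run a forward pass nor store activations for backpropagation, both compute and training memory drop roughly in proportion to the number of blocks skipped.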
📝 Abstract
Diffusion Transformers (DiTs) have significantly enhanced text-to-image (T2I) generation quality, enabling high-quality personalized content creation. However, fine-tuning these models demands substantial computation and memory, limiting practical deployment under resource constraints. To tackle these challenges, we propose a memory-efficient fine-tuning framework called DiT-BlockSkip, which integrates timestep-aware dynamic patch sampling with block skipping via precomputed residual features. Our dynamic patch sampling strategy adjusts patch sizes based on the diffusion timestep and then resizes the cropped patches to a fixed lower resolution. This reduces memory usage in both the forward and backward passes while allowing the model to capture global structures at higher timesteps and fine-grained details at lower timesteps. The block skipping mechanism fine-tunes only essential transformer blocks and precomputes residual features for the skipped blocks, significantly reducing training memory. To identify the blocks vital for personalization, we introduce a block selection strategy based on cross-attention masking. Evaluations demonstrate that our approach achieves competitive personalization performance, both qualitatively and quantitatively, while substantially reducing memory usage, a step toward on-device feasibility (e.g., smartphones, IoT devices) for large-scale diffusion transformers.
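The timestep-aware sampling described in the abstract can be illustrated with a small sketch. This is an assumed linear schedule, not the paper's exact one: `patch_size_for_timestep`, `sample_patch`, and the size bounds are illustrative names and values, and the resize is a simple nearest-neighbor subsampling.

```python
import numpy as np

def patch_size_for_timestep(t, t_max, min_size=32, max_size=128):
    """Linearly grow the patch side length with the timestep: large crops
    at high (noisy) timesteps for global structure, small crops at low
    timesteps for fine detail. (Assumed schedule for illustration.)"""
    frac = t / t_max
    return int(round(min_size + frac * (max_size - min_size)))

def sample_patch(image, t, t_max, out_size=32, rng=None):
    """Crop a timestep-dependent patch, then resize it to a fixed lower
    resolution so every training step sees the same tensor shape."""
    rng = rng if rng is not None else np.random.default_rng(0)
    h, w = image.shape[:2]
    p = min(patch_size_for_timestep(t, t_max), h, w)
    top = rng.integers(0, h - p + 1)
    left = rng.integers(0, w - p + 1)
    patch = image[top:top + p, left:left + p]
    # Nearest-neighbor resize to the fixed training resolution.
    idx = (np.arange(out_size) * p / out_size).astype(int)
    return patch[np.ix_(idx, idx)]
```

Because the network always receives `out_size`-resolution inputs regardless of the crop size, activation memory in both the forward and backward passes stays constant while the effective receptive field still varies with the timestep.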