🤖 AI Summary
This work addresses the high computational and memory costs of fine-tuning Diffusion Transformers (DiTs), which hinder their deployment on resource-constrained devices. To this end, the authors propose DiT-BlockSkip, a framework that combines timestep-aware dynamic patch sampling with a cross-attention-mask-based mechanism for selecting and fine-tuning only the most critical transformer blocks. The approach further precomputes residual features for the remaining blocks, enabling efficient block skipping during training. This design substantially reduces memory consumption while preserving competitive performance in personalized image generation, offering a practical path toward fine-tuning large-scale diffusion models on edge devices.
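The residual-precompute idea above can be sketched in a few lines. This is a minimal numpy illustration, not the authors' implementation: `block_fn`, `forward_with_skipping`, and `active_ids` are hypothetical names, the "block" is a toy linear map, and caching a skipped block's residual branch is only exact while the inputs reaching that block are unchanged (in real fine-tuning this is an approximation).

```python
import numpy as np

def block_fn(x, w):
    # Toy stand-in for a transformer block's residual branch.
    return x @ w

def forward_with_skipping(x, weights, active_ids, residual_cache=None):
    """Run a stack of residual blocks. On the first call, precompute and
    cache the residual outputs of non-active (frozen) blocks; on later
    calls, reuse the cache so skipped blocks need no recomputation."""
    if residual_cache is None:
        residual_cache = {}
        h = x
        for i, w in enumerate(weights):
            r = block_fn(h, w)
            if i not in active_ids:
                residual_cache[i] = r  # frozen block: cache once, no gradients
            h = h + r
        return h, residual_cache
    # Fine-tuning pass: only active blocks run; skipped ones use the cache.
    h = x
    for i, w in enumerate(weights):
        r = block_fn(h, w) if i in active_ids else residual_cache[i]
        h = h + r
    return h, residual_cache
```

Because skipped blocks neither run a forward pass nor store activations for backpropagation, both compute and training memory drop roughly in proportion to the number of blocks skipped.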
📝 Abstract
Diffusion Transformers (DiTs) have significantly enhanced text-to-image (T2I) generation quality, enabling high-quality personalized content creation. However, fine-tuning these models demands substantial computation and memory, limiting practical deployment under resource constraints. To tackle these challenges, we propose a memory-efficient fine-tuning framework called DiT-BlockSkip, which integrates timestep-aware dynamic patch sampling with block skipping via precomputed residual features. Our dynamic patch sampling strategy adjusts patch sizes based on the diffusion timestep and then resizes the cropped patches to a fixed lower resolution. This reduces memory usage in both the forward and backward passes while allowing the model to capture global structures at higher timesteps and fine-grained details at lower timesteps. The block skipping mechanism fine-tunes only essential transformer blocks and precomputes residual features for the skipped blocks, significantly reducing training memory. To identify the blocks vital for personalization, we introduce a block selection strategy based on cross-attention masking. Evaluations demonstrate that our approach achieves competitive personalization performance, both qualitatively and quantitatively, while substantially reducing memory usage, a step toward on-device feasibility (e.g., smartphones, IoT devices) for large-scale diffusion transformers.
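The timestep-aware sampling described in the abstract can be illustrated with a small sketch. This is an assumed linear schedule, not the paper's exact one: `patch_size_for_timestep`, `sample_patch`, and the size bounds are illustrative names and values, and the resize is a simple nearest-neighbor subsampling.

```python
import numpy as np

def patch_size_for_timestep(t, t_max, min_size=32, max_size=128):
    """Linearly grow the patch side length with the timestep: large crops
    at high (noisy) timesteps for global structure, small crops at low
    timesteps for fine detail. (Assumed schedule for illustration.)"""
    frac = t / t_max
    return int(round(min_size + frac * (max_size - min_size)))

def sample_patch(image, t, t_max, out_size=32, rng=None):
    """Crop a timestep-dependent patch, then resize it to a fixed lower
    resolution so every training step sees the same tensor shape."""
    rng = rng if rng is not None else np.random.default_rng(0)
    h, w = image.shape[:2]
    p = min(patch_size_for_timestep(t, t_max), h, w)
    top = rng.integers(0, h - p + 1)
    left = rng.integers(0, w - p + 1)
    patch = image[top:top + p, left:left + p]
    # Nearest-neighbor resize to the fixed training resolution.
    idx = (np.arange(out_size) * p / out_size).astype(int)
    return patch[np.ix_(idx, idx)]
```

Because the network always receives `out_size`-resolution inputs regardless of the crop size, activation memory in both the forward and backward passes stays constant while the effective receptive field still varies with the timestep.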