🤖 AI Summary
In diffusion models, U-Net parameters exhibit strong temporal (timestep) and sample dependence: early timesteps govern structural modeling, while later timesteps refine texture details, yet conventional full parameter sharing induces redundancy and interference. Method: We first observe that individual parameters, even large-magnitude ones, are not uniformly critical, and propose MaskUNet: a dynamic sparse masking mechanism that adaptively zeros out non-critical U-Net parameters during denoising. It introduces minimal auxiliary parameters and supports both training-based and training-free fine-tuning strategies. Results: In zero-shot generation on COCO, MaskUNet achieves state-of-the-art FID (a 12.3% improvement over the baseline), enhancing texture fidelity and detail quality, and generalizes robustly to downstream tasks. Our core contributions are identifying the timestep-sensitive roles of parameters in the diffusion process and establishing the first lightweight U-Net sparsification paradigm that adapts jointly to sample and timestep.
📝 Abstract
Diffusion models focus on constructing basic image structures in the early stages, while refined details, including local features and textures, are generated in the later stages. The same network layers are thus forced to learn both structural and textural information simultaneously, in contrast to traditional deep learning architectures (e.g., ResNet or GANs), which capture or generate image semantic information at different layers. This difference inspires us to explore the time-wise behavior of diffusion models. We first investigate the key contributions of the U-Net parameters to the denoising process and identify that properly zeroing out certain parameters (including large ones) contributes to denoising, substantially improving generation quality on the fly. Capitalizing on this discovery, we propose a simple yet effective method, termed "MaskUNet", that enhances generation quality with a negligible number of additional parameters. Our method fully leverages timestep- and sample-dependent effective U-Net parameters. To optimize MaskUNet, we offer two fine-tuning strategies: a training-based approach and a training-free approach, each with tailored networks and optimization functions. In zero-shot inference on the COCO dataset, MaskUNet achieves the best FID score and further demonstrates its effectiveness in downstream task evaluations. Project page: https://gudaochangsheng.github.io/MaskUnet-Page/
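To make the core idea concrete, the timestep- and sample-dependent parameter masking described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the mask-generator architecture (`MaskedLinear`, a single random linear map over a sinusoidal timestep embedding concatenated with the input sample) and the hard zero-threshold binarization are assumptions chosen for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def timestep_embedding(t, dim=8):
    # Standard sinusoidal embedding, as commonly used in diffusion U-Nets.
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])

class MaskedLinear:
    """Linear layer whose weights are gated by a binary mask that depends
    on both the timestep and the input sample (illustrative sketch only)."""

    def __init__(self, d_in, d_out, emb_dim=8):
        self.W = rng.standard_normal((d_out, d_in)) * 0.1
        # Hypothetical tiny mask generator: maps the concatenated
        # [timestep embedding; sample] to one logit per weight entry.
        self.Wm = rng.standard_normal((d_out * d_in, emb_dim + d_in)) * 0.1

    def __call__(self, x, t):
        outs = []
        for xi in x:  # per-sample mask -> sample-dependent sparsity
            ctx = np.concatenate([timestep_embedding(t), xi])
            logits = self.Wm @ ctx
            # Hard 0/1 mask: non-critical weights are zeroed out entirely.
            mask = (logits > 0).astype(self.W.dtype).reshape(self.W.shape)
            outs.append((self.W * mask) @ xi)
        return np.stack(outs)

layer = MaskedLinear(d_in=4, d_out=3)
x = rng.standard_normal((2, 4))  # a batch of 2 samples
y = layer(x, t=10)               # each sample/timestep uses its own weight subset
```

In the training-based strategy the mask generator's parameters would be learned (e.g., with a straight-through estimator for the binarization), while the frozen `self.W` plays the role of the pretrained U-Net weights; the few auxiliary parameters live only in the mask generator.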