🤖 AI Summary
This work addresses efficient fine-tuning of pretrained text-to-image diffusion models under limited data, aiming to jointly achieve concept personalization, preserve multi-task instruction-following capability, and enhance generation editability. We propose a decomposition-based fine-tuning framework that decouples the weight update into two components, implemented via two compact trainable low-rank matrices: (i) a projection of the pre-trained weights onto the complement of a trainable low-rank subspace, and (ii) a low-rank update within that subspace. This design enables parameter-efficient adaptation while promoting representation disentanglement, and it is compatible with mainstream architectures such as Stable Diffusion. On benchmarks including DreamBooth and InsDet, our approach significantly outperforms existing efficient fine-tuning methods. It generalizes across concept learning, scene editing, and visual in-context generation, and exhibits emergent editability, achieving state-of-the-art performance with only a minimal number of trainable parameters.
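In symbols (our notation, a hedged reading of the description above rather than the paper's exact formulation): with frozen pre-trained weights $W_0$, a trainable low-rank matrix $U \in \mathbb{R}^{d \times r}$ whose orthonormalized columns span the subspace, and a second trainable low-rank matrix $B$, the adapted weights take the form

$$
W' = (I - UU^{\top})\,W_0 + UB,
$$

where $(I - UU^{\top})W_0$ is the projection of the frozen weights onto the subspace complement and $UB$ is the low-rank update within the subspace.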
📝 Abstract
Efficient fine-tuning of pre-trained Text-to-Image (T2I) models adapts the model to a particular task or dataset while minimizing computational resources and limiting the number of trainable parameters. However, it often struggles to balance alignment with the target distribution, i.e., learning a novel concept from a limited set of images for personalization, with retaining the instruction-following ability needed to unify multiple tasks, all while maintaining editability (alignment with a variety of prompts or in-context generation). In this work, we introduce DEFT, Decompositional Efficient Fine-Tuning, an efficient fine-tuning framework that adapts a pre-trained weight matrix by decomposing its update into two components with two trainable low-rank matrices: (1) a projection onto the complement of a low-rank subspace spanned by one of the matrices, and (2) a low-rank update within that subspace. The first trainable low-rank matrix defines the subspace, while the second enables flexible parameter adaptation within it. We conducted extensive experiments on the DreamBooth and DreamBench Plus datasets for personalization, the InsDet dataset for object and scene adaptation, and the VisualCloze dataset for universal image generation through visual in-context learning, using both Stable Diffusion and a unified model. Our results demonstrate state-of-the-art performance, highlighting the emergent properties of efficient fine-tuning. Our code is available at [DEFTBase](https://github.com/MAXNORM8650/DEFT).
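The following is a minimal PyTorch sketch of the decomposition described in the abstract, not the released implementation: the class name `DEFTLinear`, the rank hyperparameter `r`, and the initialization scheme are illustrative assumptions; only the matrices `U` (subspace basis) and `B` (in-subspace update) are trained.

```python
# Minimal sketch of a DEFT-style update W' = (I - U U^T) W0 + U B.
# This illustrates the decomposition only; it is not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DEFTLinear(nn.Module):
    """Wraps a frozen linear layer with a decomposed, low-rank update."""

    def __init__(self, linear: nn.Linear, r: int = 4):
        super().__init__()
        out_dim, in_dim = linear.weight.shape
        self.register_buffer("weight0", linear.weight.detach().clone())  # frozen W0
        self.bias = linear.bias  # kept as-is; freezing it is a separate design choice
        # Trainable low-rank matrices: U spans the subspace, B adapts within it.
        self.U = nn.Parameter(torch.randn(out_dim, r) / out_dim ** 0.5)
        with torch.no_grad():
            Q0, _ = torch.linalg.qr(self.U)  # orthonormal basis of the initial subspace
            # Initialize B so the adapted weight equals W0 at step 0:
            # (I - Q0 Q0^T) W0 + Q0 (Q0^T W0) = W0.
            b0 = Q0.transpose(0, 1) @ self.weight0
        self.B = nn.Parameter(b0)

    def effective_weight(self) -> torch.Tensor:
        # Re-orthonormalize U so Q Q^T is a valid orthogonal projector.
        Q, _ = torch.linalg.qr(self.U)                    # (out_dim, r)
        proj_w0 = Q @ (Q.transpose(0, 1) @ self.weight0)  # U U^T W0
        return self.weight0 - proj_w0 + Q @ self.B        # (I - U U^T) W0 + U B

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.linear(x, self.effective_weight(), self.bias)
```

Only `U` (`out_dim × r`) and `B` (`r × in_dim`) receive gradients here, so the number of trainable parameters scales with the rank `r` rather than with the full weight matrix, consistent with the parameter-efficiency claim.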