CtrLoRA: An Extensible and Efficient Framework for Controllable Image Generation

📅 2024-10-12
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing controllable image generation methods (e.g., ControlNet) require separate, resource-intensive training for each control condition—typically millions of annotated samples and hundreds of GPU-hours—severely hindering rapid exploration of novel conditions. To address this, we propose CtrLoRA, a two-stage framework: a Base ControlNet first learns a general-purpose image-to-image backbone via joint pretraining on multiple base conditions, and condition-specific LoRA modules are then fine-tuned for parameter-efficient adaptation. Our approach enables customization to new control modalities (e.g., edge, depth, pose) using only ~1,000 samples and under one hour on a single GPU. It reduces trainable parameters by 90% while matching ControlNet's performance across diverse control tasks, significantly improving the generalization, scalability, and practical deployability of controllable diffusion models—particularly in resource-constrained settings.

📝 Abstract
Recently, large-scale diffusion models have made impressive progress in text-to-image (T2I) generation. To further equip these T2I models with fine-grained spatial control, approaches like ControlNet introduce an extra network that learns to follow a condition image. However, for every single condition type, ControlNet requires independent training on millions of data pairs with hundreds of GPU hours, which is quite expensive and makes it challenging for ordinary users to explore and develop new types of conditions. To address this problem, we propose the CtrLoRA framework, which trains a Base ControlNet to learn the common knowledge of image-to-image generation from multiple base conditions, along with condition-specific LoRAs to capture distinct characteristics of each condition. Utilizing our pretrained Base ControlNet, users can easily adapt it to new conditions, requiring as few as 1,000 data pairs and less than one hour of single-GPU training to obtain satisfactory results in most scenarios. Moreover, our CtrLoRA reduces the learnable parameters by 90% compared to ControlNet, significantly lowering the threshold to distribute and deploy the model weights. Extensive experiments on various types of conditions demonstrate the efficiency and effectiveness of our method. Codes and model weights will be released at https://github.com/xyfJASON/ctrlora.
Problem

Research questions and friction points this paper is trying to address.

High cost of training ControlNet for new conditions
Difficulty for ordinary users to explore new condition types
Large per-condition model weights that hinder distribution and deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Base ControlNet learns common image generation knowledge
Condition-specific LoRAs capture distinct condition characteristics
Reduces learnable parameters by 90% compared to ControlNet
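The core idea—a frozen shared backbone plus small trainable low-rank adapters per condition—can be sketched as follows. This is an illustrative toy example, not the paper's actual code: CtrLoRA attaches LoRAs to ControlNet's conv/attention layers, and the layer sizes, rank, and class names here are assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update:
    y = W x + (alpha / r) * B (A x), with only A and B trainable."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # shared Base-ControlNet weights stay frozen
        # A: small random init; B: zero init, so the adapter starts as a no-op
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

# A 64-dim toy layer: the LoRA adds 2*rank*64 = 512 trainable parameters,
# a small fraction of the frozen base's 64*64 + 64 = 4160.
layer = LoRALinear(nn.Linear(64, 64), rank=4)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable} / {total} ({100 * trainable / total:.1f}%)")
```

Because `B` is zero-initialized, the adapted layer reproduces the frozen base exactly at the start of fine-tuning, and only the 512 adapter parameters are updated for each new condition—mirroring how a condition-specific LoRA is distributed separately from the shared backbone.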
👥 Authors

Yifeng Xu
Key Lab of AI Safety, Institute of Computing Technology, CAS, China; University of Chinese Academy of Sciences, China

Zhenliang He
Institute of Computing Technology, Chinese Academy of Sciences
Computer Vision · AIGC

Shiguang Shan
Professor, Institute of Computing Technology, Chinese Academy of Sciences
Computer Vision · Pattern Recognition · Machine Learning · Face Recognition

Xilin Chen
Key Lab of AI Safety, Institute of Computing Technology, CAS, China; University of Chinese Academy of Sciences, China