Att-Adapter: A Robust and Precise Domain-Specific Multi-Attributes T2I Diffusion Adapter via Conditional Variational Autoencoder

📅 2025-03-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of precise, simultaneous control over multiple continuous attributes (e.g., eye openness, car width) in a new domain under text-only guidance. To this end, the authors propose Att-Adapter, a plug-and-play module for pretrained T2I diffusion models. Methodologically, it (1) uses a decoupled cross-attention mechanism to harmonize multiple domain attributes with text conditioning; (2) incorporates a Conditional Variational Autoencoder (CVAE) to mitigate overfitting; and (3) learns a single adapter from unpaired sample images carrying multiple visual attributes, requiring no synthetic paired data. On two public benchmarks, Att-Adapter outperforms all LoRA-based baselines in controlling continuous attributes, and achieves a broader control range and better disentanglement across attributes than StyleGAN-based methods.
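The decoupled cross-attention the summary mentions can be sketched as two parallel cross-attention branches over the same image-latent queries: a frozen text branch and a newly trained attribute branch whose outputs are summed, so attribute conditioning is added without disturbing the pretrained text pathway. The following is a minimal single-head NumPy illustration, not the authors' implementation; all weight names, token counts, and dimensions are made up for the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_tokens, kv_tokens, Wq, Wk, Wv):
    """Single-head cross-attention: queries from q_tokens, keys/values from kv_tokens."""
    Q, K, V = q_tokens @ Wq, kv_tokens @ Wk, kv_tokens @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
d = 8                                        # illustrative hidden size
latents = rng.standard_normal((16, d))       # image latent tokens (queries)
text = rng.standard_normal((4, d))           # text-condition tokens
attrs = rng.standard_normal((3, d))          # embedded attribute tokens (e.g. eye openness)

# Shared query projection; separate key/value projections per branch:
# the text branch stays frozen, the attribute branch is the trainable adapter.
Wq = rng.standard_normal((d, d))
Wk_text, Wv_text = rng.standard_normal((d, d)), rng.standard_normal((d, d))
Wk_attr, Wv_attr = rng.standard_normal((d, d)), rng.standard_normal((d, d))

scale = 1.0  # strength of the attribute branch
out = (cross_attention(latents, text, Wq, Wk_text, Wv_text)
       + scale * cross_attention(latents, attrs, Wq, Wk_attr, Wv_attr))
print(out.shape)  # (16, 8)
```

Summing the two branches (rather than concatenating text and attribute tokens into one sequence) is what lets the adapter be plugged in and scaled, or removed entirely, without retraining the base model.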

📝 Abstract
Text-to-Image (T2I) Diffusion Models have achieved remarkable performance in generating high quality images. However, enabling precise control of continuous attributes, especially multiple attributes simultaneously, in a new domain (e.g., numeric values like eye openness or car width) with text-only guidance remains a significant challenge. To address this, we introduce the Attribute (Att) Adapter, a novel plug-and-play module designed to enable fine-grained, multi-attributes control in pretrained diffusion models. Our approach learns a single control adapter from a set of sample images that can be unpaired and contain multiple visual attributes. The Att-Adapter leverages the decoupled cross attention module to naturally harmonize the multiple domain attributes with text conditioning. We further introduce a Conditional Variational Autoencoder (CVAE) into the Att-Adapter to mitigate overfitting, matching the diverse nature of the visual world. Evaluations on two public datasets show that Att-Adapter outperforms all LoRA-based baselines in controlling continuous attributes. Additionally, our method enables a broader control range and also improves disentanglement across multiple attributes, surpassing StyleGAN-based techniques. Notably, Att-Adapter is flexible, requiring no paired synthetic data for training, and is easily scalable to multiple attributes within a single model.
Problem

Research questions and friction points this paper is trying to address.

Precise control of continuous attributes (e.g., eye openness, car width) with text-only guidance remains difficult in pretrained T2I diffusion models.
Controlling multiple attributes simultaneously, and doing so in a new domain, is especially challenging.
Adapters trained on limited domain-specific data are prone to overfitting, and existing approaches often require paired synthetic data.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Plug-and-play adapter enabling fine-grained multi-attribute control in pretrained diffusion models
Conditional Variational Autoencoder (CVAE) that mitigates overfitting and allows training on unpaired data
Decoupled cross-attention that harmonizes multiple domain attributes with text conditioning
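The CVAE listed above can be sketched as follows: an encoder maps features and continuous attribute values to a Gaussian latent via the reparameterization trick, a decoder reconstructs conditioned on both the latent and the attributes, and a KL term pulls the latent toward the prior, which is the regularization that mitigates overfitting. This is a minimal NumPy forward pass under assumed, illustrative shapes, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(1)
d_x, d_c, d_z = 8, 3, 4                   # illustrative feature/attribute/latent dims
x = rng.standard_normal((2, d_x))         # batch of attribute-bearing features
c = rng.uniform(0.0, 1.0, (2, d_c))       # continuous attribute values in [0, 1]

# Encoder q(z | x, c): linear maps to latent mean and log-variance.
W_mu = rng.standard_normal((d_x + d_c, d_z)) * 0.1
W_lv = rng.standard_normal((d_x + d_c, d_z)) * 0.1
h = np.concatenate([x, c], axis=1)
mu, log_var = h @ W_mu, h @ W_lv

# Reparameterization trick: z = mu + sigma * eps keeps sampling differentiable.
eps = rng.standard_normal(mu.shape)
z = mu + np.exp(0.5 * log_var) * eps

# Decoder p(x | z, c): reconstruction conditioned on the latent and the attributes.
W_dec = rng.standard_normal((d_z + d_c, d_x)) * 0.1
x_hat = np.concatenate([z, c], axis=1) @ W_dec

# Per-sample KL(q(z|x,c) || N(0, I)) regularizer.
kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=1)
print(x_hat.shape, kl.shape)  # (2, 8) (2,)
```

At generation time, sampling z from the prior while varying c is what gives diverse outputs for the same attribute setting, rather than collapsing onto the training images.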