ClassDiffusion: More Aligned Personalization Tuning with Explicit Class Guidance

📅 2024-05-27
🏛️ arXiv.org
📈 Citations: 4
✨ Influential: 0
🤖 AI Summary
Existing text-to-image personalization methods suffer from semantic drift due to overfitting, leading to compositional failure (e.g., omitting "headphones" in "a dog wearing headphones"). This work identifies semantic drift as the root cause of degraded compositional generalization. It proposes a semantic-preserving fine-tuning framework: (1) explicit class embeddings guide alignment of the novel concept with the original semantic space; and (2) a semantic consistency loss enforces compositional integrity. The method requires only few-shot examples, is architecture-agnostic across diffusion models, and extends naturally to video generation. Experiments demonstrate substantial improvements in multi-condition compositional generation: quantitative metrics increase by 12.6%, while qualitative evaluations confirm simultaneous gains in semantic fidelity and out-of-distribution generalization.

๐Ÿ“ Abstract
Recent text-to-image customization works have proven successful in generating images of given concepts by fine-tuning diffusion models on a few examples. However, tuning-based methods inherently tend to overfit the concepts, resulting in failure to create the concept under multiple conditions (*e.g.*, the headphone is missing when generating "a dog wearing a headphone"). Interestingly, we notice that the base model before fine-tuning exhibits the capability to compose the base concept with other elements (*e.g.*, "a dog wearing a headphone"), implying that the compositional ability only disappears after personalization tuning. We observe a semantic shift in the customized concept after fine-tuning, indicating that the personalized concept is not aligned with the original concept, and further show through theoretical analyses that this semantic shift leads to increased difficulty in sampling the joint conditional probability distribution, resulting in the loss of the compositional ability. Inspired by this finding, we present **ClassDiffusion**, a technique that leverages a **semantic preservation loss** to explicitly regulate the concept space when learning a new concept. Although simple, this approach effectively prevents semantic drift during the fine-tuning process of the target concepts. Extensive qualitative and quantitative experiments demonstrate that the use of semantic preservation loss effectively improves the compositional abilities of fine-tuned models. Lastly, we also extend ClassDiffusion to personalized video generation, demonstrating its flexibility.
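The abstract describes a semantic preservation loss that keeps the learned concept's text embedding close to that of its explicit class prompt. Below is a minimal, hypothetical sketch of such a loss in plain NumPy: the function names, the cosine-similarity formulation, and the weighting scheme are assumptions for illustration, not the paper's actual implementation (which operates on a diffusion model's text-encoder features).

```python
import numpy as np

def semantic_preservation_loss(personalized_emb: np.ndarray,
                               class_emb: np.ndarray) -> float:
    """Hypothetical sketch of a semantic preservation loss.

    personalized_emb: text-encoder embedding of the personalized prompt,
                      e.g. "a <V*> dog" (shape assumed arbitrary).
    class_emb:        embedding of the explicit class prompt, e.g. "a dog".

    Penalizes semantic drift by pushing the personalized embedding toward
    the class embedding; the loss is 0 when the two are perfectly aligned.
    """
    p = personalized_emb.ravel()
    c = class_emb.ravel()
    # Cosine similarity with a small epsilon for numerical stability.
    cos = float(p @ c / (np.linalg.norm(p) * np.linalg.norm(c) + 1e-8))
    return 1.0 - cos

def total_loss(denoise_loss: float, personalized_emb: np.ndarray,
               class_emb: np.ndarray, weight: float = 0.1) -> float:
    """Assumed combination: standard diffusion denoising loss plus the
    weighted preservation term (weight value is illustrative only)."""
    return denoise_loss + weight * semantic_preservation_loss(
        personalized_emb, class_emb)
```

In this sketch the preservation term acts as a regularizer during fine-tuning: minimizing it keeps the new token's prompt embedding in the neighborhood of the original class concept, which is the mechanism the abstract credits for retaining compositional ability.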
Problem

Research questions and friction points this paper is trying to address.

Overfitting in text-to-image customization models
Semantic shift after fine-tuning diffusion models
Loss of compositional ability in personalized concepts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic preservation loss prevents concept drift
ClassDiffusion enhances compositional ability in models
Extends to personalized video generation effectively