🤖 AI Summary
Text-to-image diffusion models often suffer from concept omission, semantic fusion artifacts, or dominance imbalance when generating images from multi-concept prompts (e.g., “a cat and a dog”). To address this without model retraining, we propose a plug-and-play sampling strategy grounded in the classifier-free guidance framework. Our method introduces a contrastive concept guidance mechanism that dynamically identifies and avoids regions of unstable guidance weights, steering the generation process toward clean, balanced joint representations where all concepts co-occur faithfully. The core innovation lies in lightweight trajectory modulation, enabling semantic decoupling and synergistic optimization across multiple concepts. Experiments demonstrate substantial improvements in concept coverage and visual balance across diverse multi-concept prompts, outperforming standard DDIM and existing compositional generation methods. These results validate both the effectiveness and generalizability of our approach.
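The classifier-free guidance framework the method plugs into can be sketched as follows. This is a minimal illustration, not code from the paper: `cfg_step` is a hypothetical name, and `w` is the usual guidance scale (7.5 is a common default, not a value reported here).

```python
import numpy as np

def cfg_step(eps_uncond, eps_joint, w=7.5):
    """Standard classifier-free guidance: move the unconditional noise
    prediction toward the joint-prompt-conditioned prediction by scale w."""
    return eps_uncond + w * (eps_joint - eps_uncond)
```

The proposed method modifies how this guidance direction is formed at each sampling step rather than retraining the model.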
📝 Abstract
We propose to improve multi-concept prompt fidelity in text-to-image diffusion models. We begin with common failure cases: prompts like "a cat and a dog" that sometimes yield images where one concept is missing, faint, or colliding awkwardly with another. We hypothesize that this happens when the diffusion model drifts into mixed modes that over-emphasize a single concept it learned strongly during training. Instead of retraining, we introduce a corrective sampling strategy that steers away from regions where the joint prompt behavior overlaps too strongly with any single concept in the prompt, and toward "pure" joint modes where all concepts can coexist with balanced visual presence. We further show that existing multi-concept guidance schemes can operate in unstable weight regimes that amplify imbalance; we characterize favorable regions and adapt sampling to remain within them. Our approach, CO3, is plug-and-play, requires no model tuning, and complements standard classifier-free guidance. Experiments on diverse multi-concept prompts indicate improvements in concept coverage, balance, and robustness, with fewer dropped or distorted concepts than standard baselines and prior compositional methods. The results suggest that lightweight corrective guidance can substantially mitigate brittle semantic alignment in modern diffusion systems.
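One way to picture the corrective step is as a projection that damps the joint guidance direction wherever it becomes nearly collinear with a single concept's guidance direction. The sketch below is our own illustration under stated assumptions, not the paper's published update rule: the function name `contrastive_guidance` and the parameters `lam` (damping strength) and `tau` (cosine-similarity threshold) are hypothetical.

```python
import numpy as np

def contrastive_guidance(eps_uncond, eps_joint, eps_concepts,
                         w=7.5, lam=0.5, tau=0.9):
    """Illustrative corrective guidance (not the exact CO3 update):
    start from the standard CFG direction for the joint prompt, then,
    for each single-concept direction it overlaps too strongly with
    (cosine similarity above tau), subtract a fraction lam of the
    projection onto that direction, discouraging drift into
    single-concept modes."""
    d_joint = (eps_joint - eps_uncond).ravel()
    corrected = d_joint.copy()
    for eps_c in eps_concepts:
        d_c = (eps_c - eps_uncond).ravel()
        nc = np.linalg.norm(d_c)
        nj = np.linalg.norm(corrected)
        if nc == 0.0 or nj == 0.0:
            continue
        cos = float(corrected @ d_c) / (nj * nc)
        if cos > tau:  # joint direction dominated by one concept
            corrected = corrected - lam * (corrected @ d_c) / nc**2 * d_c
    return (eps_uncond.ravel() + w * corrected).reshape(eps_uncond.shape)
```

In this toy form, when the joint direction exactly matches a single-concept direction, the correction shrinks the guided step, while well-separated concept directions pass through unchanged; the actual method additionally adapts the guidance weights to stay in the stable regimes it characterizes.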