🤖 AI Summary
Existing personalized image generation methods struggle to accurately compose specified parts from multiple source images without user-provided masks or annotations. To address this, we propose Chimera, a text-instructed, part-level controllable synthesis framework. Our approach introduces "semantic atoms" (structured part-subject pairs) to model local semantics, and constructs a large-scale training dataset from a taxonomy of 464 semantic atoms, yielding 37,000 prompts and corresponding synthesized images. We design a part-conditioned custom diffusion prior model and introduce PartEval, an objective metric quantifying part alignment and compositional accuracy. Human evaluations and PartEval show that our method outperforms state-of-the-art baselines by 14% in part alignment and compositional accuracy and by 21% in visual quality. To the best of our knowledge, this is the first work to enable fine-grained, cross-image part-level controllable synthesis without requiring manual annotations.
📝 Abstract
Personalized image generative models are highly proficient at synthesizing images from text or a single image, yet they lack explicit control for composing objects from specific parts of multiple source images without user-specified masks or annotations. To address this, we introduce Chimera, a personalized image generation model that generates novel objects by combining specified parts from different source images according to textual instructions. To train our model, we first construct a dataset from a taxonomy built on 464 unique (part, subject) pairs, which we term semantic atoms. From these, we generate 37k prompts and synthesize the corresponding images with a high-fidelity text-to-image model. We train a custom diffusion prior model with part-conditional guidance, which steers the image-conditioning features to enforce both semantic identity and spatial layout. We also introduce PartEval, an objective metric that assesses the fidelity and compositional accuracy of generation pipelines. Human evaluations and our proposed metric show that Chimera outperforms baselines by 14% in part alignment and compositional accuracy and by 21% in visual quality.
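To make the dataset-construction idea concrete, a semantic atom can be modeled as a (part, subject) pair from which cross-subject composition prompts are enumerated. The sketch below is purely illustrative: the sample atoms, the prompt template, and the `compose_prompt` helper are assumptions, not the paper's actual taxonomy or pipeline.

```python
from dataclasses import dataclass
from itertools import permutations

@dataclass(frozen=True)
class SemanticAtom:
    """A (part, subject) pair, e.g. ('wings', 'eagle')."""
    part: str
    subject: str

# Hypothetical sample of the taxonomy (the paper's 464 atoms are not listed here).
ATOMS = [
    SemanticAtom("wings", "eagle"),
    SemanticAtom("body", "lion"),
    SemanticAtom("tail", "snake"),
]

def compose_prompt(base: SemanticAtom, donor: SemanticAtom) -> str:
    """Build a part-composition prompt from two atoms (assumed template)."""
    return (f"a {base.subject} with the {donor.part} of a {donor.subject}, "
            f"photorealistic, studio lighting")

# Enumerate ordered atom pairs with distinct subjects; each pair yields one
# training prompt whose image would be synthesized by a text-to-image model.
prompts = [
    compose_prompt(a, b)
    for a, b in permutations(ATOMS, 2)
    if a.subject != b.subject
]

print(len(prompts))  # 6 ordered pairs from 3 atoms
print(prompts[0])
```

Scaled to the paper's 464 atoms, this kind of pairwise enumeration (plus filtering for plausible part-subject combinations) would plausibly produce a prompt set on the order of the reported 37k.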