Chimera: Compositional Image Generation using Part-based Concepting

📅 2025-10-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing personalized image generation methods struggle to accurately compose specified components from multiple source images without user-provided masks or annotations. To address this, we propose a text-instructed, part-level controllable synthesis framework. Our approach introduces “semantic atoms”—structured part-subject pairs—to model local semantics, and constructs a large-scale training dataset comprising 464 semantic atoms and 37,000 image-text pairs. We design a part-conditioned custom diffusion prior model and introduce PartEval, a novel metric quantifying part alignment and compositional accuracy. Experiments demonstrate that our method achieves a 14% improvement in part composition accuracy and a 21% gain in visual quality over state-of-the-art baselines, as measured by both human evaluation and PartEval. To the best of our knowledge, this is the first work to enable fine-grained, cross-image part-level controllable synthesis without requiring manual annotations.

📝 Abstract
Personalized image generative models are highly proficient at synthesizing images from text or a single image, yet they lack explicit control for composing objects from specific parts of multiple source images without user-specified masks or annotations. To address this, we introduce Chimera, a personalized image generation model that generates novel objects by combining specified parts from different source images according to textual instructions. To train our model, we first construct a dataset from a taxonomy built on 464 unique (part, subject) pairs, which we term semantic atoms. From this, we generate 37k prompts and synthesize the corresponding images with a high-fidelity text-to-image model. We train a custom diffusion prior model with part-conditional guidance, which steers the image-conditioning features to enforce both semantic identity and spatial layout. We also introduce an objective metric, PartEval, to assess the fidelity and compositional accuracy of generation pipelines. Human evaluations and our proposed metric show that Chimera outperforms other baselines by 14% in part alignment and compositional accuracy and by 21% in visual quality.
Problem

Research questions and friction points this paper is trying to address.

Generating images by combining parts from multiple sources without masks
Enforcing semantic identity and spatial layout through conditional guidance
Assessing compositional accuracy in part-based image generation pipelines
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines object parts from multiple source images
Trains diffusion prior with part-conditional guidance
Introduces PartEval metric for compositional accuracy
👥 Authors

Shivam Singh — Arizona State University
Yiming Chen — Georgia Institute of Technology
Agneet Chatterjee — Arizona State University
Amit Raj — Google DeepMind
James Hays — Georgia Institute of Technology
Yezhou Yang — Arizona State University
Chitta Baral — Arizona State University