InstanceGen: Image Generation with Instance-level Instructions

📅 2025-05-08

📈 Citations: 0

✨ Influential: 0

career value

156K/year

🤖 AI Summary

Current text-to-image models struggle to accurately model object cardinality, instance-level attributes, and spatial relationships specified in complex prompts, resulting in low semantic alignment between generated images and textual descriptions. To address this, we propose a structure-aware end-to-end alignment framework that uniquely leverages intrinsic image structures—such as segmentation masks and layout heatmaps—extracted from the generated image itself as fine-grained initialization signals. These structural cues are tightly coupled with instance-level instructions parsed by a large language model (LLM) to guide the diffusion process. Our method integrates vision-language joint reasoning, LLM-driven instruction understanding, and image-structure inversion. Evaluated on multi-instance composition benchmarks, our approach achieves significant gains in prompt fidelity: object counting accuracy improves by 32%, while attribute and spatial layout consistency reach state-of-the-art performance.

Technology Category

Application Category

📝 Abstract

Despite rapid advancements in the capabilities of generative models, pretrained text-to-image models still struggle in capturing the semantics conveyed by complex prompts that compound multiple objects and instance-level attributes. Consequently, we are witnessing growing interests in integrating additional structural constraints, %leveraging additional structural inputs typically in the form of coarse bounding boxes, to better guide the generation process in such challenging cases. In this work, we take the idea of structural guidance a step further by making the observation that contemporary image generation models can directly provide a plausible emph{fine-grained} structural initialization. We propose a technique that couples this image-based structural guidance with LLM-based instance-level instructions, yielding output images that adhere to all parts of the text prompt, including object counts, instance-level attributes, and spatial relations between instances.

Problem

Research questions and friction points this paper is trying to address.

Text-to-image models fail with complex multi-object prompts

Existing methods lack fine-grained structural guidance

Need better adherence to instance-level attributes and relations

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses image-based structural guidance

Integrates LLM-based instance-level instructions

Enhances adherence to complex text prompts

🔎 Similar Papers

No similar papers found.