🤖 AI Summary
Current text-to-image models struggle to accurately represent physical object states, such as "a table without a bottle" or "an empty cup", and in particular fail to capture abstract states involving negation or emptiness. To address this, we propose the first fully automated synthetic data generation method designed specifically for object state modeling, producing high-quality, fine-grained state annotations. We also introduce the first dedicated evaluation benchmark for this task: 200 fine-grained prompts covering diverse everyday object states. Using LoRA-based fine-tuning of open-source diffusion models and GPT-4o-mini as a vision-language alignment evaluator, we establish a dual-track evaluation framework that combines GenAI-Bench with our state-centric benchmark. Experiments show an average absolute improvement of over 8% on GenAI-Bench and over 24% on our prompt set. All generated data, source code, and evaluation protocols are publicly released.
📝 Abstract
Current text-to-image generative models struggle to accurately represent object states (e.g., "a table without a bottle," "an empty tumbler"). In this work, we first design a fully automatic pipeline to generate high-quality synthetic data that accurately captures objects in varied states. Next, we fine-tune several open-source text-to-image models on this synthetic data. We evaluate the fine-tuned models by quantifying the alignment of the generated images with their prompts using GPT-4o-mini, achieving an average absolute improvement of over 8% across four models on the public GenAI-Bench dataset. We also curate a collection of 200 prompts focused on common objects in various physical states, on which we demonstrate an average improvement of over 24% over the baseline. We release all evaluation prompts and code.
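To make the evaluation step concrete, here is a minimal sketch of prompt-image alignment scoring with GPT-4o-mini. The abstract does not specify the scoring rubric or API usage, so the rating prompt, the 0-100 scale, and the `score_alignment` / `parse_score` helpers below are illustrative assumptions, not the paper's actual protocol.

```python
import base64
import re


def parse_score(reply: str) -> float:
    """Extract the first number from the model's reply and clamp it to [0, 100].

    Assumes the evaluator was instructed to reply with a single numeric score.
    """
    m = re.search(r"\d+(?:\.\d+)?", reply)
    if m is None:
        raise ValueError(f"no numeric score found in reply: {reply!r}")
    return min(max(float(m.group()), 0.0), 100.0)


def score_alignment(client, image_path: str, prompt: str) -> float:
    """Ask GPT-4o-mini to rate how well an image matches its prompt.

    `client` is an OpenAI client instance; the rating instruction is a
    hypothetical rubric, not the one used in the paper.
    """
    with open(image_path, "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": (f"On a scale of 0-100, how well does this image match "
                          f"the prompt {prompt!r}? Reply with only the number.")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{img_b64}"}},
            ],
        }],
    )
    return parse_score(resp.choices[0].message.content)
```

Averaging such scores over a prompt set (e.g., GenAI-Bench or the 200 state-centric prompts) would yield the kind of per-model alignment numbers the abstract compares.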