🤖 AI Summary
Current text-to-image models struggle to accurately represent physical object states, such as "a table without a bottle" or "an empty cup", and in particular fail to capture abstract states involving negation or emptiness. To address this, we propose the first fully automated synthetic data generation method designed specifically for object state modeling, producing high-quality, fine-grained state annotations. We also introduce the first dedicated evaluation benchmark for this task: 200 fine-grained prompts covering diverse everyday object states. Using LoRA-based fine-tuning of open-source diffusion models and GPT-4o-mini as a vision-language alignment evaluator, we establish a dual-track evaluation framework that combines GenAI-Bench with our state-centric benchmark. Experiments show an average absolute improvement of over 8% on GenAI-Bench and over 24% on our prompt set. All generated data, source code, and evaluation protocols are publicly released.
📝 Abstract
Current text-to-image generative models struggle to accurately represent object states (e.g., "a table without a bottle," "an empty tumbler"). In this work, we first design a fully automatic pipeline to generate high-quality synthetic data that accurately captures objects in varied states. Next, we fine-tune several open-source text-to-image models on this synthetic data. We evaluate the fine-tuned models by quantifying the alignment of the generated images with their prompts using GPT-4o-mini, achieving an average absolute improvement of over 8% across four models on the public GenAI-Bench dataset. We also curate a collection of 200 prompts focused on common objects in various physical states, on which we demonstrate an average improvement of over 24% over the baseline. We release all evaluation prompts and code.
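To make the evaluation step concrete, here is a minimal sketch of prompt-image alignment scoring with GPT-4o-mini. The abstract does not specify the scoring rubric or API usage, so the rating prompt, the 0-100 scale, and the `score_alignment` / `parse_score` helpers below are illustrative assumptions, not the paper's actual protocol.

```python
import base64
import re


def parse_score(reply: str) -> float:
    """Extract the first number from the model's reply and clamp it to [0, 100].

    Assumes the evaluator was instructed to reply with a single numeric score.
    """
    m = re.search(r"\d+(?:\.\d+)?", reply)
    if m is None:
        raise ValueError(f"no numeric score found in reply: {reply!r}")
    return min(max(float(m.group()), 0.0), 100.0)


def score_alignment(client, image_path: str, prompt: str) -> float:
    """Ask GPT-4o-mini to rate how well an image matches its prompt.

    `client` is an OpenAI client instance; the rating instruction is a
    hypothetical rubric, not the one used in the paper.
    """
    with open(image_path, "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": (f"On a scale of 0-100, how well does this image match "
                          f"the prompt {prompt!r}? Reply with only the number.")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{img_b64}"}},
            ],
        }],
    )
    return parse_score(resp.choices[0].message.content)
```

Averaging such scores over a prompt set (e.g., GenAI-Bench or the 200 state-centric prompts) would yield the kind of per-model alignment numbers the abstract compares.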