🤖 AI Summary
Text-to-image generation faces significant bottlenecks in fine-grained controllability—particularly counterfactual controllability, such as synthesizing images that violate commonsense size relations. This work focuses on size-counterfactual scenarios (e.g., “a tiny walrus beside a gigantic button”) and introduces the first text–image paired dataset explicitly designed for counterfactual size reasoning. We propose a three-module automatic prompt engineering framework: (1) an image evaluator built on an extended Grounded SAM for precise spatial assessment; (2) a supervised prompt rewriter that improves instruction fidelity; and (3) a DPO-based prompt ranker that balances plausibility and creativity. The framework jointly optimizes prompt quality for counterfactual generation, outperforming state-of-the-art methods and ChatGPT-4o. Our image evaluator achieves a 114% relative accuracy gain over its Grounded SAM backbone. This work establishes a new benchmark and a scalable technical pathway for counterfactual controllability in generative vision.
📝 Abstract
Text-to-image generation has advanced rapidly with large-scale multimodal training, yet fine-grained controllability remains a critical challenge. Counterfactual controllability—the capacity to deliberately generate images that contradict commonsense patterns—is especially difficult, yet it plays a crucial role in enabling creative and exploratory applications. In this work, we address this gap with a focus on counterfactual size (e.g., generating a tiny walrus beside a giant button) and propose an automatic prompt engineering framework that adapts base prompts into revised prompts for counterfactual images. The framework comprises three components: an image evaluator that guides dataset construction by identifying successful image generations, a supervised prompt rewriter that produces revised prompts, and a DPO-trained ranker that selects the optimal revised prompt. We construct the first counterfactual-size text–image dataset and enhance the image evaluator by extending Grounded SAM, achieving a 114% improvement over its backbone. Experiments demonstrate that our method outperforms state-of-the-art baselines and ChatGPT-4o, establishing a foundation for future research on counterfactual controllability.
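The three components above form a rewrite → rank → evaluate loop. A minimal sketch of that control flow is shown below; all function names and bodies are hypothetical stand-ins (the actual system uses a trained supervised rewriter, a DPO-trained ranker, and a Grounded-SAM-based evaluator, none of which are reproduced here):

```python
# Hypothetical sketch of the three-module prompt-engineering loop.
# Each function is a toy stand-in for a learned component in the paper.

def rewrite_prompt(base_prompt: str, n_candidates: int = 3) -> list[str]:
    """Supervised prompt rewriter (stand-in): expand the base prompt into
    candidate revisions that emphasize the counterfactual size relation."""
    emphases = [
        "extreme scale contrast",
        "the small subject dwarfed by the large object",
        "surreal size inversion",
    ]
    return [f"{base_prompt}, {e}, photorealistic" for e in emphases[:n_candidates]]

def rank_prompts(candidates: list[str]) -> str:
    """DPO-trained ranker (stand-in): score candidates and return the best.
    Here, a toy heuristic rewards explicit size-related keywords."""
    def score(p: str) -> int:
        keywords = ("tiny", "gigantic", "dwarfed", "scale", "inversion")
        return sum(k in p.lower() for k in keywords)
    return max(candidates, key=score)

def evaluate_image(region_areas: dict[str, float], small: str, large: str) -> bool:
    """Image evaluator (stand-in for the extended Grounded SAM module):
    check that segmented region areas respect the counterfactual relation,
    i.e., the intended 'small' entity occupies less area than the 'large' one."""
    return region_areas.get(small, 0.0) < region_areas.get(large, 0.0)

# End-to-end usage on the paper's running example:
base = "a tiny walrus beside a gigantic button"
best_prompt = rank_prompts(rewrite_prompt(base))
# Pretend per-entity segmentation areas (pixel fractions) from a generated image:
success = evaluate_image({"walrus": 0.05, "button": 0.40}, "walrus", "button")
```

In the real pipeline the evaluator's success signal also drives dataset construction: only generations it verifies are kept as supervision for the rewriter.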