Automated Prompt Generation for Creative and Counterfactual Text-to-image Synthesis

📅 2025-09-23
🤖 AI Summary
Text-to-image generation still struggles with fine-grained controllability, particularly counterfactual controllability: synthesizing images that violate commonsense patterns such as typical size relations. This work focuses on size-counterfactual scenarios (e.g., “a tiny walrus beside a gigantic button”) and introduces the first text–image paired dataset explicitly designed for counterfactual size. It proposes a three-module automatic prompt engineering framework: (1) an image evaluator, built by extending Grounded SAM, that identifies successful generations and guides dataset construction; (2) a supervised prompt rewriter that produces revised prompts; and (3) a DPO-trained ranker that selects the best revised prompt. The method outperforms state-of-the-art baselines and ChatGPT-4o, and the image evaluator achieves a 114% improvement over its Grounded SAM backbone. The work provides a dataset and a foundation for future research on counterfactual controllability.
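
The summary mentions a DPO-trained ranker but does not spell out the objective. As a rough illustration, the standard DPO loss for a single preference pair can be sketched as below. The function name, the argument names, and the value of `beta` are assumptions made for this example; the paper's actual training setup is not described on this page.

```python
import math


def dpo_pair_loss(policy_logp_chosen: float,
                  policy_logp_rejected: float,
                  ref_logp_chosen: float,
                  ref_logp_rejected: float,
                  beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair of revised prompts.

    Inputs are sequence log-probabilities under the trainable policy and
    under a frozen reference model. beta=0.1 is an assumed default.
    """
    # Implicit reward margin: how much more strongly the policy prefers the
    # chosen prompt over the rejected one, relative to the reference model.
    margin = ((policy_logp_chosen - ref_logp_chosen)
              - (policy_logp_rejected - ref_logp_rejected))
    z = beta * margin
    # -log(sigmoid(z)), computed in a numerically stable way.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

When the policy and the reference agree on both prompts the margin is zero and the loss equals log 2; the loss falls as the policy shifts probability toward the chosen prompt.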

📝 Abstract
Text-to-image generation has advanced rapidly with large-scale multimodal training, yet fine-grained controllability remains a critical challenge. Counterfactual controllability, defined as the capacity to deliberately generate images that contradict common-sense patterns, is especially difficult but plays a crucial role in enabling creativity and exploratory applications. In this work, we address this gap with a focus on counterfactual size (e.g., generating a tiny walrus beside a giant button) and propose an automatic prompt engineering framework that adapts base prompts into revised prompts for counterfactual images. The framework comprises three components: an image evaluator that guides dataset construction by identifying successful image generations, a supervised prompt rewriter that produces revised prompts, and a DPO-trained ranker that selects the optimal revised prompt. We construct the first counterfactual size text-image dataset and enhance the image evaluator by extending Grounded SAM with refinements, achieving a 114% improvement over its backbone. Experiments demonstrate that our method outperforms state-of-the-art baselines and ChatGPT-4o, establishing a foundation for future research on counterfactual controllability.
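
To make the rewrite-then-rank flow concrete, here is a minimal sketch of how a revised prompt might be selected at inference time. Everything in it is hypothetical: `select_revised_prompt`, `toy_rewrite`, `toy_score`, and `SIZE_CUES` are placeholders invented for illustration, standing in for the trained rewriter and the DPO-trained ranker, whose interfaces the abstract does not specify.

```python
from typing import Callable, List


def select_revised_prompt(base_prompt: str,
                          rewrite: Callable[[str], List[str]],
                          score: Callable[[str, str], float]) -> str:
    """Generate candidate rewrites and return the one the ranker scores highest."""
    candidates = rewrite(base_prompt)
    if not candidates:
        # Fall back to the unmodified prompt when the rewriter yields nothing.
        return base_prompt
    return max(candidates, key=lambda cand: score(base_prompt, cand))


# --- Toy stand-ins (assumptions, not the paper's models) -------------------
SIZE_CUES = ("tiny", "miniature", "gigantic", "enormous")


def toy_rewrite(prompt: str) -> List[str]:
    """Pretend rewriter: returns two hand-written variants of the prompt."""
    return [
        f"{prompt}, the walrus is tiny",
        f"{prompt}, a miniature walrus dwarfed by an enormous button",
    ]


def toy_score(base_prompt: str, candidate: str) -> float:
    """Pretend ranker: counts how many distinct size cues the candidate uses."""
    return float(sum(cue in candidate for cue in SIZE_CUES))


if __name__ == "__main__":
    base = "a tiny walrus beside a gigantic button"
    print(select_revised_prompt(base, toy_rewrite, toy_score))
```

The image evaluator does not appear in this loop because, per the abstract, it guides dataset construction by identifying successful generations rather than selecting prompts.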
Problem

Research questions and friction points this paper is trying to address.

Automated prompt generation for counterfactual image synthesis
Enhancing fine-grained controllability in text-to-image generation
Addressing counterfactual size challenges through prompt engineering
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated prompt engineering framework for counterfactual image synthesis
Three-component system with evaluator, rewriter, and ranker modules
Enhanced Grounded SAM achieves 114% improvement over backbone
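
One simple way to picture what a segmentation-based evaluator can check is to compare the pixel areas of two object masks. The sketch below is an assumption-laden illustration, not the paper's evaluator: the mask representation, the function names, and the `min_ratio` threshold are all hypothetical, and the refinements the authors add on top of Grounded SAM are not described on this page.

```python
from typing import Sequence

# A binary mask as rows of 0/1 values (assumed format for this sketch).
Mask = Sequence[Sequence[int]]


def mask_area(mask: Mask) -> int:
    """Count the foreground pixels in a binary mask."""
    return sum(sum(1 for px in row if px) for row in mask)


def is_size_counterfactual(shrunk_obj_mask: Mask,
                           enlarged_obj_mask: Mask,
                           min_ratio: float = 1.5) -> bool:
    """Return True if the object meant to look enlarged covers at least
    `min_ratio` times as many pixels as the object meant to look shrunk.

    min_ratio=1.5 is an arbitrary threshold chosen for illustration.
    """
    shrunk_area = mask_area(shrunk_obj_mask)
    enlarged_area = mask_area(enlarged_obj_mask)
    if shrunk_area == 0 or enlarged_area == 0:
        # A missing object counts as a failed generation.
        return False
    return enlarged_area >= min_ratio * shrunk_area
```

Raw pixel area ignores depth and occlusion, so a check like this is only a first approximation of apparent size.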
Aleksa Jelaca
KU Leuven, Leuven, Belgium
Ying Jiao
KU Leuven, Leuven, Belgium
Chang Tian
KU Leuven, Leuven, Belgium
Marie-Francine Moens
Professor of Computer Science KU Leuven
Natural language processing and understanding, machine learning, information retrieval, multimedia