🤖 AI Summary
Text-to-image generation faces significant bottlenecks in fine-grained controllability—particularly counterfactual controllability, such as synthesizing images that violate commonsense size relations. This work focuses on size-counterfactual scenarios (e.g., “a tiny walrus beside a gigantic button”) and introduces the first text–image paired dataset explicitly designed for counterfactual size reasoning. We propose a three-module automatic prompt engineering framework: (1) an image evaluator built on an extended Grounded SAM for precise spatial assessment; (2) a supervised prompt rewriter that improves instruction fidelity; and (3) a DPO-based prompt ranker that balances plausibility and creativity. The framework jointly optimizes prompt quality for counterfactual generation, outperforming state-of-the-art methods and ChatGPT-4o. Our image evaluator achieves a 114% relative accuracy gain over its Grounded SAM backbone. This work establishes a new benchmark and a scalable technical pathway for counterfactual controllability in generative vision.
📝 Abstract
Text-to-image generation has advanced rapidly with large-scale multimodal training, yet fine-grained controllability remains a critical challenge. Counterfactual controllability—the capacity to deliberately generate images that contradict commonsense patterns—is especially difficult, yet it plays a crucial role in enabling creative and exploratory applications. In this work, we address this gap with a focus on counterfactual size (e.g., generating a tiny walrus beside a giant button) and propose an automatic prompt engineering framework that adapts base prompts into revised prompts for counterfactual images. The framework comprises three components: an image evaluator that guides dataset construction by identifying successful image generations, a supervised prompt rewriter that produces revised prompts, and a DPO-trained ranker that selects the optimal revised prompt. We construct the first counterfactual-size text–image dataset and enhance the image evaluator by extending Grounded SAM, achieving a 114% improvement over its backbone. Experiments demonstrate that our method outperforms state-of-the-art baselines and ChatGPT-4o, establishing a foundation for future research on counterfactual controllability.
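The three components above form a rewrite → rank → evaluate loop. A minimal sketch of that control flow is shown below; all function names and bodies are hypothetical stand-ins (the actual system uses a trained supervised rewriter, a DPO-trained ranker, and a Grounded-SAM-based evaluator, none of which are reproduced here):

```python
# Hypothetical sketch of the three-module prompt-engineering loop.
# Each function is a toy stand-in for a learned component in the paper.

def rewrite_prompt(base_prompt: str, n_candidates: int = 3) -> list[str]:
    """Supervised prompt rewriter (stand-in): expand the base prompt into
    candidate revisions that emphasize the counterfactual size relation."""
    emphases = [
        "extreme scale contrast",
        "the small subject dwarfed by the large object",
        "surreal size inversion",
    ]
    return [f"{base_prompt}, {e}, photorealistic" for e in emphases[:n_candidates]]

def rank_prompts(candidates: list[str]) -> str:
    """DPO-trained ranker (stand-in): score candidates and return the best.
    Here, a toy heuristic rewards explicit size-related keywords."""
    def score(p: str) -> int:
        keywords = ("tiny", "gigantic", "dwarfed", "scale", "inversion")
        return sum(k in p.lower() for k in keywords)
    return max(candidates, key=score)

def evaluate_image(region_areas: dict[str, float], small: str, large: str) -> bool:
    """Image evaluator (stand-in for the extended Grounded SAM module):
    check that segmented region areas respect the counterfactual relation,
    i.e., the intended 'small' entity occupies less area than the 'large' one."""
    return region_areas.get(small, 0.0) < region_areas.get(large, 0.0)

# End-to-end usage on the paper's running example:
base = "a tiny walrus beside a gigantic button"
best_prompt = rank_prompts(rewrite_prompt(base))
# Pretend per-entity segmentation areas (pixel fractions) from a generated image:
success = evaluate_image({"walrus": 0.05, "button": 0.40}, "walrus", "button")
```

In the real pipeline the evaluator's success signal also drives dataset construction: only generations it verifies are kept as supervision for the rewriter.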