🤖 AI Summary
This work addresses the practical limitations of text-to-image diffusion models—such as prompt sensitivity, semantic ambiguity, and generation artifacts—by introducing the first training-free multimodal agent framework. Built upon Qwen-VL, Qwen-Image, Qwen-Edit, and Qwen-Embedding, the framework establishes a multi-stage reasoning pipeline that leverages a vector database to store and retrieve historical experiences. This enables autonomous prompt refinement, defect detection, and fine-grained artifact correction. The approach supports end-to-end prompt-guided generation with iterative self-improvement, achieving an average VQA score of 0.884 on GenAIBench, substantially outperforming existing open-source and closed-source models as well as current agent-based solutions.
📝 Abstract
Text-to-image diffusion models have revolutionized generative AI, enabling high-quality and photorealistic image synthesis. However, their practical deployment remains hindered by several limitations: sensitivity to prompt phrasing, ambiguity in semantic interpretation (e.g., "mouse" as an animal vs. a computer peripheral), artifacts such as distorted anatomy, and the need for carefully engineered input prompts. Existing methods often require additional training and offer limited controllability, restricting their adaptability in real-world applications. We introduce Self-Improving Diffusion Agent (SIDiffAgent), a training-free agentic framework that leverages the Qwen family of models (Qwen-VL, Qwen-Image, Qwen-Edit, Qwen-Embedding) to address these challenges. SIDiffAgent autonomously manages prompt engineering, detects and corrects poor generations, and performs fine-grained artifact removal, yielding more reliable and consistent outputs. It further incorporates iterative self-improvement by storing a memory of previous experiences in a database, which is then used to inject prompt-based guidance at each stage of the agentic pipeline. SIDiffAgent achieved an average VQA score of 0.884 on GenAIBench, significantly outperforming open-source and proprietary models as well as existing agentic methods. We will publicly release our code upon acceptance.