🤖 AI Summary
Diffusion models struggle with low prompt fidelity in pixel-level hand-drawn sketch generation, a problem compounded by the scarcity of fine-grained sketch–text paired data. To address these challenges, this paper proposes StableSketcher, built on three contributions: (1) SketchDUO, the first large-scale dataset of instance-level sketch–caption–visual question answering (VQA) triplets; (2) a VQA-guided reward function integrated into a reinforcement learning framework that optimizes sketch generation for prompt alignment; and (3) a fine-tuned variational autoencoder that improves latent-space decoding quality and sketch fidelity. Extensive experiments demonstrate that StableSketcher significantly outperforms the Stable Diffusion baseline in both stylistic fidelity and prompt consistency, yielding high semantic alignment between input text prompts and generated hand-drawn sketches. The framework thus bridges the gap between textual intent and expressive, controllable sketch synthesis.
📝 Abstract
Although recent advancements in diffusion models have significantly improved the quality of generated images, challenges remain in synthesizing pixel-based human-drawn sketches, a representative example of abstract expression. To tackle these challenges, we propose StableSketcher, a novel framework that empowers diffusion models to generate hand-drawn sketches with high prompt fidelity. Within this framework, we fine-tune the variational autoencoder to optimize latent decoding, enabling it to better capture the characteristics of sketches. In parallel, we integrate a new reward function for reinforcement learning based on visual question answering, which improves text-image alignment and semantic consistency. Extensive experiments demonstrate that StableSketcher generates sketches with improved stylistic fidelity and better prompt alignment than the Stable Diffusion baseline. Additionally, we introduce SketchDUO, to the best of our knowledge the first dataset comprising instance-level sketches paired with captions and question-answer pairs, addressing the limitations of existing datasets that rely on image-label pairs. Our code and dataset will be made publicly available upon acceptance.
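To illustrate the idea of a VQA-based reward, here is a minimal sketch (not the authors' implementation): the reward is taken to be the fraction of a sample's question-answer pairs that a VQA model answers correctly on the generated image. The function names and the dictionary-backed stand-in for the VQA model are assumptions for illustration only; in practice the reward would query a real vision-language model on the rendered sketch.

```python
def vqa_answer(image, question):
    # Placeholder for a real VQA model queried with the generated sketch.
    # Here the "image" is a dict carrying the answers the model would give.
    return image.get("answers", {}).get(question, "")

def vqa_reward(image, qa_pairs):
    """Return the fraction of QA pairs the VQA model answers correctly,
    usable as a scalar reward in reinforcement-learning fine-tuning."""
    if not qa_pairs:
        return 0.0
    correct = sum(
        vqa_answer(image, q).strip().lower() == a.strip().lower()
        for q, a in qa_pairs
    )
    return correct / len(qa_pairs)

# Toy example: the VQA model gets one of two questions right -> reward 0.5.
image = {"answers": {"What animal is drawn?": "cat", "How many ears?": "two"}}
qa = [("What animal is drawn?", "cat"), ("How many ears?", "three")]
print(vqa_reward(image, qa))  # 0.5
```

A reward of this shape is bounded in [0, 1] per sample, which keeps policy-gradient updates well scaled without extra normalization.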