🤖 AI Summary
Diffusion models struggle with low prompt fidelity in pixel-level hand-drawn sketch generation, a problem compounded by the scarcity of fine-grained sketch–text paired data. To address these challenges, this paper proposes StableSketcher, built on three contributions: (1) SketchDUO, the first large-scale dataset of instance-level sketch–caption–visual question answering (VQA) triplets; (2) a VQA-guided reward function integrated into a reinforcement learning framework that optimizes sketch generation for prompt alignment; and (3) a fine-tuned variational autoencoder that improves latent-space decoding quality and sketch fidelity. Extensive experiments demonstrate that StableSketcher significantly outperforms the Stable Diffusion baseline in both stylistic fidelity and prompt consistency, yielding high semantic alignment between input text prompts and generated hand-drawn sketches. The framework thus bridges the gap between textual intent and expressive, controllable sketch synthesis.
📝 Abstract
Although recent advancements in diffusion models have significantly improved the quality of generated images, challenges remain in synthesizing pixel-based human-drawn sketches, a representative example of abstract expression. To tackle these challenges, we propose StableSketcher, a novel framework that empowers diffusion models to generate hand-drawn sketches with high prompt fidelity. Within this framework, we fine-tune the variational autoencoder to optimize latent decoding, enabling it to better capture the characteristics of sketches. In parallel, we integrate a new reward function for reinforcement learning based on visual question answering, which improves text-image alignment and semantic consistency. Extensive experiments demonstrate that StableSketcher generates sketches with improved stylistic fidelity and better prompt alignment than the Stable Diffusion baseline. Additionally, we introduce SketchDUO, to the best of our knowledge the first dataset comprising instance-level sketches paired with captions and question-answer pairs, addressing the limitations of existing datasets that rely on image-label pairs. Our code and dataset will be made publicly available upon acceptance.
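To illustrate the idea of a VQA-based reward, here is a minimal sketch (not the authors' implementation): the reward is taken to be the fraction of a sample's question-answer pairs that a VQA model answers correctly on the generated image. The function names and the dictionary-backed stand-in for the VQA model are assumptions for illustration only; in practice the reward would query a real vision-language model on the rendered sketch.

```python
def vqa_answer(image, question):
    # Placeholder for a real VQA model queried with the generated sketch.
    # Here the "image" is a dict carrying the answers the model would give.
    return image.get("answers", {}).get(question, "")

def vqa_reward(image, qa_pairs):
    """Return the fraction of QA pairs the VQA model answers correctly,
    usable as a scalar reward in reinforcement-learning fine-tuning."""
    if not qa_pairs:
        return 0.0
    correct = sum(
        vqa_answer(image, q).strip().lower() == a.strip().lower()
        for q, a in qa_pairs
    )
    return correct / len(qa_pairs)

# Toy example: the VQA model gets one of two questions right -> reward 0.5.
image = {"answers": {"What animal is drawn?": "cat", "How many ears?": "two"}}
qa = [("What animal is drawn?", "cat"), ("How many ears?", "three")]
print(vqa_reward(image, qa))  # 0.5
```

A reward of this shape is bounded in [0, 1] per sample, which keeps policy-gradient updates well scaled without extra normalization.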