Component-Aware Sketch-to-Image Generation Using Self-Attention Encoding and Coordinate-Preserving Fusion

📅 2026-03-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses key challenges in image generation from hand-drawn sketches (detail loss, spatial misalignment, and poor cross-domain adaptability) that stem from the abstractness, sparsity, and stylistic diversity of sketches. To overcome these issues, the authors propose a two-stage generative framework: first, a Self-Attention-based Autoencoder Network (SA2N) extracts local semantic and structural features from component-wise sketch regions; second, a Coordinate-Preserving Gated Fusion (CGF) module integrates them while maintaining spatial layout integrity, followed by an iterative refinement stage using a StyleGAN2-based Spatially Adaptive Refinement Revisor (SARR). Through this component-aware mechanism, coordinate-preserving fusion strategy, and spatially guided refinement pipeline, the method improves both the fidelity and the semantic consistency of generated images. Extensive experiments demonstrate superior performance over state-of-the-art GAN and diffusion models on both facial and non-facial datasets, with improvements of 21% in FID, 58% in IS, 41% in KID, and 20% in SSIM on CelebAMask-HQ.
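The core operation behind an SA2N-style encoder is self-attention over component-region features. As a rough illustration only (the function name, shapes, and use of plain numpy are assumptions, not the paper's implementation), scaled dot-product self-attention over a set of sketch-component embeddings can be sketched as:

```python
import numpy as np

def self_attention(x):
    """Scaled dot-product self-attention over n component embeddings.

    x: (n, d) array, one d-dimensional feature vector per sketch component.
    Returns an (n, d) array in which each component's feature is a
    similarity-weighted mix of all component features.
    """
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                   # (n, n) pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ x                              # attend: mix component features

# Example: 4 hypothetical component embeddings of dimension 8
rng = np.random.default_rng(0)
patches = rng.normal(size=(4, 8))
attended = self_attention(patches)
```

A real encoder would learn separate query/key/value projections and stack such layers; this sketch keeps only the attention arithmetic itself.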

📝 Abstract
Translating freehand sketches into photorealistic images remains a fundamental challenge in image synthesis, particularly due to the abstract, sparse, and stylistically diverse nature of sketches. Existing approaches, including GAN-based and diffusion-based models, often struggle to reconstruct fine-grained details, maintain spatial alignment, or adapt across different sketch domains. In this paper, we propose a component-aware, self-refining framework for sketch-to-image generation that addresses these challenges through a novel two-stage architecture. A Self-Attention-based Autoencoder Network (SA2N) first captures localised semantic and structural features from component-wise sketch regions, while a Coordinate-Preserving Gated Fusion (CGF) module integrates these into a coherent spatial layout. Finally, a Spatially Adaptive Refinement Revisor (SARR), built on a modified StyleGAN2 backbone, enhances realism and consistency through iterative refinement guided by spatial context. Extensive experiments across both facial (CelebAMask-HQ, CUFSF) and non-facial (Sketchy, ChairsV2, ShoesV2) datasets demonstrate the robustness and generalizability of our method. The proposed framework consistently outperforms state-of-the-art GAN and diffusion models, achieving significant gains in image fidelity, semantic accuracy, and perceptual quality. On CelebAMask-HQ, our model improves over prior methods by 21% (FID), 58% (IS), 41% (KID), and 20% (SSIM). These results, along with higher efficiency and visual coherence across diverse domains, position our approach as a strong candidate for applications in forensics, digital art restoration, and general sketch-based image synthesis.
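The coordinate-preserving gated fusion step the abstract describes can be illustrated at a high level: component features are blended into a global feature map only at their original spatial coordinates, with a per-location gate controlling the blend. The following minimal numpy sketch is a hypothetical rendering of that idea; the gate parameterisation, the binary mask, and the shapes are illustrative assumptions, not the paper's CGF module.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_fusion(global_feat, comp_feat, comp_mask, w=1.0, b=0.0):
    """Coordinate-preserving gated fusion (illustrative sketch).

    global_feat, comp_feat: (H, W, C) feature maps.
    comp_mask: (H, W) binary mask marking where the component sits, so
               fused features stay at the component's original coordinates.
    A sigmoid gate decides, per location and channel, how much of the
    component feature to blend in; outside the mask the global features
    pass through unchanged.
    """
    gate = sigmoid(w * comp_feat + b) * comp_mask[..., None]
    return gate * comp_feat + (1.0 - gate) * global_feat

# Example: fuse a 2x3 component map into a 2x3 global map at one location
H, W, C = 2, 3, 4
global_feat = np.full((H, W, C), 0.5)
comp_feat = np.ones((H, W, C))
mask = np.zeros((H, W))
mask[0, 0] = 1.0  # component occupies only this coordinate
fused = gated_fusion(global_feat, comp_feat, mask)
```

In the paper's pipeline the gate would be learned; the point of the sketch is only that the mask ties fusion to fixed spatial coordinates, which is what preserves layout.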
Problem

Research questions and friction points this paper is trying to address.

sketch-to-image generation
photorealistic image synthesis
spatial alignment
fine-grained detail reconstruction
cross-domain sketch adaptation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-Attention Encoding
Coordinate-Preserving Fusion
Component-Aware Generation
Sketch-to-Image Synthesis
Spatially Adaptive Refinement
Ali Zia
School of Computing, Engineering and Mathematical Sciences, La Trobe University

Muhammad Umer Ramzan
Gujranwala Institute of Future Technologies (GIFT) University

Usman Ali
Gujranwala Institute of Future Technologies (GIFT) University

Muhammad Faheem
Gujranwala Institute of Future Technologies (GIFT) University

Abdelwahed Khamis
Research Scientist, CSIRO
Multimodal Sensing, Device-Free Sensing, Mobile Computing, Machine Learning

Shahnawaz Qureshi
Sino-Pak Centre for Artificial Intelligence, Pak-Austria Fachhochschule Institute of Applied Sciences and Technology