SketchFusion: Learning Universal Sketch Features through Fusing Foundation Models

📅 2025-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the performance degradation of foundation models in sketch understanding, caused by the inherent abstraction and sparsity of sketches. We propose a dynamic feature fusion framework that synergistically integrates Stable Diffusion (SD) and CLIP. For the first time, we identify and characterize SD's frequency-domain bias when processing sketches, and dynamically inject CLIP's semantic features into SD's denoising process to compensate for it. Our framework incorporates cross-layer feature injection, semantic-level adaptive aggregation, and frequency-domain-driven calibration, yielding the first general-purpose sketch foundation representation suited to multiple tasks. Extensive experiments demonstrate consistent improvements: +3.35% in sketch retrieval, +1.06% in sketch recognition, +29.42% in sketch segmentation, and +21.22% in sketch correspondence learning, establishing new state-of-the-art results across all four benchmarks.

📝 Abstract
While foundation models have revolutionised computer vision, their effectiveness for sketch understanding remains limited by the unique challenges of abstract, sparse visual inputs. Through systematic analysis, we uncover two fundamental limitations: Stable Diffusion (SD) struggles to extract meaningful features from abstract sketches (unlike its success with photos), and exhibits a pronounced frequency-domain bias that suppresses essential low-frequency components needed for sketch understanding. Rather than costly retraining, we address these limitations by strategically combining SD with CLIP, whose strong semantic understanding naturally compensates for SD's spatial-frequency biases. By dynamically injecting CLIP features into SD's denoising process and adaptively aggregating features across semantic levels, our method achieves state-of-the-art performance in sketch retrieval (+3.35%), recognition (+1.06%), segmentation (+29.42%), and correspondence learning (+21.22%), demonstrating the first truly universal sketch feature representation in the era of foundation models.
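The abstract's frequency-domain-driven calibration can be illustrated with a toy sketch: transform a 1-D feature vector to the frequency domain, reweight the low-frequency bins that the paper reports SD suppresses, and transform back. This is a hypothetical stand-in using a naive DFT, not the paper's actual calibration module; the gain and cutoff values are illustrative assumptions.

```python
import cmath

def dft(x):
    # Naive discrete Fourier transform of a real-valued 1-D signal.
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    # Inverse DFT; the imaginary residue of a real signal is dropped.
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)).real / N
            for n in range(N)]

def frequency_calibrate(feat, low_gain=1.5, cutoff=2):
    """Boost the low-frequency bins of a 1-D feature vector.

    Hypothetical illustration of frequency-domain calibration:
    bins with |frequency| <= cutoff are amplified by low_gain so the
    low-frequency content is not suppressed downstream.
    """
    X = dft(feat)
    N = len(X)
    out = []
    for k, v in enumerate(X):
        f = min(k, N - k)  # bins k and N-k share the same |frequency|
        out.append(v * (low_gain if f <= cutoff else 1.0))
    return idft(out)
```

With `low_gain=1.0` the transform round-trips, so the calibration is a pure reweighting of spectral energy rather than a lossy filter.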
Problem

Research questions and friction points this paper is trying to address.

Overcome limitations of foundation models in sketch understanding.
Address Stable Diffusion's difficulty with abstract sketch features.
Enhance sketch retrieval, recognition, segmentation, and correspondence learning.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines Stable Diffusion with CLIP for sketch understanding.
Dynamically injects CLIP features into the denoising process.
Adaptively aggregates features across semantic levels.
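The two mechanisms listed above can be sketched in a few lines: blending a CLIP semantic feature into an SD denoising-layer feature (injection), then softmax-weighting per-level features (adaptive aggregation). This is a minimal toy model under assumed names; `alpha` and the level scores are illustrative, not the paper's learned parameters.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of level scores.
    m = max(scores)
    e = [math.exp(s - m) for s in scores]
    total = sum(e)
    return [v / total for v in e]

def inject(sd_feat, clip_feat, alpha=0.3):
    """Blend a CLIP feature into one SD layer's feature.
    alpha is a hypothetical injection strength."""
    return [(1 - alpha) * s + alpha * c for s, c in zip(sd_feat, clip_feat)]

def aggregate(level_feats, level_scores):
    """Softmax-weighted sum over semantic levels (adaptive aggregation)."""
    w = softmax(level_scores)
    dim = len(level_feats[0])
    return [sum(w[i] * level_feats[i][d] for i in range(len(level_feats)))
            for d in range(dim)]
```

In the paper the injection happens dynamically inside SD's denoising process and the aggregation weights are learned; here both are reduced to fixed arithmetic to make the data flow explicit.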
🔎 Similar Papers
No similar papers found.