🤖 AI Summary
In scene-level sketch-to-image retrieval, hand-drawn sketches are inherently ambiguous and noisy, and aligning their semantics and spatial layout with natural images is difficult. To address these challenges, this paper proposes a robust cross-modal alignment framework. Methodologically, it introduces a training objective explicitly designed to accommodate sketch diversity, revealing the critical impact of training strategy on retrieval performance. Rather than increasing model complexity, it achieves efficient alignment by jointly optimizing three ingredients: pretrained initialization, lightweight encoder refinement, and a customized contrastive loss. The proposed approach achieves state-of-the-art performance on FS-COCO and SketchyCOCO, significantly improving retrieval robustness and generalization. Moreover, it pushes evaluation toward realistic sketch scenarios, narrowing the gap between controlled benchmarks and practical deployment.
📝 Abstract
The goal of Scene-level Sketch-Based Image Retrieval is to retrieve natural images matching the overall semantics and spatial layout of a free-hand sketch. Unlike prior work focused on architectural augmentations of retrieval models, we emphasize the inherent ambiguity and noise present in real-world sketches. This insight motivates a training objective that is explicitly designed to be robust to sketch variability. We show that with an appropriate combination of pre-training, encoder architecture, and loss formulation, it is possible to achieve state-of-the-art performance without introducing additional complexity. Extensive experiments on the challenging FS-COCO and the widely used SketchyCOCO datasets confirm the effectiveness of our approach, underline the critical role of training design in cross-modal retrieval, and highlight the need for improved evaluation scenarios in scene-level SBIR.
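For context, the sketch below shows a generic symmetric InfoNCE-style contrastive objective of the kind the abstract alludes to for sketch-image alignment. The paper's exact "customized" loss is not specified here, so the robustness mechanism shown (label smoothing over the one-hot match targets), the temperature value, and all function and variable names are illustrative assumptions rather than the authors' formulation.

```python
# Minimal sketch of a symmetric contrastive loss for sketch-to-image
# retrieval, assuming CLIP-style sketch and image encoders that emit
# (batch, dim) embeddings. Label smoothing is one illustrative way to
# soften targets for ambiguous sketches; it is NOT the paper's method.
import torch
import torch.nn.functional as F

def contrastive_loss(sketch_emb: torch.Tensor,
                     image_emb: torch.Tensor,
                     temperature: float = 0.07,
                     label_smoothing: float = 0.1) -> torch.Tensor:
    # L2-normalize so the dot product equals cosine similarity.
    sketch_emb = F.normalize(sketch_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)

    # Pairwise similarity logits; the diagonal holds the matching pairs.
    logits = sketch_emb @ image_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy (sketch->image and image->sketch); smoothed
    # targets avoid over-penalizing plausible near-matches.
    loss_s2i = F.cross_entropy(logits, targets, label_smoothing=label_smoothing)
    loss_i2s = F.cross_entropy(logits.t(), targets, label_smoothing=label_smoothing)
    return 0.5 * (loss_s2i + loss_i2s)
```

Softening the one-hot targets is one simple way such an objective can tolerate sketch variability: an ambiguous scene sketch that resembles several gallery images is not forced to match its paired image with full probability mass.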