Composite Sketch+Text Queries for Retrieving Objects with Elusive Names and Complex Interactions

📅 2024-03-24
🏛️ AAAI Conference on Artificial Intelligence
📈 Citations: 1
Influential: 0
🤖 AI Summary
To address the challenge of retrieving complex interactive objects (e.g., "a numbat digging on the ground") that are hard for non-native speakers or vocabulary-limited users to name yet easy to sketch, or easy to describe yet hard to sketch, this paper introduces Composite Sketch-Text-Based Image Retrieval (CSTBIR), a novel cross-modal retrieval task. The authors present the first large-scale CSTBIR benchmark, comprising 2 million composite queries over 108K images. To tackle the task, they propose STNet, a multimodal Transformer architecture trained with multiple objectives: it jointly models sketch-image local alignment and text-image joint embedding, with auxiliary losses for object localization, cross-modal matching, and interaction modeling. Extensive experiments show that STNet significantly outperforms state-of-the-art text-based (TBIR), sketch-based (SBIR), and hybrid retrieval methods on composite queries. Both the code and the dataset are publicly released to advance fine-grained cross-modal understanding research.

📝 Abstract
Non-native speakers with limited vocabulary often struggle to name specific objects despite being able to visualize them, e.g., people outside Australia searching for ‘numbats.’ Further, users may want to search for such elusive objects with difficult-to-sketch interactions, e.g., “numbat digging in the ground.” In such common but complex situations, users desire a search interface that accepts composite multimodal queries comprising hand-drawn sketches of “difficult-to-name but easy-to-draw” objects and text describing “difficult-to-sketch but easy-to-verbalize” object's attributes or interaction with the scene. This novel problem statement distinctly differs from the previously well-researched TBIR (text-based image retrieval) and SBIR (sketch-based image retrieval) problems. To study this under-explored task, we curate a dataset, CSTBIR (Composite Sketch+Text Based Image Retrieval), consisting of ~2M queries and 108K natural scene images. Further, as a solution to this problem, we propose a pretrained multimodal transformer-based baseline, STNet (Sketch+Text Network), that uses a hand-drawn sketch to localize relevant objects in the natural scene image, and encodes the text and image to perform image retrieval. In addition to contrastive learning, we propose multiple training objectives that improve the performance of our model. Extensive experiments show that our proposed method outperforms several state-of-the-art retrieval methods for text-only, sketch-only, and composite query modalities. We make the dataset and code available at: https://vl2g.github.io/projects/cstbir.
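The composite-query idea from the abstract can be illustrated with a minimal, hypothetical retrieval sketch. This is not the authors' STNet implementation: it simply assumes a sketch encoder and a text encoder have already produced fixed-size embeddings, fuses them by averaging, and ranks gallery images by cosine similarity. The function names (`l2_normalize`, `retrieve`) and the fusion scheme are illustrative assumptions only.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def retrieve(sketch_emb, text_emb, image_embs, top_k=3):
    """Rank gallery images against a composite sketch+text query.

    sketch_emb, text_emb: 1-D query embeddings (assumed pre-computed).
    image_embs: 2-D array, one row per gallery image.
    Returns the indices of the top_k images and all similarity scores.
    """
    # Fuse the two query modalities; a simple average stands in for a
    # learned fusion module such as a multimodal transformer.
    query = l2_normalize(sketch_emb + text_emb)
    gallery = l2_normalize(image_embs)
    scores = gallery @ query          # cosine similarity per image
    ranking = np.argsort(-scores)     # descending order of similarity
    return ranking[:top_k], scores
```

In a contrastive training setup like the one the paper describes, these same similarity scores would feed a loss that pulls matching query-image pairs together and pushes non-matching pairs apart; here they are used only for ranking at query time.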
Problem

Research questions and friction points this paper is trying to address.

Develops a multimodal search interface for elusive objects.
Combines sketches and text for complex object retrieval.
Introduces the CSTBIR dataset and the STNet model for retrieval.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal composite query interface
Pretrained transformer for localization
Contrastive learning with multiple objectives
Prajwal Gatti
PhD student, University of Bristol
Computer Vision · Deep Learning
Kshitij Parikh
Indian Institute of Technology Jodhpur
Dhriti Prasanna Paul
Indian Institute of Technology Jodhpur
Manish Gupta
Microsoft
Anand Mishra
Indian Institute of Technology Jodhpur
Computer Vision · Machine Learning