AI Summary
Small-object image retrieval (SoIR) in cluttered scenes requires a single, compact image embedding that efficiently represents multiple objects while enabling scalable search, a longstanding challenge. Method: We propose the Multi-object Attention Optimization (MaO) framework, featuring: (1) a novel multi-object collaborative pretraining paradigm that explicitly models multiple objects within an image; (2) a mask-guided attention-based feature fusion mechanism for fine-grained, object-level feature extraction and aggregation; and (3) generation of a unified image embedding with strong discriminability and generalizability. MaO supports zero-shot transfer and lightweight fine-tuning. Results: Evaluated on a newly constructed SoIR benchmark, MaO significantly outperforms existing methods, achieving absolute mAP improvements of 12.7% (zero-shot) and 9.3% (fine-tuned), demonstrating its effectiveness and practicality for real-world SoIR tasks.
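For readers less familiar with the metric, mean average precision (mAP) can be computed as in the minimal sketch below. It assumes each query's ranking covers the full gallery, so every relevant image appears somewhere in the list. The function names are illustrative and are not taken from the paper's evaluation code.

```python
def average_precision(ranked_relevance):
    """AP for one query.

    ranked_relevance: list of booleans, ordered from best to worst match,
    True where the retrieved image contains the query object.
    Assumes the ranking covers the whole gallery.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            # Precision at this rank, counted only at relevant positions.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(per_query_relevance):
    """mAP: the mean of the per-query AP values."""
    aps = [average_precision(r) for r in per_query_relevance]
    return sum(aps) / len(aps) if aps else 0.0
```

For example, a ranking of `[True, False, True]` has precision 1/1 at the first hit and 2/3 at the second, so its AP is their mean.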
Abstract
We address the challenge of Small Object Image Retrieval (SoIR), where the goal is to retrieve images containing a specific small object in a cluttered scene. The key challenge in this setting is constructing a single image descriptor that effectively represents all objects in the image while keeping search scalable and efficient. In this paper, we first analyze the limitations of existing methods on this challenging task and then introduce new benchmarks to support SoIR evaluation. Next, we introduce Multi-object Attention Optimization (MaO), a novel retrieval framework that incorporates a dedicated multi-object pre-training phase. This is followed by a refinement process that uses object masks for attention-based feature extraction and integrates the extracted features into a single unified image descriptor. Our MaO approach significantly outperforms existing retrieval methods and strong baselines, achieving notable improvements in both the zero-shot and the lightweight multi-object fine-tuning settings. We hope this work will lay the groundwork for, and inspire, further research to enhance retrieval performance on this highly practical task.
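The idea of one descriptor per image built from object masks can be illustrated with a rough sketch. This is not the MaO attention optimization itself: it swaps in plain mask-weighted average pooling and a simple mean over objects as stand-ins, and every name here (`masked_pool`, `image_descriptor`, `rank_gallery`) is hypothetical. Patch features and masks are assumed to be given as flat Python lists.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def masked_pool(patch_features, mask):
    """Average patch features, weighted by per-patch mask values in [0, 1]."""
    dim = len(patch_features[0])
    total = sum(mask)
    if total == 0:
        return [0.0] * dim
    pooled = [0.0] * dim
    for feat, weight in zip(patch_features, mask):
        for i in range(dim):
            pooled[i] += weight * feat[i]
    return [p / total for p in pooled]


def image_descriptor(patch_features, object_masks):
    """Fuse one pooled vector per object mask into a single unit-length descriptor.

    Assumption: a plain mean over objects stands in for the paper's fusion step.
    """
    if not object_masks:
        raise ValueError("at least one object mask is required")
    dim = len(patch_features[0])
    object_vecs = [l2_normalize(masked_pool(patch_features, m)) for m in object_masks]
    fused = [sum(v[i] for v in object_vecs) / len(object_vecs) for i in range(dim)]
    return l2_normalize(fused)


def rank_gallery(query_vec, gallery_descriptors):
    """Return gallery indices sorted by decreasing cosine similarity to the query."""
    q = l2_normalize(query_vec)
    scores = [
        sum(a * b for a, b in zip(q, l2_normalize(g))) for g in gallery_descriptors
    ]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

Because every image collapses to one vector, the gallery can be indexed once and searched with a single similarity computation per image, which is the scalability property the abstract emphasizes.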