🤖 AI Summary
This work addresses key challenges in open-world object detection, namely capturing fine-grained variations, recognizing rare categories, and coping with insufficient visual evidence in complex scenes, all of which stem from reliance on coarse-grained textual semantics and parametric knowledge. To overcome these limitations, the authors propose VL-SAM-v3, a unified framework that introduces, for the first time, a non-parametric mechanism guided by visual memory. This mechanism retrieves from an external visual memory bank to generate two visual priors: sparse instance-level spatial anchors and dense category-aware local context. Combined with a memory-guided prompt refinement strategy, the framework supports both open-vocabulary recognition and open-ended reasoning. Experiments demonstrate significant performance gains under zero-shot LVIS settings, particularly for rare categories, and show that the approach also adapts to stronger detectors such as SAM3, supporting its generality and efficacy.
📝 Abstract
Open-world object detection aims to localize and recognize objects beyond a fixed closed-set label space. It is commonly divided into two settings: open-vocabulary detection, which assumes a predefined category list at test time, and open-ended detection, which requires generating candidate categories during inference. Existing methods rely primarily on coarse textual semantics and parametric knowledge, which often provide insufficient visual evidence for fine-grained appearance variation, rare categories, and cluttered scenes. In this paper, we propose VL-SAM-v3, a unified framework that augments open-world detection with retrieval-grounded external visual memory. Specifically, once candidate categories are available, VL-SAM-v3 retrieves relevant visual prototypes from a non-parametric memory bank and transforms them into two complementary visual priors: sparse priors for instance-level spatial anchoring and dense priors for class-aware local context. These priors are integrated with the original detection prompts via Memory-Guided Prompt Refinement, enabling a shared retrieval-and-refinement mechanism that supports both open-vocabulary and open-ended inference. Extensive zero-shot experiments on LVIS show that VL-SAM-v3 consistently improves detection performance under both open-vocabulary and open-ended inference, with particularly strong gains on rare categories. Moreover, experiments with a stronger open-vocabulary detector (i.e., SAM3) validate the generality of the proposed retrieval-and-refinement mechanism.
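The retrieval step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the embedding model, the similarity measure, or the bank layout, so the cosine similarity, the `(label, embedding)` bank format, and the names `cosine` and `retrieve_prototypes` are all assumptions made for illustration.

```python
import math
from typing import List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def retrieve_prototypes(
    query: Vector,
    memory_bank: List[Tuple[str, Vector]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the k bank entries most similar to the query, best first.

    Hypothetical helper: the bank is a plain list, so it is non-parametric
    in the sense that entries can be added without retraining anything.
    """
    scored = [(label, cosine(query, emb)) for label, emb in memory_bank]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy bank with made-up 3-d embeddings (assumption: real prototypes would
# come from a visual encoder and be much higher-dimensional).
memory_bank = [
    ("corgi", [0.9, 0.1, 0.0]),
    ("husky", [0.7, 0.3, 0.1]),
    ("teapot", [0.0, 0.2, 0.9]),
]
query = [1.0, 0.0, 0.0]  # stand-in embedding for a candidate category
top = retrieve_prototypes(query, memory_bank, k=2)
```

In this sketch, covering a new rare category only requires appending a prototype to `memory_bank`. How the retrieved prototypes are turned into sparse and dense priors is not detailed in the abstract and is left out here.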