Boosting Segment Anything Model Towards Open-Vocabulary Learning

📅 2023-12-06
🏛️ arXiv.org
📈 Citations: 6
Influential: 1
🤖 AI Summary
SAM lacks semantic understanding and cannot perform open-vocabulary object detection on its own. Method: The authors propose Sambor, an end-to-end extension of SAM for open-vocabulary recognition. It introduces a semantic-aware SideFormer module that fuses SAM's visual features with comprehensive semantic information, an Open-set RPN that leverages SAM proposals as priors when generating candidate boxes, and joint training that generalizes both the localization and classification sub-tasks via cross-modal alignment. Contribution/Results: Sambor adapts SAM for open-vocabulary detection without fine-tuning the image encoder, supporting zero-shot localization and recognition from category names or natural-language descriptions. On the COCO and LVIS zero-shot detection benchmarks it is highly competitive with prior state-of-the-art methods, improving both localization and classification and demonstrating a viable path for vision foundation models to drive open-vocabulary learning.
📝 Abstract
The recent Segment Anything Model (SAM) has emerged as a new paradigmatic vision foundation model, showcasing potent zero-shot generalization and flexible prompting. Despite SAM finding applications and adaptations in various domains, its primary limitation lies in the inability to grasp object semantics. In this paper, we present Sambor to seamlessly integrate SAM with the open-vocabulary object detector in an end-to-end framework. While retaining all the remarkable capabilities inherent to SAM, we boost it to detect arbitrary objects from human inputs like category names or reference expressions. Building upon the SAM image encoder, we introduce a novel SideFormer module designed to acquire SAM features adept at perceiving objects and inject comprehensive semantic information for recognition. In addition, we devise an Open-set RPN that leverages SAM proposals to assist in finding potential objects. Consequently, Sambor enables the open-vocabulary detector to equally focus on generalizing both localization and classification sub-tasks. Our approach demonstrates superior zero-shot performance across benchmarks, including COCO and LVIS, proving highly competitive against previous state-of-the-art methods. We aspire for this work to serve as a meaningful endeavor in endowing SAM to recognize diverse object categories and advancing open-vocabulary learning with the support of vision foundation models.
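The abstract describes recognizing arbitrary objects from category names or reference expressions. The paper does not publish this code; the following is a minimal illustrative sketch of the generic cross-modal alignment idea behind such open-vocabulary classifiers: L2-normalized region features are scored against text embeddings of category names via cosine similarity, so the category set can be swapped at inference time. The function name, temperature value, and array shapes are assumptions for illustration.

```python
import numpy as np

def open_vocab_classify(region_feats, text_embeds, temperature=0.01):
    """Score candidate regions against arbitrary category-name embeddings.

    region_feats: (R, D) visual features for R candidate boxes
    text_embeds:  (C, D) text embeddings for C category names
    Returns an (R, C) probability matrix over the open vocabulary.
    """
    # L2-normalize both sides so the dot product is a cosine similarity
    v = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    logits = (v @ t.T) / temperature
    # Numerically stable softmax over the candidate categories
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)
```

Because the classifier is just a similarity against text embeddings, "detecting a new category" reduces to embedding a new name, with no retraining of the detector head.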
Problem

Research questions and friction points this paper is trying to address.

Enhance SAM for open-vocabulary object detection
Integrate semantic information into SAM features
Improve localization and classification in zero-shot scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates SAM with open-vocabulary detector
Introduces SideFormer for semantic enhancement
Utilizes Open-set RPN for object detection
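The Open-set RPN is described as using SAM proposals to assist in finding potential objects. The paper's exact mechanism is not given here; the sketch below illustrates one simple, assumed way such proposal fusion could work: treat class-agnostic SAM boxes as priors and add learned RPN boxes only when they are not already covered, using an IoU threshold. All names and the threshold are hypothetical.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, format [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def merge_proposals(sam_boxes, rpn_boxes, iou_thresh=0.7):
    """Keep class-agnostic SAM proposals as priors, then append RPN boxes
    that no existing proposal already covers above the IoU threshold."""
    kept = list(sam_boxes)
    for box in rpn_boxes:
        if len(kept) == 0 or iou(box, np.asarray(kept)).max() < iou_thresh:
            kept.append(box)
    return np.asarray(kept)
```

The design intuition is that SAM's zero-shot proposals generalize well to unseen objects, so letting them seed the candidate set helps the localization branch generalize alongside the open-vocabulary classifier.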
👥 Authors
Xumeng Han
University of Chinese Academy of Sciences
Computer vision
Longhui Wei
Senior Researcher, Huawei
Multimodal & visual pre-training, VLM, multimodal generation
Xuehui Yu
University of Chinese Academy of Sciences
Zhiyang Dou
University of Chinese Academy of Sciences
Xin He
Huawei Cloud
Kuiran Wang
University of Chinese Academy of Sciences
Object tracking, computer vision
Zhenjun Han
University of Chinese Academy of Sciences
Qi Tian
Huawei Cloud