🤖 AI Summary
Traditional object detection operates under a closed-set assumption and cannot recognize novel categories unseen during training. Open-vocabulary detection (OVD) supports an arbitrary vocabulary, but it is sensitive to prompts and generalizes poorly to out-of-distribution (OOD) objects and incremental learning scenarios. To address these limitations, we propose a new open-world object detection paradigm: the Open World Embedding Learning (OWEL) framework, which introduces a Pseudo Unknown Embedding mechanism, combined with Multi-Scale Contrastive Anchor Learning (MSCAL), enabling robust discrimination of both near- and far-OOD objects and incremental learning of new classes. By integrating vision-language models, semantic-space modeling, and multi-scale contrastive learning, the method achieves state-of-the-art performance on standard open-world benchmarks (e.g., OWOD) and autonomous driving datasets (e.g., BDD100K-OW), significantly improving novel-category recognition and incremental learning capability.
📝 Abstract
Traditional object detection methods operate under the closed-set assumption: models can only detect the fixed set of object categories predefined in the training set. Recent work on open vocabulary object detection (OVD) enables the detection of objects defined by an in-principle unbounded vocabulary, which reduces the cost of training models for specific tasks. However, OVD relies heavily on accurate prompts provided by an "oracle", which limits its use in critical applications such as driving scene perception. OVD models tend to misclassify near-out-of-distribution (NOOD) objects that share features with known classes, and to ignore far-out-of-distribution (FOOD) objects. To address these limitations, we propose a framework that enables OVD models to operate in open-world settings by identifying and incrementally learning previously unseen objects. To detect FOOD objects, we propose Open World Embedding Learning (OWEL) and introduce the Pseudo Unknown Embedding, which infers the location of unknown classes in a continuous semantic space from information about the known classes. We also propose Multi-Scale Contrastive Anchor Learning (MSCAL), which enables the identification of misclassified unknown objects by promoting intra-class consistency of object embeddings across scales. The proposed method achieves state-of-the-art performance on standard open world object detection and autonomous driving benchmarks while maintaining its open vocabulary object detection capability.
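The two ideas in the abstract can be illustrated with a minimal numerical sketch. This is not the paper's implementation: all function names, the centroid-plus-residual formula for the pseudo-unknown embedding, and the single-scale InfoNCE-style anchor loss are assumptions made here for illustration only.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Project embeddings onto the unit sphere, as in CLIP-style semantic spaces."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def pseudo_unknown_embedding(known_embs, residual):
    """Infer an 'unknown' class embedding from known-class embeddings.

    Assumption (not from the paper): the unknown class is located near the
    centroid of the known classes in the continuous semantic space, shifted
    by a learnable residual vector.
    """
    centroid = known_embs.mean(axis=0)
    return l2_normalize(centroid + residual)

def contrastive_anchor_loss(obj_embs, labels, anchors, tau=0.07):
    """Toy single-scale contrastive anchor loss.

    Pulls each object embedding toward its class anchor and pushes it away
    from the other anchors (softmax cross-entropy over anchors). MSCAL, per
    the abstract, applies this idea at multiple feature scales.
    """
    logits = obj_embs @ anchors.T / tau          # (N, C) scaled similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

rng = np.random.default_rng(0)
D, C = 16, 4
known = l2_normalize(rng.normal(size=(C, D)))        # known-class embeddings
unknown = pseudo_unknown_embedding(known, rng.normal(size=D) * 0.1)

# Classify a region embedding against the known classes plus the pseudo-unknown,
# so a far-OOD region can fall into the explicit unknown slot.
region = l2_normalize(rng.normal(size=D))
class_bank = np.vstack([known, unknown])             # (C + 1, D)
pred = int(np.argmax(class_bank @ region))
```

The design choice illustrated is that the unknown class gets an explicit embedding in the same semantic space as the known classes, so detection of unknowns reduces to the same nearest-embedding classification used for known classes.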