LV-OSD: Language-Vision-Complementary Open-Set Object Detection

📅 2026-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limitation of conventional open-set object detection methods that rely on fixed category lists by introducing, for the first time, a language–vision complementary open-set detection task. This formulation enables dynamic specification of target categories through arbitrary textual and/or image prompts. To this end, the authors propose LVDor, a dual-branch detection framework equipped with a Target-guided Prompt Dynamic Weighting (TPDW) module and a Prompt Random Masking (PRM) training strategy, which jointly facilitate effective fusion of multimodal prompts and bridge the semantic gap between modalities. Extensive experiments demonstrate that the proposed approach achieves high detection accuracy across diverse prompt combinations, thereby validating both the feasibility of the new task formulation and the efficacy of the model design.
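To make the idea of target-guided weighting concrete, here is a minimal sketch in plain Python. It is not the authors' TPDW module: the summary does not give the actual formula, so this assumes the simplest reading, in which each prompt embedding is scored by cosine similarity to a target-image feature and the scores are turned into weights with a softmax. The names `weight_prompts`, `cosine`, and `softmax`, and the `temperature` parameter, are hypothetical.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-12)


def weight_prompts(target_feature, prompt_embeddings, temperature=0.1):
    """Assumed stand-in for target-guided weighting (not the paper's formula).

    Scores each prompt embedding against the target-image feature, converts
    the scores to weights, and returns the weights plus the weighted sum.
    """
    scores = [cosine(target_feature, p) for p in prompt_embeddings]
    weights = softmax(scores, temperature)
    dim = len(target_feature)
    fused = [
        sum(w * p[i] for w, p in zip(weights, prompt_embeddings))
        for i in range(dim)
    ]
    return weights, fused


# Toy 2-D example: the first prompt matches the target direction exactly.
weights, fused = weight_prompts(
    [1.0, 0.0],
    [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]],
)
```

In this toy setup the prompt closest to the target feature receives the largest weight, which is the behavior the summary attributes to the module in general terms. The same function could be applied separately to text and image prompt embeddings.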

📝 Abstract

Object detection is an important task in computer vision that aims to detect objects of interest specified by a given category list or by query images. In this work, we propose a new problem, language-vision-complementary open-set object detection (LV-OSD), i.e., using flexible text-based and/or image-based prompts to specify the desired object categories. This setting is more common and practical in real-world applications. For this purpose, we design a dual-branch detection framework, LVDor, which can simultaneously accept both text and image prompts. Specifically, we first build the Multi-modal Prompts (MPr), which contain various text descriptions and image samples for each category. Subsequently, to bridge the semantic gap among the input image, text prompts, and image prompts, we design a Target-guided Prompt Dynamic Weighting (TPDW) module. Guided by the prior information of the target image, this module dynamically produces the text and image prompts that best align with the target semantics, reducing the discrepancy between the two modalities and thereby accommodating the LV-OSD setting. We also propose a simple Prompt Random Masking (PRM) mechanism, applied during training, to simulate the arbitrary combinations of text and/or image prompts encountered at test time. Extensive experimental results verify the soundness of our problem formulation and the effectiveness of our method. Prompts and code will be released publicly.
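The masking idea can be sketched as follows. This is an illustration under stated assumptions, not the released PRM code: the abstract does not give the masking probabilities or say whether both modalities can be dropped together, so this version assumes one modality at most is dropped per training sample. The function name `mask_prompts` and the parameters `p_drop_text` and `p_drop_image` are hypothetical.

```python
import random


def mask_prompts(text_prompt, image_prompt,
                 p_drop_text=0.3, p_drop_image=0.3, rng=random):
    """Randomly drop one prompt modality for a training sample.

    Assumption: at least one modality is always kept, so the model sees
    text-only, image-only, and text+image inputs during training.
    A dropped prompt is returned as None.
    """
    if p_drop_text < 0 or p_drop_image < 0 or p_drop_text + p_drop_image > 1:
        raise ValueError("drop probabilities must be >= 0 and sum to <= 1")
    r = rng.random()
    if r < p_drop_text:
        return None, image_prompt          # image-only sample
    if r < p_drop_text + p_drop_image:
        return text_prompt, None           # text-only sample
    return text_prompt, image_prompt       # both modalities kept


# Seeded generator so the toy run is reproducible.
rng = random.Random(0)
samples = [mask_prompts("a photo of a dog", "dog_crop.png", rng=rng)
           for _ in range(1000)]
```

Drawing a single random number and splitting its range keeps the three outcomes mutually exclusive, which is one simple way to guarantee that a sample never loses both prompts.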

Problem

Research questions and friction points this paper is trying to address.

open-set object detection

language-vision

prompt-based detection

multi-modal prompts

object detection

Innovation

Methods, ideas, or system contributions that make the work stand out.

open-set object detection

multimodal prompting

language-vision alignment