ODOV: Towards Open-Domain Open-Vocabulary Object Detection

📅 2025-08-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses open-domain open-vocabulary (ODOV) object detection—a challenging setting where models must generalize to previously unseen domain-category combinations. To tackle this, we propose the first end-to-end ODOV detection framework. Our method leverages large language models to generate domain-agnostic text prompts and jointly optimizes vision-language alignment and domain adaptation via image-driven domain embedding learning, dynamically generating test-image-specific category embeddings. To enable systematic evaluation, we introduce OD-LVIS, the first large-scale benchmark for ODOV detection, encompassing 18 real-world domains and 1,203 categories. Extensive experiments demonstrate that our approach significantly outperforms existing methods on OD-LVIS, validating both the task formulation and the benchmark’s utility, while establishing a novel paradigm for cross-domain, cross-category detection in realistic scenarios.

📝 Abstract
In this work, we address a new problem, Open-Domain Open-Vocabulary (ODOV) object detection, which considers the detection model's adaptability to the real world under both domain and category shifts. For this problem, we first construct a new benchmark, OD-LVIS, which includes 46,949 images, covers 18 complex real-world domains and 1,203 categories, and provides a comprehensive dataset for evaluating real-world object detection. We also develop a novel baseline method for ODOV detection. The proposed method first leverages large language models to generate domain-agnostic text prompts for category embedding. It further learns a domain embedding from the given image, which, during testing, can be integrated into the category embedding to form a customized domain-specific category embedding for each test image. We provide extensive benchmark evaluations for the proposed ODOV detection task and report the results, which verify the rationale of ODOV detection, the usefulness of our benchmark, and the superiority of the proposed method.
Problem

Research questions and friction points this paper is trying to address.

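Detecting objects under simultaneous domain shift and category shift (the ODOV setting)
Creating a benchmark for evaluating detection under real-world domain and category shifts
Building domain-specific category embeddings by combining LLM-generated domain-agnostic text prompts with a domain embedding learned from the image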
Innovation

Methods, ideas, or system contributions that make the work stand out.

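Constructs the OD-LVIS benchmark: 46,949 images covering 18 real-world domains and 1,203 categories
Uses large language models to generate domain-agnostic text prompts for category embedding
Learns a domain embedding from each image and, at test time, integrates it into the category embedding to form an image-specific, domain-specific category embedding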
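The abstract says the domain embedding learned from an image is integrated into the category embedding at test time, but this page does not say how. The sketch below illustrates one possible reading only. The additive fusion, the `alpha` weight, the cosine-similarity scoring, and all function names and toy vectors are assumptions made for illustration, not the authors' method.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def customize_category_embeddings(category_embs, domain_emb, alpha=0.5):
    """Hypothetical fusion step (an assumption, not from the paper):
    add the scaled domain embedding to each domain-agnostic category
    embedding, then renormalize."""
    return {
        name: l2_normalize([c + alpha * d for c, d in zip(emb, domain_emb)])
        for name, emb in category_embs.items()
    }


def classify_region(region_feat, category_embs):
    """Score a region feature against each category embedding by cosine
    similarity and return the best-scoring category with all scores."""
    region = l2_normalize(region_feat)
    scores = {
        name: sum(r * c for r, c in zip(region, emb))
        for name, emb in category_embs.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy 3-d vectors standing in for real text and image embeddings.
category_embs = {"cat": [1.0, 0.0, 0.0], "dog": [0.0, 1.0, 0.0]}
domain_emb = [0.0, 0.0, 1.0]      # stand-in for an image-derived domain embedding
region_feat = [0.9, 0.1, 0.5]     # stand-in for a region feature from the same image

custom_embs = customize_category_embeddings(category_embs, domain_emb)
best, scores = classify_region(region_feat, custom_embs)
print(best, scores)
```

In this toy setup the region feature carries a domain component (the third coordinate), so shifting every category embedding toward the domain embedding raises its similarity to regions from that domain while the category-specific coordinates still decide the ranking.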
Yupeng Zhang
College of Intelligence and Computing, Tianjin University, Tianjin, China 300350
Ruize Han
SUAT
Computer Vision, Multimedia Analysis, Video Understanding, Active Vision
Fangnan Zhou
College of Intelligence and Computing, Tianjin University, Tianjin, China 300350
Song Wang
Faculty of Computer Science and Control Engineering, Shenzhen University of Advanced Technology, Shenzhen, China 518107
Wei Feng
College of Intelligence and Computing, Tianjin University, Tianjin, China 300350
Liang Wan
College of Intelligence and Computing, Tianjin University, Tianjin, China 300350