DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding

📅 2024-11-21
🏛️ arXiv.org
📈 Citations: 18 (Influential: 3)
🤖 AI Summary
This work addresses open-world visual understanding by proposing DINO-X Pro, a unified object-centric vision model that tackles long-tailed distribution, zero-shot recognition, and multi-task collaborative perception. Methodologically: (1) it introduces a novel universal object prompting mechanism enabling prompt-free open-vocabulary detection; (2) it constructs Grounding-100M, a high-quality, 100-million-sample grounding dataset; and (3) it designs a multi-task joint decoder supporting detection, segmentation, pose estimation, visual captioning, and visual question answering. Built upon a Transformer encoder-decoder architecture, the model fuses textual, visual, and customizable multimodal prompts and undergoes large-scale grounding pretraining. Experiments demonstrate state-of-the-art performance: 56.0/59.8/52.4 AP on COCO, LVIS-minival, and LVIS-val zero-shot detection benchmarks, respectively; and 63.3/56.5 AP on rare classes in LVIS, surpassing prior art by +5.8/+5.0 AP.

📝 Abstract
In this paper, we introduce DINO-X, a unified object-centric vision model developed by IDEA Research with the best open-world object detection performance to date. DINO-X employs the same Transformer-based encoder-decoder architecture as Grounding DINO 1.5 to pursue an object-level representation for open-world object understanding. To ease long-tailed object detection, DINO-X extends its input options to support text prompts, visual prompts, and customized prompts. With such flexible prompt options, we develop a universal object prompt to support prompt-free open-world detection, making it possible to detect anything in an image without requiring users to provide any prompt. To enhance the model's core grounding capability, we constructed a large-scale dataset with over 100 million high-quality grounding samples, referred to as Grounding-100M, to advance the model's open-vocabulary detection performance. Pre-training on such a large-scale grounding dataset yields a foundational object-level representation, which enables DINO-X to integrate multiple perception heads and thereby simultaneously support multiple object perception and understanding tasks, including detection, segmentation, pose estimation, object captioning, and object-based QA. Experimental results demonstrate the superior performance of DINO-X. Specifically, the DINO-X Pro model achieves 56.0 AP, 59.8 AP, and 52.4 AP on the COCO, LVIS-minival, and LVIS-val zero-shot object detection benchmarks, respectively. Notably, it scores 63.3 AP and 56.5 AP on the rare classes of the LVIS-minival and LVIS-val benchmarks, improving on the previous SOTA by 5.8 AP and 5.0 AP. This result underscores its significantly improved capacity for recognizing long-tailed objects.
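The abstract's central interface idea is that one model entry point accepts three kinds of prompts: text phrases, visual exemplars, or a built-in universal prompt for prompt-free detection. The sketch below illustrates that dispatch pattern only; it is not the actual DINO-X API, and every name in it (`TextPrompt`, `VisualPrompt`, `UniversalPrompt`, `detect`) is hypothetical, with stubbed outputs standing in for real model inference.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical prompt types mirroring the three input options the
# abstract describes: text, visual exemplars, and a universal prompt.
@dataclass
class TextPrompt:
    categories: List[str]            # e.g. ["dog", "fire hydrant"]

@dataclass
class VisualPrompt:
    boxes: List[Tuple[int, int, int, int]]   # exemplar boxes (x1, y1, x2, y2)

@dataclass
class UniversalPrompt:
    pass                             # prompt-free: detect anything

@dataclass
class Detection:
    label: str
    box: Tuple[int, int, int, int]
    score: float

def detect(image, prompt) -> List[Detection]:
    """Unified dispatch sketch: one entry point, behavior selected by
    the prompt type. Outputs here are placeholders, not model results."""
    if isinstance(prompt, TextPrompt):
        # open-vocabulary mode: match regions against the given phrases
        return [Detection(c, (0, 0, 10, 10), 0.9) for c in prompt.categories]
    if isinstance(prompt, VisualPrompt):
        # exemplar mode: find objects similar to the example boxes
        return [Detection("exemplar-match", b, 0.8) for b in prompt.boxes]
    if isinstance(prompt, UniversalPrompt):
        # prompt-free mode: the learned universal object prompt covers all objects
        return [Detection("object", (0, 0, 10, 10), 0.7)]
    raise TypeError(f"unsupported prompt: {type(prompt).__name__}")
```

Under this pattern, adding the paper's extra heads (segmentation, pose, captioning, QA) would mean attaching more output fields to `Detection` rather than new entry points, which is the "object-level representation shared across tasks" idea in miniature.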
Problem

Research questions and friction points this paper is trying to address.

How to build a single vision model that unifies open-world object detection and understanding
How to improve long-tailed object detection through flexible prompt options
How to strengthen open-vocabulary detection via large-scale grounding pre-training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer-based encoder-decoder for object representation
Flexible prompt options for open-world detection
Large-scale dataset pre-training for grounding capability
👥 Authors
Tianhe Ren, PhD student of Electrical and Electronic Engineering, The University of Hong Kong (Computer Vision, Machine Learning, Multi-Modality)
Yihao Chen, International Digital Economy Academy (IDEA), IDEA Research
Qing Jiang, PhD student, South China University of Technology (Computer Vision, Open-set Object Detection)
Zhaoyang Zeng, International Digital Economy Academy (Computer Vision, Multimedia Understanding)
Yuda Xiong, International Digital Economy Academy (IDEA), IDEA Research
Wenlong Liu, International Digital Economy Academy (IDEA), IDEA Research
Zhengyu Ma, Pengcheng Laboratory (Neuroscience, Neural Network Dynamics, Computational Physics)
Junyi Shen, International Digital Economy Academy (IDEA), IDEA Research
Yuan Gao, International Digital Economy Academy (IDEA), IDEA Research
Xiaoke Jiang, Research@IDEA (Computer Vision, Industrial Vision, Computer Networking)
Xingyu Chen, International Digital Economy Academy (IDEA), IDEA Research
Zhuheng Song, International Digital Economy Academy (IDEA), IDEA Research
Yuhong Zhang, International Digital Economy Academy (IDEA), IDEA Research
Hongjie Huang, International Digital Economy Academy (IDEA), IDEA Research
Han Gao, International Digital Economy Academy (IDEA), IDEA Research
Shilong Liu, RS@ByteDance, PhD@THU (Computer Vision, Object Detection, Visual Grounding, Multi-Modality, Multimodal Agent)
Hao Zhang, International Digital Economy Academy (IDEA), IDEA Research
Feng Li, International Digital Economy Academy (IDEA), IDEA Research
Kent Yu, International Digital Economy Academy (IDEA), IDEA Research
Lei Zhang, International Digital Economy Academy (IDEA), IDEA Research