DETR-ViP: Detection Transformer with Robust Discriminative Visual Prompts

📅 2026-04-16

📈 Citations: 0

✨ Influential: 0

career value

145K/year

🤖 AI Summary

This work addresses the limited performance of existing vision prompt-based object detection methods, which stems from the lack of global discriminability in visual prompts—a critical issue long overlooked. To tackle this, we propose DETR-ViP, a novel framework that systematically analyzes the root cause and introduces a dedicated detection architecture. Our approach enhances visual prompt representations through global visual prompt integration, distillation of vision–text prompt relationships, and selective feature fusion. Built upon the Detection Transformer and augmented with image–text contrastive learning, DETR-ViP achieves state-of-the-art performance across multiple benchmarks—including COCO, LVIS, ODinW, and Roboflow100—significantly outperforming current methods in open-vocabulary vision prompt detection tasks.

Technology Category

Application Category

📝 Abstract

Visual prompted object detection enables interactive and flexible definition of target categories, thereby facilitating open-vocabulary detection. Since visual prompts are derived directly from image features, they often outperform text prompts in recognizing rare categories. Nevertheless, research on visual prompted detection has been largely overlooked, and it is typically treated as a byproduct of training text prompted detectors, which hinders its development. To fully unlock the potential of visual-prompted detection, we investigate the reasons why its performance is suboptimal and reveal that the underlying issue lies in the absence of global discriminability in visual prompts. Motivated by these observations, we propose DETR-ViP, a robust object detection framework that yields class-distinguishable visual prompts. On top of basic image-text contrastive learning, DETR-ViP incorporates global prompt integration and visual-textual prompt relation distillation to learn more discriminative prompt representations. In addition, DETR-ViP employs a selective fusion strategy that ensures stable and robust detection. Extensive experiments on COCO, LVIS, ODinW, and Roboflow100 demonstrate that DETR-ViP achieves substantially higher performance in visual prompt detection compared to other state-of-the-art counterparts. A series of ablation studies and analyses further validate the effectiveness of the proposed improvements and shed light on the underlying reasons for the enhanced detection capability of visual prompts.

Problem

Research questions and friction points this paper is trying to address.

visual prompt

object detection

discriminability

open-vocabulary detection

prompt representation

Innovation

Methods, ideas, or system contributions that make the work stand out.

visual prompts

discriminative representation

prompt distillation