🤖 AI Summary
Existing open-set object detection methods rely solely on positive prompts (e.g., textual descriptions or visual exemplars), which leaves them vulnerable to distractors that are visually similar to the target yet semantically distinct. To address this limitation, we propose, for the first time, the integration of *negative visual prompts*, and introduce a unified framework that jointly encodes positive and negative prompts. We design a training-free *Negating Negative Computing* (NNC) module to suppress negative responses, and propose the *Negating Negative Hinge* (NNH) loss for fine-tuning. Our approach improves discriminative capability: on the LVIS-minival zero-shot detection benchmark it achieves 51.2 AP<sub>r</sub>, the AP on rare (long-tailed) categories, and substantially narrows the performance gap between visual-prompted and text-prompted methods.
📝 Abstract
Object detection methods have evolved from closed-set to open-set paradigms over the years. Current open-set object detectors, however, remain constrained by their exclusive reliance on positive indicators derived from given prompts such as text descriptions or visual exemplars. This positive-only paradigm is consistently vulnerable to visually similar but semantically different distractors. We propose T-Rex-Omni, a novel framework that addresses this limitation by incorporating negative visual prompts to negate hard negative distractors. Specifically, we first introduce a unified visual prompt encoder that jointly processes positive and negative visual prompts. Next, we propose a training-free Negating Negative Computing (NNC) module that dynamically suppresses negative responses during the probability computation stage. To further boost performance through fine-tuning, we introduce a Negating Negative Hinge (NNH) loss that enforces discriminative margins between positive and negative embeddings. T-Rex-Omni supports flexible deployment in both positive-only and joint positive-negative inference modes, accommodating either user-specified or automatically generated negative examples. Extensive experiments demonstrate strong zero-shot detection performance, significantly narrowing the performance gap between visual-prompted and text-prompted methods, with particular strength in long-tailed scenarios (51.2 AP<sub>r</sub> on LVIS-minival). This work establishes negative prompts as a crucial new dimension for advancing open-set visual recognition systems.
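The abstract does not give the formulas behind NNC or the NNH loss, so the following is only a minimal sketch of the general idea, not the authors' formulation. It assumes that candidate regions and prompts are represented as embedding vectors scored by dot product, that negative suppression subtracts the strongest negative response from the positive logit before the sigmoid, and that the hinge loss penalizes candidates whose positive similarity does not exceed the strongest negative similarity by a margin. The function names `suppress_negatives` and `hinge_margin_loss` and the parameters `beta` and `margin` are hypothetical.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def suppress_negatives(query_emb, pos_emb, neg_embs, beta=1.0):
    """Hypothetical training-free negative suppression (assumed form).

    query_emb: (N, D) candidate-region embeddings
    pos_emb:   (D,)   positive prompt embedding
    neg_embs:  (K, D) negative prompt embeddings; K may be 0
    Returns an (N,) array of detection probabilities.
    """
    pos_logit = query_emb @ pos_emb
    if neg_embs.shape[0] == 0:
        # Positive-only mode: no negatives to subtract.
        return sigmoid(pos_logit)
    # Strongest negative response per candidate, clipped at zero so that
    # dissimilar negatives never raise the score.
    neg_logit = np.maximum((query_emb @ neg_embs.T).max(axis=1), 0.0)
    return sigmoid(pos_logit - beta * neg_logit)


def hinge_margin_loss(query_emb, pos_emb, neg_embs, margin=0.2):
    """Hypothetical hinge loss over positive/negative similarities.

    Requires at least one negative prompt (K >= 1). The loss is zero for a
    candidate once its positive similarity exceeds its strongest negative
    similarity by at least `margin`.
    """
    pos_sim = query_emb @ pos_emb
    neg_sim = (query_emb @ neg_embs.T).max(axis=1)
    return np.maximum(0.0, margin + neg_sim - pos_sim).mean()


# Toy example: candidate 0 matches the positive prompt exactly,
# candidate 1 is a distractor lying close to the negative prompt.
queries = np.array([[1.0, 0.0], [0.8, 0.6]])
positive = np.array([1.0, 0.0])
negatives = np.array([[0.6, 0.8]])

pos_only = suppress_negatives(queries, positive, np.empty((0, 2)))
joint = suppress_negatives(queries, positive, negatives)
loss = hinge_margin_loss(queries, positive, negatives)
print(pos_only, joint, loss)
```

In this toy setup the distractor scores above 0.5 in positive-only mode and below 0.5 once the negative prompt is supplied, which is the qualitative behavior the abstract attributes to negative prompts. The actual modules may differ in how negatives are aggregated and weighted.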