🤖 AI Summary
Current vision-language models (VLMs) suffer from a pervasive “affirmative bias” that impairs their ability to interpret negation correctly, leading to substantial false positives in described object detection (DOD). To address this, the authors propose a systematic solution: (1) constructing CoVAND, a high-quality negation-aware dataset, via a VQA- and chain-of-thought-driven pipeline that generates diverse negative descriptions; (2) designing NegToMe, a text token merging module that explicitly preserves negation polarity; and (3) adapting models with parameter-efficient LoRA fine-tuning. The method achieves a gain of up to +10.8 points in NMS-AP on OVDEval, significantly reduces false detection rates, and generalizes to mainstream VLMs. The core contribution is the first holistic mitigation of VLMs’ negation-understanding deficiency: integrating negation-aware data construction, polarity-explicit semantic modeling, and lightweight adaptation.
📝 Abstract
State-of-the-art vision-language models (VLMs) suffer from a critical failure in understanding negation, often referred to as affirmative bias. This limitation is particularly severe in described object detection (DOD) tasks. To address this, we propose two primary contributions: (1) a new dataset pipeline and (2) a lightweight adaptation recipe. First, we introduce CoVAND, a dataset constructed with a systematic chain-of-thought (CoT) and VQA-based pipeline to generate high-quality, instance-grounded negation data. Second, we propose NegToMe, a novel text token merging module that directly tackles the architectural cause of affirmative bias. NegToMe addresses the structural loss of negation cues in tokenization by grouping them with attributes into coherent semantic phrases, maintaining correct polarity at the input level and enabling robust negation understanding even with limited data. For instance, to prevent a model from treating the fragmented tokens "not" and "girl" as simply "girl", NegToMe binds them into a single token whose meaning is correctly distinguished from that of "girl" alone. This module is integrated with a parameter-efficient, strategically targeted LoRA fine-tuning approach. Our method significantly improves performance on challenging negation benchmarks while lowering the false positive rate, boosting NMS-AP by up to +10.8 points on OVDEval and generalizing to SoTA VLMs. This work marks a crucial step toward reliable negation understanding in real-world detection applications.
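The "not girl" example can be illustrated with a deliberately simplified sketch. This is not the authors' implementation of NegToMe; the cue list, merge span, and embedding averaging below are all assumptions made purely for illustration of the idea: a negation cue and the tokens in its scope are fused into one merged token whose embedding differs from the affirmative tokens alone, so downstream layers never see the fragments in isolation.

```python
# Hypothetical sketch of negation-aware token merging (NOT the paper's
# NegToMe module). Assumptions: a fixed cue list, a fixed merge span of
# following tokens, and simple embedding averaging as the merge operator.
from typing import List, Tuple

NEGATION_CUES = {"not", "no", "without"}  # assumed cue vocabulary

def merge_negation_tokens(
    tokens: List[str],
    embeddings: List[List[float]],
    span: int = 2,  # how many following tokens to bind to a cue (assumption)
) -> Tuple[List[str], List[List[float]]]:
    """Fuse each negation cue with the tokens in its (assumed) scope by
    averaging their embeddings, so polarity survives as one unit."""
    merged_tokens: List[str] = []
    merged_embs: List[List[float]] = []
    i = 0
    while i < len(tokens):
        if tokens[i].lower() in NEGATION_CUES and i + 1 < len(tokens):
            group = list(range(i, min(i + 1 + span, len(tokens))))
            dim = len(embeddings[0])
            # average the cue's embedding with those of its scope tokens
            avg = [sum(embeddings[j][k] for j in group) / len(group)
                   for k in range(dim)]
            merged_tokens.append("_".join(tokens[j] for j in group))
            merged_embs.append(avg)
            i = group[-1] + 1
        else:
            merged_tokens.append(tokens[i])
            merged_embs.append(embeddings[i])
            i += 1
    return merged_tokens, merged_embs

# "a girl not wearing a hat": the cue "not" and its scope become one token
tokens = ["a", "girl", "not", "wearing", "a", "hat"]
embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
t, e = merge_negation_tokens(tokens, embs)
print(t)  # ['a', 'girl', 'not_wearing_a', 'hat']
```

In this toy version the merged unit `not_wearing_a` carries an embedding distinct from that of `wearing` alone, which is the property the abstract describes: negation polarity is preserved at the input level instead of being scattered across fragments a model can ignore.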