Concrete Jungle: Towards Concreteness Paved Contrastive Negative Mining for Compositional Understanding

📅 2026-04-14

🤖 AI Summary
This work addresses the poor performance of vision-language models in compositional reasoning, which stems from their sensitivity to word order and frequent attribute-binding errors—issues rooted in the lack of fine-grained negative samples during contrastive pretraining. To tackle this, the authors propose Slipform, a novel framework that introduces lexical concreteness scores from psycholinguistics into the negative sampling strategy. Central to Slipform is the ConcretePlant module, designed to identify and manipulate perceptually groundable concepts, coupled with a margin-based Cement loss that dynamically adjusts penalty strength to mitigate gradient imbalance in InfoNCE. This approach substantially enhances the model’s capacity to discern subtle semantic distinctions, achieving state-of-the-art performance across multiple benchmarks in compositional understanding, cross-modal retrieval, and linear probing tasks.
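The core idea in the summary, choosing which word to alter based on its concreteness score, can be illustrated with a small sketch. This is not the authors' ConcretePlant implementation: the `CONCRETENESS` scores, the `SWAPS` table, and both function names are hypothetical, and the 1 to 5 scale is assumed only for illustration.

```python
# Hypothetical concreteness scores on an assumed 1 (abstract) to 5 (concrete) scale.
# The values are illustrative and not taken from any published norm.
CONCRETENESS = {
    "dog": 4.9, "ball": 4.8, "red": 4.2, "chases": 3.4, "a": 1.5, "the": 1.4,
}

# Hypothetical swap table: each concrete word maps to a same-category substitute.
SWAPS = {"dog": "cat", "ball": "frisbee", "red": "blue"}


def most_concrete_swappable(tokens):
    """Return the index of the most concrete token that has a substitute, or None."""
    best_idx, best_score = None, float("-inf")
    for i, tok in enumerate(tokens):
        word = tok.lower()
        score = CONCRETENESS.get(word, 0.0)
        if word in SWAPS and score > best_score:
            best_idx, best_score = i, score
    return best_idx


def make_hard_negative(caption):
    """Build a negative caption by replacing the most concrete swappable word.

    Returns (negative_caption, concreteness_of_replaced_word),
    or (None, 0.0) when no word in the caption can be swapped.
    """
    tokens = caption.split()
    idx = most_concrete_swappable(tokens)
    if idx is None:
        return None, 0.0
    word = tokens[idx].lower()
    negative = list(tokens)
    negative[idx] = SWAPS[word]
    return " ".join(negative), CONCRETENESS[word]


negative, score = make_hard_negative("a red dog chases the ball")
print(negative, score)
```

In this toy setup the word "dog" has the highest score among swappable words, so it is the one replaced. A real pipeline would presumably draw scores from a psycholinguistic norm and substitutes from a lexical resource or a language model; neither choice is specified in the text above.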

📝 Abstract
Vision-Language Models demonstrate remarkable capabilities but often struggle with compositional reasoning, exhibiting vulnerabilities regarding word order and attribute binding. This limitation arises from a scarcity of informative samples needed to differentiate subtle semantic variations during contrastive pretraining. Although hard negative mining offers a promising remedy, existing methods lack explicit mechanisms to dictate which linguistic elements undergo modification. Instead of engineering generative architectures, this study establishes lexical concreteness as a fundamental determinant of negative sample efficacy. Modifying highly concrete terms generates more pronounced structural and visual discrepancies, providing a substantially stronger learning signal. Leveraging this principle, ConcretePlant is proposed to systematically isolate and manipulate perceptually grounded concepts. Analysis of the InfoNCE objective further reveals a severe gradient imbalance, where easily distinguishable pairs dominate the optimization process and restrict the bandwidth available for nuanced learning. To resolve this degradation, the Cement loss is formulated using a margin-based approach. By correlating psycholinguistic scores with sample difficulty, this objective dynamically calibrates the penalization applied to individual training pairs. Comprehensive evaluations substantiate these theoretical claims. The integrated framework, designated as Slipform, achieves state-of-the-art accuracy across diverse compositional evaluation benchmarks, general cross-modal retrieval, and single-label and multi-label linear probing.
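The abstract does not give the Cement loss formula, so the following is only a generic sketch of a margin-augmented InfoNCE term in which each negative receives its own margin. The functions `margin_info_nce` and `concreteness_margin`, the linear score-to-margin mapping, the 1 to 5 score range, and the default temperature are all assumptions made for illustration, not the authors' formulation.

```python
import math


def concreteness_margin(score, max_margin=0.2, lo=1.0, hi=5.0):
    """Map a concreteness score to a margin in [0, max_margin].

    Assumption: a linear mapping in which more concrete edits get a larger
    margin. The direction and shape of this mapping are illustrative only.
    """
    clipped = min(max(score, lo), hi)
    return max_margin * (clipped - lo) / (hi - lo)


def margin_info_nce(pos_sim, neg_sims, margins, temperature=0.07):
    """InfoNCE loss for one anchor with a per-negative additive margin.

    loss = -log( exp(s_pos / t) / (exp(s_pos / t) + sum_j exp((s_j + m_j) / t)) )
    With all margins set to zero this reduces to the standard InfoNCE term.
    """
    if len(neg_sims) != len(margins):
        raise ValueError("neg_sims and margins must have the same length")
    logits = [pos_sim / temperature]
    logits += [(s + m) / temperature for s, m in zip(neg_sims, margins)]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[0]


# One positive pair and two hard negatives built from words of differing concreteness.
margins = [concreteness_margin(4.9), concreteness_margin(2.0)]
plain = margin_info_nce(0.8, [0.7, 0.6], [0.0, 0.0])
with_margin = margin_info_nce(0.8, [0.7, 0.6], margins)
print(plain, with_margin)
```

Adding a positive margin to a negative's similarity raises its share of the softmax denominator, so that pair contributes a larger loss and gradient than it would under plain InfoNCE. This is one common way to re-weight pairs by difficulty; whether Cement applies the margin additively, or on which side of the pair, is not stated in the abstract.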
Problem

Research questions and friction points this paper is trying to address.

compositional reasoning
contrastive learning
negative mining
vision-language models
attribute binding
Innovation

Methods, ideas, or system contributions that make the work stand out.

concreteness-aware negative mining
compositional reasoning
gradient balancing
margin-based loss
vision-language models