Half-Truths Break Similarity-Based Retrieval

📅 2026-02-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a critical limitation of dual-encoder models such as CLIP, which often assign high image-text similarity scores to “half-truth” captions — descriptions containing partially incorrect details — violating the commonsense expectation that similarity should drop when a detail is wrong. To mitigate this, the authors propose CS-CLIP, a novel approach that introduces component-level supervision: textual descriptions are decomposed into fine-grained semantic units (entities and relations), and minimal-edit foils are constructed to serve as negative examples. The model is then fine-tuned via contrastive learning while preserving the standard dual-encoder architecture, explicitly enforcing grounded visual alignment for each semantic unit. Evaluated on COCO, CS-CLIP improves half-truth detection accuracy from 40.6% to 69.3% and achieves an average gain of 5.7 points across multiple compositional understanding benchmarks, substantially enhancing the model’s ability to discern local semantic fidelity and support compositional generalization.
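The decomposition-plus-foil step described above can be pictured with a minimal sketch. The unit identifiers, phrasings, and single-word swap rules below are illustrative assumptions for one caption, not the paper's actual pipeline:

```python
# Hypothetical decomposition of one caption into semantic units
# (entities and relations), each paired with a minimal-edit foil.
caption = "a man riding a horse on a beach"

units = {
    "entity:man":      "a man on a beach",
    "entity:horse":    "a horse on a beach",
    "relation:riding": "a man riding a horse",
}

def make_foil(unit_id, text):
    # One minimal edit per unit: swap the entity or the relation word,
    # leaving the rest of the phrase untouched. Swap table is invented
    # for illustration.
    swaps = {
        "entity:man":      ("man", "woman"),
        "entity:horse":    ("horse", "camel"),
        "relation:riding": ("riding", "leading"),
    }
    old, new = swaps[unit_id]
    return text.replace(old, new)

foils = {uid: make_foil(uid, text) for uid, text in units.items()}
```

Because each foil differs from its unit by a single swapped word, any similarity gap between the pair can only come from that one detail — which is what makes the foils usable as targeted negatives.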

📝 Abstract
When a text description is extended with an additional detail, image-text similarity should drop if that detail is wrong. We show that CLIP-style dual encoders often violate this intuition: appending a plausible but incorrect object or relation to an otherwise correct description can increase the similarity score. We call such cases half-truths. On COCO, CLIP prefers the correct shorter description only 40.6% of the time, and performance drops to 32.9% when the added detail is a relation. We trace this vulnerability to weak supervision on caption parts: contrastive training aligns full sentences but does not explicitly enforce that individual entities and relations are grounded. We propose CS-CLIP (Component-Supervised CLIP), which decomposes captions into entity and relation units, constructs a minimally edited foil for each unit, and fine-tunes the model to score the correct unit above its foil while preserving standard dual-encoder inference. CS-CLIP raises half-truth accuracy to 69.3% and improves average performance on established compositional benchmarks by 5.7 points, suggesting that reducing half-truth errors aligns with broader gains in compositional understanding. Code is publicly available at: https://github.com/kargibora/CS-CLIP
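The abstract's fine-tuning objective — score each correct unit above its minimally edited foil — could be instantiated as a per-unit margin loss over cosine similarities. This is a hedged sketch, not the paper's implementation: the hinge form, the margin value, and the random embeddings standing in for CLIP's encoders are all assumptions.

```python
import numpy as np

def foil_margin_loss(sim_pos, sim_foil, margin=0.2):
    # Hinge loss: penalize whenever a foil's similarity comes within
    # `margin` of (or exceeds) its correct unit's similarity.
    return np.maximum(0.0, margin - (sim_pos - sim_foil)).mean()

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Random stand-ins for CLIP image/text embeddings (batch of 4 units).
rng = np.random.default_rng(0)
img = normalize(rng.standard_normal((4, 512)))       # image embeddings
txt_pos = normalize(rng.standard_normal((4, 512)))   # correct semantic units
txt_foil = normalize(rng.standard_normal((4, 512)))  # minimal-edit foils

sim_pos = (img * txt_pos).sum(axis=-1)    # cosine similarity per pair
sim_foil = (img * txt_foil).sum(axis=-1)
loss = foil_margin_loss(sim_pos, sim_foil)
```

Note that this loss only touches the scoring of text embeddings against a fixed image embedding, which is consistent with the abstract's claim that standard dual-encoder inference is preserved.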
Problem

Research questions and friction points this paper is trying to address.

half-truths
image-text similarity
compositional understanding
dual encoders
CLIP
Innovation

Methods, ideas, or system contributions that make the work stand out.

half-truths
component supervision
compositional understanding
CLIP
foil-based fine-tuning
Bora Kargi
University of Tübingen, Tübingen AI Center; Konrad Zuse School of Excellence in Learning and Intelligent Systems (ELIZA)
Arnas Uselis
University of Tübingen
Seong Joon Oh
University of Tübingen
Machine learning · Computer vision