🤖 AI Summary
This work addresses a critical limitation of existing dual-encoder models like CLIP, which often assign high image-text similarity scores to “half-truth” captions—descriptions containing partially incorrect details—violating the intuition that adding a wrong detail should lower similarity. To mitigate this, the authors propose CS-CLIP, which introduces component-level supervision: textual descriptions are decomposed into fine-grained semantic units (entities and relations), and a minimal-edit foil is constructed for each unit to serve as a negative example. The model is then fine-tuned via contrastive learning while preserving the standard dual-encoder architecture, explicitly enforcing grounded visual alignment for each semantic unit. Evaluated on COCO, CS-CLIP improves half-truth detection accuracy from 40.6% to 69.3% and achieves an average gain of 5.7 points across multiple compositional understanding benchmarks, substantially enhancing the model’s ability to discern local semantic fidelity and support compositional generalization.
📝 Abstract
When a text description is extended with an additional detail, image-text similarity should drop if that detail is wrong. We show that CLIP-style dual encoders often violate this intuition: appending a plausible but incorrect object or relation to an otherwise correct description can increase the similarity score. We call such cases half-truths. On COCO, CLIP prefers the correct shorter description only 40.6% of the time, and performance drops to 32.9% when the added detail is a relation. We trace this vulnerability to weak supervision on caption parts: contrastive training aligns full sentences but does not explicitly enforce that individual entities and relations are grounded. We propose CS-CLIP (Component-Supervised CLIP), which decomposes captions into entity and relation units, constructs a minimally edited foil for each unit, and fine-tunes the model to score the correct unit above its foil while preserving standard dual-encoder inference. CS-CLIP raises half-truth accuracy to 69.3% and improves average performance on established compositional benchmarks by 5.7 points, suggesting that reducing half-truth errors aligns with broader gains in compositional understanding. Code is publicly available at: https://github.com/kargibora/CS-CLIP
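The unit-vs-foil objective described above can be sketched as a simple hinge loss over cosine similarities: for each semantic unit, the correct unit should score higher against the image than its minimal-edit foil by some margin. This is an illustrative sketch, not the authors' implementation; the function names (`cosine`, `unit_foil_loss`) and the margin value are assumptions for the example.

```python
# Illustrative sketch of the component-level supervision idea from the
# abstract (NOT the authors' code): a hinge loss that pushes the correct
# semantic unit's image-text similarity above that of its minimal-edit foil.
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def unit_foil_loss(img_emb, unit_emb, foil_emb, margin=0.2):
    """Hinge loss: the correct unit (e.g. "a red bus") should outscore
    its minimal-edit foil (e.g. "a blue bus") against the same image
    embedding by at least `margin`. The margin value is a placeholder."""
    gap = cosine(img_emb, unit_emb) - cosine(img_emb, foil_emb)
    return max(0.0, margin - gap)
```

In a half-truth case, the foil scores at least as high as the correct unit, so the loss is positive and fine-tuning pushes the encoders to widen the gap; once the correct unit clears the margin, the loss is zero and that unit no longer contributes a gradient.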