What Does the Caption Really Say? Counterfactual Phrase Intervention for Compositional Data Selection in Vision-Language Pretraining

📅 2026-05-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Sample-level filtering methods in vision-language pretraining do not model fine-grained image-text alignment at the level of objects, attributes, and relations, which limits compositional generalization. This work proposes Counterfactual Phrase Intervention (CPI), a phrase-level data curation framework. CPI generates counterfactual captions through controlled nonce-token substitutions and computes image-conditioned phrase-sensitivity scores, which measure how much each caption phrase affects the image-text matching score. The approach is independent of the training loss and applies unchanged to contrastive objectives such as CLIP and NegCLIP. Evaluated at CC3M scale, a 50%-data subset selected by CPI improves VL-CheckList-VG Relation by +1.91 over the full-data baseline and by +1.00 over alignment-only filtering at a matched budget, while improving SugarCrepe overall and preserving general transfer.

📝 Abstract

CLIP-style contrastive pretraining typically curates web-scale image-text pairs using sample-level filtering signals, often based on pair-level alignment. We show that this signal saturates: once coarse mismatches are removed, stricter global filtering no longer tracks the compositional supervision provided by the retained captions. The reason is structural: a global score conflates whether a pair is broadly plausible with whether the individual object, attribute, and relation phrases inside the caption materially support the image-text match. The latter is what compositional generalization demands, yet pair-level filters are blind to it. We address this with Counterfactual Phrase Intervention (CPI), a phrase-level curation framework that converts controlled nonce-token substitutions into image-conditioned phrase-sensitivity scores. CPI uses global alignment only for coarse mismatch removal, then ranks the surviving pool by whether caption phrases measurably affect the image-text score under controlled substitution. We frame CPI as a first-order phrase-sensitivity signal rather than a grounding or identification result, and evaluate it at CC3M scale. Ranking by this signal yields a 50%-data subset that improves VL-CheckList-VG Relation by +1.91 over the full-data baseline and +1.00 over alignment-only filtering at matched budget, while improving SugarCrepe overall and preserving general transfer. CPI is loss-orthogonal: applied unchanged to NegCLIP, it further improves VL-CheckList-VG Relation by +3.84, with additional CE-CLIP gains in the main text.
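
The two-stage selection described in the abstract (coarse removal by global alignment, then ranking by phrase sensitivity) might be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation. The scorer `toy_score` is a hypothetical stand-in for a real image-text similarity model, the nonce token `blicket` is an arbitrary choice, phrases are assumed to come from an external parser, and averaging the per-phrase score drops into one ranking value is an assumption about how the scores could be aggregated.

```python
import re
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Tuple

NONCE = "blicket"  # assumed nonce token; the paper's actual choice is not stated here


@dataclass(frozen=True)
class Example:
    uid: str
    image: FrozenSet[str]      # toy "image": the set of concepts visible in it
    caption: str
    phrases: Tuple[str, ...]   # assumed to be extracted by an external parser


Scorer = Callable[[FrozenSet[str], str], float]


def toy_score(image: FrozenSet[str], caption: str) -> float:
    """Hypothetical stand-in for an image-text similarity model:
    the fraction of caption tokens that name something in the image."""
    tokens = caption.split()
    if not tokens:
        return 0.0
    return sum(t in image for t in tokens) / len(tokens)


def substitute(caption: str, phrase: str, nonce: str = NONCE) -> str:
    """Replace the first whole-word occurrence of `phrase` with the nonce token."""
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return re.sub(pattern, nonce, caption, count=1)


def mean_sensitivity(ex: Example, score: Scorer) -> float:
    """Average drop in the image-text score when each phrase is replaced."""
    base = score(ex.image, ex.caption)
    drops = [base - score(ex.image, substitute(ex.caption, p)) for p in ex.phrases]
    return sum(drops) / len(drops) if drops else 0.0


def select(pool: List[Example], score: Scorer,
           min_align: float, keep_fraction: float) -> List[Example]:
    """Stage 1: drop coarse mismatches. Stage 2: rank survivors by sensitivity."""
    survivors = [ex for ex in pool if score(ex.image, ex.caption) >= min_align]
    ranked = sorted(survivors, key=lambda ex: mean_sensitivity(ex, score), reverse=True)
    budget = int(len(pool) * keep_fraction)  # budget relative to the full pool (assumption)
    return ranked[:budget]


pool = [
    Example("A", frozenset({"dog", "on", "sofa"}), "dog on sofa", ("dog", "on", "sofa")),
    Example("B", frozenset({"dog", "photo", "stock", "image"}), "dog photo stock image", ("dog",)),
    Example("C", frozenset({"car", "road"}), "cat under table", ("cat", "table")),
    Example("D", frozenset({"red", "car", "road"}), "red car", ("red", "car")),
]
selected = select(pool, toy_score, min_align=0.5, keep_fraction=0.5)
print([ex.uid for ex in selected])
```

In this toy pool, C is removed by the coarse alignment filter. B is well aligned globally, but replacing its single content phrase changes the score least, so it ranks below D and A and falls outside the 50% budget. With a real model, `toy_score` would be replaced by an image-text similarity computed from learned embeddings.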

Problem

Research questions and friction points this paper is trying to address.

compositional generalization

vision-language pretraining

image-text alignment

data curation

phrase-level supervision

Innovation

Methods, ideas, or system contributions that make the work stand out.

Counterfactual Phrase Intervention

compositional generalization

vision-language pretraining