🤖 AI Summary
CLIP and the large multimodal models (LMMs) built on it exhibit degraded zero-shot performance when common objects appear in atypical combinations, suggesting limitations in compositional generalization. Method: This study investigates how word co-occurrence statistics in the pretraining text (a proxy for co-occurrence of visual concepts) affect compositional generalization. It measures co-occurrence with pointwise mutual information (PMI), which disentangles the joint frequency of a word pair from the individual word frequencies, yielding a statistically grounded measure of how typical a concept combination is. Contribution/Results: The study shows strong correlations between PMI in the pretraining data and model accuracy across both synthetic and natural images: r = 0.97 on synthetic concept-composition tasks (with a 14% accuracy gap between the top and bottom 5% of PMI values), r = 0.75 on edited natural images, and r = 0.70 and 0.62 on TextVQA and VQAv2, respectively. These results demonstrate that rare concept combinations hurt accuracy even when each individual concept is common, and that the effect transfers from CLIP to LMMs built on it, offering an interpretable, quantitative statistical lens on compositional generalization in multimodal models.
📝 Abstract
CLIP and large multimodal models (LMMs) have better accuracy on examples involving concepts that are highly represented in the training data. However, the role of concept combinations in the training data on compositional generalization is largely unclear -- for instance, how does accuracy vary when a common object appears in an uncommon pairing with another object? In this paper, we investigate how word co-occurrence statistics in the pretraining dataset (a proxy for co-occurrence of visual concepts) impact CLIP/LMM performance. To disentangle the effects of word co-occurrence frequencies from single-word frequencies, we measure co-occurrence with pointwise mutual information (PMI), which normalizes the joint probability of two words co-occurring by the probability that they would co-occur if they were independent (the product of their marginal probabilities). Using synthetically generated images with a variety of concept pairs, we show a strong correlation between PMI in the CLIP pretraining data and zero-shot accuracy in CLIP models trained on LAION-400M (r=0.97 and a 14% accuracy gap between images in the top and bottom 5% of PMI values), demonstrating that even accuracy on common concepts is affected by the combination of concepts in the image. Leveraging this finding, we reproduce this effect in natural images by editing them to contain pairs with varying PMI, resulting in a correlation of r=0.75. Finally, we demonstrate that this behavior in CLIP transfers to LMMs built on top of CLIP (r=0.70 for TextVQA, r=0.62 for VQAv2). Our findings highlight the need for algorithms and architectures that improve compositional generalization in multimodal models without scaling the training data combinatorially. Our code is available at https://github.com/helenqu/multimodal-pretraining-pmi.
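The PMI measure described above can be sketched in a few lines. The helper below and the toy captions are illustrative assumptions, not the authors' implementation (which operates on the LAION-400M caption corpus): it estimates p(x, y), p(x), and p(y) as caption-level frequencies and returns PMI(x, y) = log(p(x, y) / (p(x) p(y))) for every word pair that co-occurs in at least one caption.

```python
from collections import Counter
from itertools import combinations
from math import log

def pmi_scores(captions):
    """PMI for every word pair co-occurring in a caption.

    p(x, y) = fraction of captions containing both words;
    p(x)    = fraction of captions containing word x;
    PMI(x, y) = log( p(x, y) / (p(x) * p(y)) ).
    """
    n = len(captions)
    unigram = Counter()   # captions containing each word
    pair = Counter()      # captions containing each word pair
    for cap in captions:
        words = set(cap.lower().split())  # count each word once per caption
        unigram.update(words)
        pair.update(frozenset(p) for p in combinations(sorted(words), 2))
    scores = {}
    for pairset, c in pair.items():
        a, b = sorted(pairset)
        scores[(a, b)] = log((c / n) / ((unigram[a] / n) * (unigram[b] / n)))
    return scores

# Toy corpus: "dog" is common, but "dog"+"skateboard" is a rarer pairing.
scores = pmi_scores([
    "a dog on a skateboard",
    "a dog in a park",
    "a cat in a park",
    "a dog chasing a cat",
])
# p(dog, skateboard) = 1/4, p(dog) = 3/4, p(skateboard) = 1/4,
# so PMI = log((1/4) / (3/16)) = log(4/3) ≈ 0.29
```

A PMI of 0 means a pair co-occurs exactly as often as chance predicts; positive values indicate typical pairings and negative values atypical ones, which is what makes PMI a cleaner signal than raw co-occurrence counts when the individual words differ widely in frequency.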