The in-context inductive biases of vision-language models differ across modalities

📅 2025-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates how in-context inductive biases in vision-language models (VLMs) depend on the input modality (image vs. text) and on how stimuli are described in text. Method: Adapting category-generalization paradigms from cognitive science, the authors present few-shot exemplars that leave the category boundary ambiguous and compare generalization across three experimental paradigms and three VLMs, isolating the effects of visual versus textual exemplars and of the syntactic form of the textual descriptions. Contribution/Results: They find that (1) the models generally favor generalizing by shape over color, and this shape bias tends to be amplified when exemplars are presented visually; (2) when exemplars are presented in text, the ordering of adjectives modulates generalization; and (3) the strength of these effects varies across models and paradigms rather than being universal. These results help characterize how VLMs represent different types of in-context inputs and offer practical guidance for prompting and aligning such models across modalities.

📝 Abstract
Inductive biases are what allow learners to make guesses in the absence of conclusive evidence. These biases have often been studied in cognitive science using concepts or categories -- e.g. by testing how humans generalize a new category from a few examples that leave the category boundary ambiguous. We use these approaches to study generalization in foundation models during in-context learning. Modern foundation models can condition on both vision and text, and differences in how they interpret and learn from these different modalities are an emerging area of study. Here, we study how their generalizations vary by the modality in which stimuli are presented, and the way the stimuli are described in text. We study these biases with three different experimental paradigms, across three different vision-language models. We find that the models generally show some bias towards generalizing according to shape over color. This shape bias tends to be amplified when the examples are presented visually. By contrast, when examples are presented in text, the ordering of adjectives affects generalization. However, the extent of these effects varies across models and paradigms. These results help to reveal how vision-language models represent different types of inputs in context, and may have practical implications for the use of vision-language models.
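
The abstract describes the category-generalization setup only at a high level. As a rough illustration (not the authors' stimuli or code), the Python sketch below shows how the text-modality condition might be posed: shape and color are confounded in the few-shot exemplars, so the probe reveals whether the model generalizes by shape or by color, and the adjective order in the descriptions can be varied. The `query_model` callable, the "dax" label, and the specific shapes and colors are all hypothetical placeholders.

```python
# Minimal sketch (assumptions, not the paper's implementation) of a text-modality
# category-generalization trial with an ambiguous category boundary.

def describe(shape_adj: str, color_adj: str, order: str) -> str:
    """Render a stimulus as text, with either the color or the shape adjective first."""
    first, second = (color_adj, shape_adj) if order == "color_first" else (shape_adj, color_adj)
    return f"a {first} {second} object"

def build_prompt(order: str) -> str:
    """Few-shot prompt whose exemplars share both shape and color (ambiguous rule),
    followed by a probe that pits a shape match against a color match."""
    exemplars = [("square", "red"), ("square", "red")]   # the taught category: red squares
    shape_match = describe("square", "blue", order)       # same shape, different color
    color_match = describe("round", "red", order)         # same color, different shape
    lines = [f"{describe(s, c, order)} is a dax." for s, c in exemplars]
    lines.append(f"Which one is a dax: (A) {shape_match} or (B) {color_match}? Answer:")
    return "\n".join(lines)

def shape_choice_rate(query_model, orders=("color_first", "shape_first")) -> dict:
    """Fraction of probes answered with the shape-consistent option (A), by adjective order."""
    rates = {}
    for order in orders:
        answer = query_model(build_prompt(order))          # hypothetical VLM text call
        rates[order] = 1.0 if answer.strip().upper().startswith("A") else 0.0
    return rates
```

In the visual condition described in the abstract, the same exemplars would presumably be rendered as images rather than text; comparing shape-choice rates across the two presentation modes is the kind of contrast the paper reports.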
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models
Generalization Ability
Inductive Bias
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-Language Models
Generalization Preferences
Input Modality Effects
Kelsey Allen
Research Scientist, DeepMind
Artificial Intelligence · Cognitive Science · Computational Neuroscience · Collective Behavior · Physics

Ishita Dasgupta
Google DeepMind, Mountain View, CA, USA

Eliza Kosoy
UC Berkeley

Andrew K. Lampinen
Google DeepMind, Mountain View, CA, USA