Can We Talk Models Into Seeing the World Differently?

📅 2024-03-14
📈 Citations: 10
Influential: 0
🤖 AI Summary
This work investigates which visual cues, texture versus shape, vision-language models (VLMs) rely on during multimodal fusion, and whether that reliance can be controlled. Methodologically, it combines systematic prompt engineering, bias quantification, and attribution analysis across mainstream VLM architectures (e.g., CLIP, BLIP). Key contributions: (1) joint vision-language training intrinsically increases global/shape sensitivity in VLMs, markedly attenuating the local/texture bias of standalone vision encoders; (2) natural language prompts can efficiently steer models toward texture-based judgments, whereas steering toward shape-based decisions remains comparatively difficult. These findings show that linguistic queries modulate not only high-level semantics but also low- and mid-level visual processing, revealing a previously underexplored form of cross-modal control and providing both theoretical insight and empirical grounding for controllable, cue-aware multimodal perception systems.
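The bias quantification mentioned above typically follows the cue-conflict protocol: the model sees images whose shape comes from one class and texture from another, and the score measures how often its prediction follows the shape cue. A minimal sketch of that score, in the style of the standard shape-bias metric (the function name and record layout are illustrative assumptions, not the paper's code):

```python
def shape_bias(records):
    """Shape-bias score on cue-conflict images.

    records: iterable of (prediction, shape_label, texture_label) tuples,
    one per cue-conflict image.

    Returns the fraction of cue-following decisions that followed shape,
    counting only images classified as either the shape or the texture
    class (predictions matching neither cue are ignored).
    """
    shape_hits = texture_hits = 0
    for pred, shape_lbl, texture_lbl in records:
        if pred == shape_lbl:
            shape_hits += 1
        elif pred == texture_lbl:
            texture_hits += 1
    decided = shape_hits + texture_hits
    return shape_hits / decided if decided else float("nan")
```

A score of 1.0 means purely shape-driven decisions, 0.0 purely texture-driven; the paper's claim is that VLMs land notably closer to the shape end than their standalone vision encoders do.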

📝 Abstract
Unlike traditional vision-only models, vision language models (VLMs) offer an intuitive way to access visual content through language prompting by combining a large language model (LLM) with a vision encoder. However, both the LLM and the vision encoder come with their own set of biases, cue preferences, and shortcuts, which have been rigorously studied in uni-modal models. A timely question is how such (potentially misaligned) biases and cue preferences behave under multi-modal fusion in VLMs. As a first step towards a better understanding, we investigate a particularly well-studied vision-only bias - the texture vs. shape bias and the dominance of local over global information. As expected, we find that VLMs inherit this bias to some extent from their vision encoders. Surprisingly, the multi-modality alone proves to have important effects on the model behavior, i.e., the joint training and the language querying change the way visual cues are processed. While this direct impact of language-informed training on a model's visual perception is intriguing, it raises further questions on our ability to actively steer a model's output so that its prediction is based on particular visual cues of the user's choice. Interestingly, VLMs have an inherent tendency to recognize objects based on shape information, which is different from what a plain vision encoder would do. Further active steering towards shape-based classifications through language prompts is however limited. In contrast, active VLM steering towards texture-based decisions through simple natural language prompts is often more successful. URL: https://github.com/paulgavrikov/vlm_shapebias
Problem

Research questions and friction points this paper is trying to address.

Investigates biases in vision-language models (VLMs) during multi-modal fusion.
Explores how language prompts influence visual cue processing in VLMs.
Examines limitations in steering VLMs towards shape or texture-based classifications.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines an LLM with a vision encoder so that visual content can be accessed through language prompting.
Shows that joint vision-language training shifts models toward shape-based recognition relative to their standalone vision encoders.
Demonstrates that simple natural language prompts can steer VLM decisions toward texture cues, while steering toward shape is more limited.
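The steering explored here amounts to prepending a cue instruction to an otherwise standard zero-shot classification prompt. A minimal sketch of such steered prompt construction (the exact wording is an illustrative assumption, not the prompts used in the paper):

```python
# Cue-steering instructions prepended to a standard zero-shot template.
# Wording is illustrative; the paper's exact prompts may differ.
STEERING = {
    "neutral": "",
    "texture": "Identify the object based on its texture, not its shape. ",
    "shape": "Identify the object based on its shape, not its texture. ",
}

def steered_prompts(classnames, cue="neutral"):
    """Build one steered prompt per candidate class for zero-shot scoring."""
    prefix = STEERING[cue]
    return [f"{prefix}A photo of a {name}." for name in classnames]
```

Each prompt set is then embedded by the VLM's text encoder, and the class with the highest image-text similarity wins; the paper's finding is that the texture instruction shifts decisions more reliably than the shape instruction does.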