🤖 AI Summary
This study investigates whether vision-language models (VLMs) possess genuine reasoning capabilities in cross-distribution multimodal in-context learning (MM-ICL), or instead rely on shallow heuristics such as answer copying. Focusing on settings where support examples and queries originate from different datasets, the authors introduce “reasoning-augmented MM-ICL”—a paradigm that integrates generated explanatory rationales into the ICL process. Using both open-source (3B–72B) and closed-source (Gemini 2.0) VLMs, they combine prompt engineering, distribution-shift evaluation, and controlled ablation studies. Results reveal that VLM performance is remarkably insensitive to key factors—including the number of demonstrations, retrieval strategy, rationale quality, and data distribution—indicating that models fail to leverage demonstrations for task-specific reasoning. This work constitutes the first systematic empirical demonstration that current VLMs lack demonstration-driven generalization in MM-ICL, providing critical evidence to guide future research on interpretable, reasoning-aware multimodal modeling.
📝 Abstract
Vision-language models (VLMs) are widely assumed to exhibit in-context learning (ICL), a property similar to that of their language-only counterparts. While recent work suggests VLMs can perform multimodal ICL (MM-ICL), studies show they often rely on shallow heuristics -- such as copying or majority voting -- rather than true task understanding. We revisit this assumption by evaluating VLMs under distribution shifts, where support examples come from a dataset different from the query's. Surprisingly, performance often degrades with more demonstrations, and models tend to copy answers rather than learn from them. To investigate further, we propose a new MM-ICL with Reasoning pipeline that augments each demonstration with a generated rationale alongside the answer. We conduct extensive experiments on both perception- and reasoning-oriented datasets with open-source VLMs ranging from 3B to 72B and proprietary models such as Gemini 2.0, running controlled studies that vary shot count, retrieval method, rationale quality, and distribution. Our results show limited performance sensitivity across these factors, suggesting that current VLMs do not effectively utilize demonstration-level information as intended in MM-ICL.
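The "MM-ICL with Reasoning" setup described above — each demonstration carrying a generated rationale alongside its answer — can be sketched as a prompt-construction routine. This is an illustrative reconstruction, not the paper's actual code: the field names, the `<image:…>` placeholder convention, and the template layout are all assumptions.

```python
# Hypothetical sketch of a reasoning-augmented MM-ICL prompt builder.
# Each demonstration contributes an image placeholder, a question, a
# generated rationale, and an answer; the query contributes only its
# image and question, leaving the rationale/answer for the model.
# Template and field names are illustrative, not the paper's format.

def build_prompt(demonstrations, query):
    """Assemble a reasoning-augmented multimodal ICL prompt string."""
    parts = []
    for demo in demonstrations:
        parts.append(
            f"<image:{demo['image']}>\n"
            f"Question: {demo['question']}\n"
            f"Rationale: {demo['rationale']}\n"
            f"Answer: {demo['answer']}"
        )
    # The query block ends at "Rationale:" so the model must produce
    # its own reasoning before committing to an answer.
    parts.append(
        f"<image:{query['image']}>\n"
        f"Question: {query['question']}\n"
        f"Rationale:"
    )
    return "\n\n".join(parts)

demos = [{
    "image": "img_001.jpg",
    "question": "How many birds are on the wire?",
    "rationale": "Count each bird from left to right: there are three.",
    "answer": "3",
}]
query = {"image": "img_042.jpg", "question": "What color is the car?"}
prompt = build_prompt(demos, query)
```

Under a cross-distribution evaluation, `demos` would be drawn from one dataset and `query` from another; the paper's ablations then vary the number of demonstrations, how they are retrieved, and the quality of the rationales.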