🤖 AI Summary
This study investigates whether vision-language models (VLMs) possess genuine reasoning capabilities in cross-distribution multimodal in-context learning (MM-ICL), or instead rely on shallow heuristics such as answer copying. Focusing on settings where support examples and queries originate from different datasets, the authors introduce “reasoning-augmented MM-ICL”—a paradigm that integrates generated explanatory rationales into the ICL process. Using both open-source (3B–72B) and closed-source (Gemini 2.0) VLMs, they combine prompt engineering, distribution-shift evaluation, and controlled ablation studies. Results reveal that VLM performance is remarkably insensitive to key factors—including the number of demonstrations, retrieval strategy, rationale quality, and data distribution—indicating that models fail to leverage demonstrations for task-specific reasoning. This work constitutes the first systematic empirical demonstration that current VLMs lack demonstration-driven generalization in MM-ICL, providing critical evidence to guide future research on interpretable, reasoning-aware multimodal modeling.
📝 Abstract
Vision-language models (VLMs) are widely assumed to exhibit in-context learning (ICL), a property similar to that of their language-only counterparts. While recent work suggests VLMs can perform multimodal ICL (MM-ICL), studies show they often rely on shallow heuristics -- such as copying or majority voting -- rather than true task understanding. We revisit this assumption by evaluating VLMs under distribution shifts, where support examples come from a dataset different from the query's. Surprisingly, performance often degrades with more demonstrations, and models tend to copy answers rather than learn from them. To investigate further, we propose a new MM-ICL with Reasoning pipeline that augments each demonstration with a generated rationale alongside the answer. We conduct extensive experiments on both perception- and reasoning-oriented datasets with open-source VLMs ranging from 3B to 72B and proprietary models such as Gemini 2.0, running controlled studies that vary shot count, retrieval method, rationale quality, and distribution. Our results show limited performance sensitivity across these factors, suggesting that current VLMs do not effectively utilize demonstration-level information as intended in MM-ICL.
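The "MM-ICL with Reasoning" setup described above — each demonstration carrying a generated rationale alongside its answer — can be sketched as a prompt-construction routine. This is an illustrative reconstruction, not the paper's actual code: the field names, the `<image:…>` placeholder convention, and the template layout are all assumptions.

```python
# Hypothetical sketch of a reasoning-augmented MM-ICL prompt builder.
# Each demonstration contributes an image placeholder, a question, a
# generated rationale, and an answer; the query contributes only its
# image and question, leaving the rationale/answer for the model.
# Template and field names are illustrative, not the paper's format.

def build_prompt(demonstrations, query):
    """Assemble a reasoning-augmented multimodal ICL prompt string."""
    parts = []
    for demo in demonstrations:
        parts.append(
            f"<image:{demo['image']}>\n"
            f"Question: {demo['question']}\n"
            f"Rationale: {demo['rationale']}\n"
            f"Answer: {demo['answer']}"
        )
    # The query block ends at "Rationale:" so the model must produce
    # its own reasoning before committing to an answer.
    parts.append(
        f"<image:{query['image']}>\n"
        f"Question: {query['question']}\n"
        f"Rationale:"
    )
    return "\n\n".join(parts)

demos = [{
    "image": "img_001.jpg",
    "question": "How many birds are on the wire?",
    "rationale": "Count each bird from left to right: there are three.",
    "answer": "3",
}]
query = {"image": "img_042.jpg", "question": "What color is the car?"}
prompt = build_prompt(demos, query)
```

Under a cross-distribution evaluation, `demos` would be drawn from one dataset and `query` from another; the paper's ablations then vary the number of demonstrations, how they are retrieved, and the quality of the rationales.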