🤖 AI Summary
This work investigates why multimodal in-context learning (ICL) underperforms its text-only counterpart in few-shot settings, a phenomenon whose underlying mechanism remains unclear. The study decomposes multimodal ICL into two stages, task mapping construction and task mapping transfer, and identifies two bottlenecks: visual and textual representations are not aligned at the reasoning level, and task mappings are not reliably transferred to query samples across layers. These findings come from analyzing how models establish cross-modal task mappings and tracking how those mappings propagate across layers. Guided by them, the authors propose a simple inference-stage method that reinforces task mapping transfer. Experiments show that multimodal ICL performs comparably to text-only ICL in zero-shot settings but degrades significantly in few-shot settings; the proposed approach improves the reliability of task mapping transfer and mitigates this performance drop.
📝 Abstract
In-context learning (ICL) enables models to adapt to new tasks via inference-time demonstrations. Despite its success in large language models, the extension of ICL to multimodal settings remains poorly understood, both in its internal mechanisms and in how it differs from text-only ICL. In this work, we conduct a systematic analysis of ICL in multimodal large language models. Using identical task formulations across modalities, we show that multimodal ICL performs comparably to text-only ICL in zero-shot settings but degrades significantly under few-shot demonstrations. To understand this gap, we decompose multimodal ICL into task mapping construction and task mapping transfer, and analyze how models establish cross-modal task mappings and transfer them to query samples across layers. Our analysis reveals that current models lack reasoning-level alignment between visual and textual representations and fail to reliably transfer learned task mappings to queries. Guided by these findings, we propose a simple inference-stage enhancement method that reinforces task mapping transfer. Our results provide new insights into the mechanisms and limitations of multimodal ICL and suggest directions for more effective multimodal adaptation. Our code is available at https://github.com/deeplearning-wisc/Multimocal-ICL-Analysis-Framework-MGI.