🤖 AI Summary
Current multimodal models struggle to learn task-local rules, procedures, and empirical patterns from teaching contexts such as images, screenshots, manuals, and videos, and to apply them to new visual instances. To address this gap, this work introduces MMCL-Bench, a benchmark for evaluating multimodal context learning (MMCL), comprising 102 tasks across three categories: rule system application, procedural task execution, and empirical discovery and induction. These tasks require models to locate relevant evidence within multimodal contexts before reasoning over it. Using strict rubric-based scoring, diagnostic ablations, and error analysis, the work identifies failures throughout the context-to-answer pipeline: context anchoring, visual evidence extraction, context reasoning, and response construction. Even the strongest model solves fewer than one-third of the tasks under strict evaluation, which marks MMCL as an important unsolved capability bottleneck for current multimodal models.
📝 Abstract
We introduce MMCL-Bench, a benchmark for multimodal context learning: learning task-local rules, procedures, and empirical patterns from visual or mixed-modality teaching context and applying them to new visual instances. Unlike text-only context learning or standard multimodal question answering, this setting requires models to recover and localize relevant evidence from images, screenshots, manuals, videos, and frame sequences before they can reason over the learned context. MMCL-Bench contains 102 tasks spanning three categories: rule system application, procedural task execution, and empirical discovery and induction. We evaluate frontier multimodal models with strict rubric-based scoring and find that current systems remain far from robust multimodal context learning, with even the strongest model solving fewer than one-third of tasks under strict evaluation. Diagnostic ablations and error analysis show that failures arise throughout the context-to-answer pipeline, including context anchoring, visual evidence extraction, context reasoning, and response construction. MMCL-Bench thus highlights multimodal context learning as an important unsolved capability bottleneck for current multimodal models.