🤖 AI Summary
Current LVLM evaluations predominantly focus on closed-ended tasks, failing to adequately characterize open-ended associative reasoning—such as creative association and cross-domain knowledge integration. To address this gap, we introduce MM-OPERA, the first psychometrically grounded visual-language open associative reasoning benchmark, comprising 11,497 distal and contextual association instances. Methodologically, we propose an LLM-as-a-Judge framework augmented with process-oriented reward analysis, enabling fine-grained, interpretable evaluation of both free-form responses and underlying reasoning paths. The benchmark supports cross-domain, cross-cultural, and multilingual assessment. Empirical evaluation reveals substantial limitations in state-of-the-art LVLMs across associative depth, semantic sensitivity, and output diversity. This work establishes a novel evaluation paradigm and empirical foundation for developing human-like creative AI.
📝 Abstract
Large Vision-Language Models (LVLMs) have exhibited remarkable progress. However, deficiencies remain compared to human intelligence, such as hallucination and shallow pattern matching. In this work, we aim to evaluate a fundamental yet underexplored facet of intelligence: association, a cornerstone of human cognition for creative thinking and knowledge integration. Current benchmarks, often limited to closed-ended tasks, fail to capture the complexity of the open-ended associative reasoning vital for real-world applications. To address this, we present MM-OPERA, a systematic benchmark with 11,497 instances across two open-ended tasks: Remote-Item Association (RIA) and In-Context Association (ICA), aligning the evaluation of associative intelligence with human psychometric principles. It challenges LVLMs to exhibit the spirit of divergent thinking and convergent associative reasoning through free-form responses and explicit reasoning paths. We deploy tailored LLM-as-a-Judge strategies to evaluate open-ended outputs, applying process-reward-informed judgment to dissect reasoning with precision. Extensive empirical studies on state-of-the-art LVLMs—including sensitivity analysis of task instances, validity analysis of LLM-as-a-Judge strategies, and diversity analysis across abilities, domains, languages, and cultures—provide a comprehensive and nuanced understanding of the limitations of current LVLMs in associative reasoning, paving the way for more human-like and general-purpose AI. The dataset and code are available at https://github.com/MM-OPERA-Bench/MM-OPERA.