MM-OPERA: Benchmarking Open-ended Association Reasoning for Large Vision-Language Models

📅 2025-10-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current LVLM evaluations focus predominantly on closed-ended tasks and fail to adequately characterize open-ended associative reasoning, such as creative association and cross-domain knowledge integration. To address this gap, we introduce MM-OPERA, the first psychometrically grounded benchmark for open-ended visual-language association reasoning, comprising 11,497 remote-item and in-context association instances. Methodologically, we propose an LLM-as-a-Judge framework augmented with process-oriented reward analysis, enabling fine-grained, interpretable evaluation of both free-form responses and the underlying reasoning paths. The benchmark supports cross-domain, cross-cultural, and multilingual assessment. Empirical evaluation reveals substantial limitations in state-of-the-art LVLMs across associative depth, semantic sensitivity, and output diversity. This work establishes a novel evaluation paradigm and an empirical foundation for developing human-like creative AI.

📝 Abstract
Large Vision-Language Models (LVLMs) have exhibited remarkable progress. However, deficiencies remain compared to human intelligence, such as hallucination and shallow pattern matching. In this work, we aim to evaluate a fundamental yet underexplored intelligence: association, a cornerstone of human cognition for creative thinking and knowledge integration. Current benchmarks, often limited to closed-ended tasks, fail to capture the complexity of open-ended association reasoning vital for real-world applications. To address this, we present MM-OPERA, a systematic benchmark with 11,497 instances across two open-ended tasks: Remote-Item Association (RIA) and In-Context Association (ICA), aligning association intelligence evaluation with human psychometric principles. It challenges LVLMs to resemble the spirit of divergent thinking and convergent associative reasoning through free-form responses and explicit reasoning paths. We deploy tailored LLM-as-a-Judge strategies to evaluate open-ended outputs, applying process-reward-informed judgment to dissect reasoning with precision. Extensive empirical studies on state-of-the-art LVLMs, including sensitivity analysis of task instances, validity analysis of LLM-as-a-Judge strategies, and diversity analysis across abilities, domains, languages, cultures, etc., provide a comprehensive and nuanced understanding of the limitations of current LVLMs in associative reasoning, paving the way for more human-like and general-purpose AI. The dataset and code are available at https://github.com/MM-OPERA-Bench/MM-OPERA.
Problem

Research questions and friction points this paper is trying to address.

Evaluating open-ended association reasoning in vision-language models
Addressing limitations of current benchmarks for creative thinking
Assessing models' ability for divergent and convergent associative reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark for open-ended association reasoning tasks
Tailored LLM-as-a-Judge evaluation strategies
Process-reward-informed judgment for precise reasoning analysis
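The process-reward-informed judgment described above can be sketched in miniature. The step scorer below is a stub standing in for a judge-model call, and the equal weighting of process and outcome scores is an illustrative assumption, not the paper's actual evaluation protocol.

```python
# Hedged sketch of process-reward-informed LLM-as-a-Judge scoring.
# `judge_step`, the weights, and the example path are illustrative
# assumptions, not MM-OPERA's actual implementation.

def judge_step(step: str) -> float:
    """Stub for a judge-model call scoring one reasoning step in [0, 1].
    A real system would prompt an LLM judge here instead."""
    # Toy heuristic: longer, more specific steps score higher.
    return min(1.0, len(step.split()) / 10)

def score_reasoning_path(steps: list[str], final_valid: bool) -> float:
    """Combine per-step process rewards with a final-answer check.
    Process rewards let the judge credit partially sound reasoning
    even when the final association is weak, and vice versa."""
    if not steps:
        return 0.0
    process = sum(judge_step(s) for s in steps) / len(steps)
    outcome = 1.0 if final_valid else 0.0
    # Equal process/outcome weighting is an assumption for illustration.
    return 0.5 * process + 0.5 * outcome

path = [
    "The violin and the rowing oar share an elongated wooden form",
    "Both are moved rhythmically by the arm to produce a coordinated effect",
]
print(round(score_reasoning_path(path, final_valid=True), 3))
```

Dissecting the reasoning path step by step, rather than grading only the final answer, is what makes the judgment interpretable: a low aggregate score can be traced back to the specific associative step that failed.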
👥 Authors
Zimeng Huang
Sun Yat-sen University
Jinxin Ke
Sun Yat-sen University
Xiaoxuan Fan
Jinan University
Yufeng Yang
Sun Yat-sen University
Yang Liu
Sun Yat-sen University
Liu Zhonghan
Sun Yat-sen University
Zedi Wang
Sun Yat-sen University
Junteng Dai
Sun Yat-sen University
Haoyi Jiang
Huazhong University of Science and Technology
Computer Vision · Autonomous Driving
Yuyu Zhou
Jinan University
Keze Wang
Sun Yat-sen University
Ziliang Chen
Pengcheng Lab
Machine Learning · Foundation Models · Multimodal Embodied Intelligence