Aligning MLLM Benchmark With Human Preferences via Structural Equation Modeling

📅 2025-06-13
🏛️ arXiv.org
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Current MLLM evaluation lacks theoretically grounded, structurally principled, and cognitively interpretable benchmarks—suffering from heuristic task grouping, ill-defined capability taxonomies, redundant metrics, and weak diagnostic power. To address this, we propose GOLD: the first hierarchical evaluation framework for MLLMs grounded in Piaget’s theory of cognitive development, organizing capabilities into three orthogonal levels—Perception, Memory, and Reasoning. GOLD pioneers the integration of Structural Equation Modeling (SEM) into MLLM assessment, enabling rigorous capability dimension orthogonalization, quantification of internal validity, and fine-grained contribution attribution. Through task remapping and metric decoupling, GOLD achieves substantial improvements: +42% expert agreement, −68% metric redundancy, and a 3.1× increase in cross-task capability separation. Its diagnostic accuracy surpasses mainstream benchmarks including MMBench and OCRBench.
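As a concrete (hypothetical) illustration of the SEM step, the sketch below fits a three-factor confirmatory factor analysis over per-model task scores with the open-source semopy library. The task names, file name, and factor structure are placeholder assumptions for illustration, not the paper's actual model specification.

```python
# Sketch: three-factor CFA (Perception / Memory / Reasoning) over
# per-model task scores -- the style of SEM analysis GOLD builds on.
# All task/column names below are hypothetical placeholders.
import pandas as pd
from semopy import Model, calc_stats

# Measurement model in lavaan-style syntax: each latent capability
# is indicated by a handful of observed task scores.
SPEC = """
Perception =~ ocr + grounding + counting
Memory =~ knowledge_qa + long_context_recall
Reasoning =~ math_vqa + logic_vqa
"""

# One row per evaluated MLLM, one column per task metric (hypothetical file).
scores = pd.read_csv("mllm_task_scores.csv")

model = Model(SPEC)
model.fit(scores)

print(model.inspect())    # factor loadings and latent covariances
print(calc_stats(model))  # fit indices (CFI, RMSEA, ...) for internal validity
```

In this reading, low covariance between the latent factors indicates well-separated (orthogonal) capability dimensions, while global fit indices quantify how well the proposed task grouping matches the observed score structure.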

📝 Abstract
Evaluating multimodal large language models (MLLMs) is fundamentally challenged by the absence of structured, interpretable, and theoretically grounded benchmarks; current heuristically grouped tasks have vague cognitive targets, overlapping abilities, redundant indicators, and weak diagnostic power. We therefore propose a structural-equation-modeling-aligned framework that quantifies internal validity, dimensional separability, and component contributions, and introduce a Piaget-inspired capability hierarchy that stratifies MLLM abilities into Perception, Memory, and Reasoning. Reorganizing existing tasks under this theory, we build the GOLD benchmark, whose experiments show superior interpretability, lower indicator redundancy, and clearer cognitive consistency than prior benchmarks.
Problem

Research questions and friction points this paper is trying to address.

Developing interpretable benchmarks for multimodal large language models
Addressing overlapping abilities and redundant evaluation indicators
Establishing theoretical foundations for MLLM capability assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Structural equation modeling framework for MLLM evaluation
Piaget-inspired hierarchical capability stratification
Reorganized benchmark with reduced indicator redundancy (see the pruning sketch after this list)
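The reduced-redundancy claim can be made concrete with a generic correlation-based pruning pass. This is an assumed, simplified stand-in, not GOLD's actual metric-decoupling procedure:

```python
# Generic illustration of indicator-redundancy pruning (not the paper's
# exact procedure): greedily keep a metric only if its absolute
# correlation with every already-kept metric stays below a threshold.
import pandas as pd

def prune_redundant_metrics(scores: pd.DataFrame, threshold: float = 0.9) -> list[str]:
    """Return a de-duplicated subset of metric columns.

    scores: one row per evaluated model, one column per metric
    (hypothetical data layout).
    """
    corr = scores.corr().abs()
    kept: list[str] = []
    for metric in scores.columns:
        if all(corr.loc[metric, k] < threshold for k in kept):
            kept.append(metric)
    return kept
```

Applied to a benchmark's raw indicator matrix, the fraction of dropped columns gives one rough measure of metric redundancy.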
👥 Authors
Tianyu Zou
School of Computer and Artificial Intelligence, Wuhan University of Technology, Wuhan 430070, China; also with the Sanya Science and Education Innovation Park, Wuhan University of Technology, Sanya 572000, China, and the Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
Shengwu Xiong
Wuhan University of Technology (Artificial Intelligence)
Ruilin Yao
Jirui Huang
Yi Rong
Yaxiong Chen
Wuhan University of Technology (deep hashing, deep learning)
Shili Xiong
Cong Wang
School of Mathematics and Statistics, Northwestern Polytechnical University, Xi'an 710129, China; also with the Shanghai Artificial Intelligence Laboratory, Shanghai 200232, China