MIBench: Evaluating LMMs on Multimodal Interaction

📅 2026-03-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current Large Multimodal Models (LMMs) lack systematic evaluation of their multimodal interaction capabilities. To address this gap, this work proposes MIBench, a structured benchmark that, for the first time, assesses LMMs on how they source information from vision-centric or text-centric cues and how they generate new information from cross-modal synergy, each evaluated across three cognitive levels: Recognition, Understanding, and Reasoning. The benchmark comprises 32 task categories and over 10,000 vision-text sample pairs, each organized as a unified (con_v, con_t, task) triplet. Evaluation with MIBench reveals pervasive issues in existing LMMs, including a strong text-dominant bias and weak collaborative generation ability. Notably, even natively trained multimodal models exhibit deficits in basic interaction mechanisms. These findings provide critical insights and concrete directions for future research on multimodal interaction modeling.

📝 Abstract
In different multimodal scenarios, a model needs to integrate and utilize information across modalities in a way that suits the demands of the task. These different ways of integrating modalities are referred to as "multimodal interaction". How well a model handles various multimodal interactions largely characterizes its multimodal ability. In this paper, we introduce MIBench, a comprehensive benchmark designed to evaluate the multimodal interaction capabilities of Large Multimodal Models (LMMs). MIBench formulates each instance as a (con_v, con_t, task) triplet with contexts from vision and text, requiring LMMs to employ the correct form of multimodal interaction to complete the task. It assesses models along three key aspects: the ability to source information from vision-centric cues, the ability to source information from text-centric cues, and the ability to generate new information from their joint synergy. Each interaction capability is evaluated hierarchically across three cognitive levels: Recognition, Understanding, and Reasoning. MIBench comprises over 10,000 vision-text context pairs spanning 32 distinct tasks. Evaluation of state-of-the-art LMMs shows that: (1) LMMs' multimodal interaction ability remains constrained, despite the scaling of model parameters and training data; (2) they are easily distracted by the textual modality when processing vision information; (3) they mostly possess only a basic capacity for multimodal synergy; and (4) natively trained multimodal models show noticeable deficits in fundamental interaction ability. We expect these observations to serve as a reference for developing LMMs with enhanced multimodal ability in the future.
Problem

Research questions and friction points this paper is trying to address.

multimodal interaction
Large Multimodal Models
vision-text integration
multimodal evaluation
cross-modal synergy
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal interaction
MIBench
large multimodal models
cross-modal integration
hierarchical evaluation
Yu Miao
Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China; Beijing Key Laboratory of Research on Large Models and Intelligent Governance, Beijing, China
Zequn Yang
Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China; Beijing Key Laboratory of Research on Large Models and Intelligent Governance, Beijing, China
Yake Wei
Renmin University of China
multimodal learning
Ziheng Chen
Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China; Beijing Key Laboratory of Research on Large Models and Intelligent Governance, Beijing, China
Haotian Ni
Beihang University, Beijing, China
Haodong Duan
Shanghai AI Lab | CUHK | PKU
Computer Vision, Video Understanding, Multimodal Learning, Generative AI
Kai Chen
Shanghai AI Laboratory
LLM, VLM, Computer Vision
Di Hu
Tenure-track Associate Professor, Renmin University of China
Multimodal Perception, Multimodal Learning, Multimodal Interaction