🤖 AI Summary
Current multimodal large language models (MLLMs) show significant limitations in relational reasoning, largely because large-scale, high-quality relation-annotated data are scarce. To address this, we introduce MMRel, a dedicated benchmark for relational understanding in MLLMs. MMRel has three key attributes: (1) over 22K cross-domain, multi-category, manually verified vision-language question-answer pairs covering spatial, functional, and semantic relations; (2) adversarial samples built around highly unusual relations to probe relation hallucination and strengthen evaluation robustness; and (3) a design that supports both rigorous evaluation and fine-tuning. Extensive experiments show that MMRel is effective both for evaluating MLLMs' relational reasoning and for improving it through fine-tuning. The benchmark is publicly released.
📝 Abstract
Though Multi-modal Large Language Models (MLLMs) have recently achieved significant progress, they often struggle with inter-object relations, i.e., the interactions or associations among distinct objects. This limitation largely stems from insufficient training and evaluation data for relation understanding, which has greatly impeded MLLMs in various vision-language generation and reasoning tasks. We attempt to address this challenge by introducing Multi-Modal Relation Understanding (MMRel), a benchmark that features large-scale, high-quality, and diverse data on inter-object relations. MMRel has three distinctive attributes: (i) it contains over 22K question-answer pairs, spanning three distinct domains and covering three relation categories, ensuring both scale and diversity; (ii) it provides manually verified, high-quality labels to ensure exceptional annotation accuracy; (iii) it includes adversarial cases with highly unusual relations, offering a challenging setting for evaluating relation hallucination. These features make MMRel ideal for evaluating MLLMs on relation understanding, as well as for fine-tuning MLLMs to enhance their relation comprehension capability. Extensive experiments verify the effectiveness of MMRel in evaluating and enhancing MLLMs' relation understanding capabilities. The benchmark has been released publicly at: https://niejiahao1998.github.io/MMRel/