🤖 AI Summary
Current multimodal large language models (MLLMs) lack comprehensive, reproducible evaluation benchmarks, hindering accurate characterization of their multimodal understanding capabilities. To address this, we introduce MME—the first holistic benchmark for MLLM evaluation—comprising 14 fine-grained subtasks (e.g., OCR, visual reasoning, commonsense reasoning) that systematically assess both perceptual and cognitive abilities. All instruction-answer pairs are manually crafted to prevent data leakage and training contamination; a minimal, generic instruction template is employed to decouple intrinsic model capability from prompt engineering effects. MME provides a standardized evaluation protocol, an open-source dataset, and a public leaderboard. Applying MME to uniformly evaluate 30 state-of-the-art MLLMs reveals critical bottlenecks in cross-modal alignment and fine-grained perception, offering concrete insights for future research and development.
📝 Abstract
Multimodal Large Language Models (MLLMs) rely on a powerful LLM to perform multimodal tasks, and recent studies have shown amazing emergent abilities, such as writing poems based on an image. However, such case studies cannot fully reflect the performance of an MLLM, and a comprehensive evaluation has been lacking. In this paper, we fill this blank by presenting MME, the first comprehensive evaluation benchmark for MLLMs. It measures both perception and cognition abilities across a total of 14 subtasks. To avoid the data leakage that may arise from directly using public datasets for evaluation, all instruction-answer pairs are manually designed. The concise instruction design allows us to compare MLLMs fairly, instead of struggling with prompt engineering; it also makes quantitative statistics easy to carry out. A total of 30 advanced MLLMs are comprehensively evaluated on MME, which not only shows that existing MLLMs still have large room for improvement, but also reveals potential directions for subsequent model optimization. The data application procedure and online leaderboards are released at https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation.
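Because every instruction asks only for a short "yes" or "no" answer, model outputs can be scored with simple string matching rather than task-specific parsing. A minimal sketch of such scoring; the function name and the first-token matching rule are illustrative assumptions, not MME's exact protocol:

```python
def score_yes_no(predictions, answers):
    """Score free-form model replies against yes/no ground-truth labels.

    The matching rule (take the leading "yes"/"no" of the reply) is an
    illustrative assumption, not the benchmark's official parser.
    """
    correct = 0
    for pred, ans in zip(predictions, answers):
        pred = pred.strip().lower()
        # Reduce the reply to a yes/no label from its leading token.
        if pred.startswith("yes"):
            label = "yes"
        elif pred.startswith("no"):
            label = "no"
        else:
            label = None  # unparseable reply counts as wrong
        if label == ans:
            correct += 1
    return correct / len(answers)

# Replies like "Yes." or "No, it is not." still score cleanly.
acc = score_yes_no(["Yes.", "no", "Yes, it is."], ["yes", "no", "no"])
```

This is what the abstract means by the concise design enabling quantitative statistics: accuracy can be computed uniformly across all 14 subtasks without per-model output parsing.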