MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs

📅 2024-07-01

🏛️ arXiv.org

📈 Citations: 10

✨ Influential: 1

career value

198K/year

🤖 AI Summary

Existing evaluation benchmarks inadequately assess multimodal large language models’ (MLLMs) capability to follow complex, hierarchical instructions. Method: We introduce MIA-Bench—a rigorously constructed benchmark comprising 400 image-instruction pairs—and propose “instruction fidelity” as a novel, core evaluation dimension. We design structured instruction challenges and a fine-grained compliance assessment protocol. High-quality test samples are generated via human-crafted templates and pattern constraints; model adherence is improved through supervised fine-tuning and instruction-augmented data. Contribution/Results: Experiments reveal substantial performance gaps among state-of-the-art MLLMs on MIA-Bench. Targeted fine-tuning boosts average instruction compliance by 23.6% without degrading general vision-language capabilities. This work establishes a new paradigm for systematic evaluation and optimization of MLLM instruction-following behavior.

Technology Category

Application Category

📝 Abstract

We introduce MIA-Bench, a new benchmark designed to evaluate multimodal large language models (MLLMs) on their ability to strictly adhere to complex instructions. Our benchmark comprises a diverse set of 400 image-prompt pairs, each crafted to challenge the models' compliance with layered instructions in generating accurate responses that satisfy specific requested patterns. Evaluation results from a wide array of state-of-the-art MLLMs reveal significant variations in performance, highlighting areas for improvement in instruction fidelity. Additionally, we create extra training data and explore supervised fine-tuning to enhance the models' ability to strictly follow instructions without compromising performance on other tasks. We hope this benchmark not only serves as a tool for measuring MLLM adherence to instructions, but also guides future developments in MLLM training methods.

Problem

Research questions and friction points this paper is trying to address.

Evaluate multimodal LLMs instruction adherence

Benchmark with 400 challenging image-prompt pairs

Enhance model instruction fidelity via training

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal benchmark for LLMs

Image-prompt pairs evaluation

Supervised fine-tuning enhancement

🔎 Similar Papers

Surveying the MLLM Landscape: A Meta-Review of Current Surveys