MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks

📅 2024-10-14

🏛️ arXiv.org

📈 Citations: 14

✨ Influential: 3

🤖 AI Summary

Problem: Existing multimodal benchmarks do not reflect the highly heterogeneous daily use cases of end users, and many reduce every task to a multiple-choice question. Method: MEGA-Bench collects 505 realistic tasks with over 8,000 samples from 16 expert annotators. It accepts a wide range of output formats (e.g., numbers, phrases, code, LaTeX, coordinates, JSON, free-form text) and scores them with over 40 metrics. Results are reported along multiple dimensions (application, input type, output format, skill) rather than as a single score, and users can interact with and visualize the report. Contribution/Results: A wide variety of frontier vision-language models are evaluated on MEGA-Bench to characterize their capabilities along these dimensions.

📝 Abstract

We present MEGA-Bench, an evaluation suite that scales multimodal evaluation to over 500 real-world tasks, to address the highly heterogeneous daily use cases of end users. Our objective is to optimize for a set of high-quality data samples that cover a highly diverse and rich set of multimodal tasks, while enabling cost-effective and accurate model evaluation. In particular, we collected 505 realistic tasks encompassing over 8,000 samples from 16 expert annotators to extensively cover the multimodal task space. Instead of unifying these problems into standard multi-choice questions (like MMMU, MMBench, and MMT-Bench), we embrace a wide range of output formats like numbers, phrases, code, LaTeX, coordinates, JSON, free-form, etc. To accommodate these formats, we developed over 40 metrics to evaluate these tasks. Unlike existing benchmarks, MEGA-Bench offers a fine-grained capability report across multiple dimensions (e.g., application, input type, output format, skill), allowing users to interact with and visualize model capabilities in depth. We evaluate a wide variety of frontier vision-language models on MEGA-Bench to understand their capabilities across these dimensions.
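
To make the idea of format-aware scoring concrete, here is a minimal Python sketch of dispatching a model response to a metric chosen by the task's output format. It is an illustration only, not MEGA-Bench's implementation: the function names, the `SCORERS` table, and the three simplified metrics are all assumptions made for this example.

```python
import json
import math

# Hypothetical, simplified metrics; the benchmark's own 40+ metrics are not shown here.

def score_number(pred: str, ref: str, rel_tol: float = 1e-6) -> float:
    """Return 1.0 if both strings parse as numbers that agree within rel_tol."""
    try:
        return float(math.isclose(float(pred), float(ref), rel_tol=rel_tol))
    except ValueError:
        return 0.0  # unparseable answers score zero


def score_phrase(pred: str, ref: str) -> float:
    """Case- and whitespace-insensitive exact match."""
    return float(pred.strip().lower() == ref.strip().lower())


def score_json(pred: str, ref: str) -> float:
    """Compare parsed JSON values, so key order and spacing do not matter."""
    try:
        return float(json.loads(pred) == json.loads(ref))
    except json.JSONDecodeError:
        return 0.0  # invalid JSON scores zero


# Maps an output-format label (assumed naming) to its scoring function.
SCORERS = {
    "number": score_number,
    "phrase": score_phrase,
    "json": score_json,
}


def score(output_format: str, pred: str, ref: str) -> float:
    """Score one response with the metric registered for its output format."""
    return SCORERS[output_format](pred, ref)
```

For example, `score("json", '{"a": 1, "b": 2}', '{"b": 2, "a": 1}')` compares the parsed objects rather than the raw strings, which is the kind of tolerance a multiple-choice benchmark never needs.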

Problem

Research questions and friction points this paper is trying to address.

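Existing benchmarks do not cover the highly heterogeneous daily use cases of end users

Broad task coverage has to be balanced against cost-effective, accurate model evaluation

Unifying tasks into multiple-choice questions (as in MMMU, MMBench, and MMT-Bench) cannot assess varied output formats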

Innovation

Methods, ideas, or system contributions that make the work stand out.

Scales multimodal evaluation to 500+ tasks

Embraces diverse output formats with 40+ metrics

Provides fine-grained capability reports across dimensions
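
A fine-grained capability report of the kind described above can be sketched as a group-by over per-task scores. The record layout and the `capability_report` helper below are hypothetical, chosen for illustration; the source does not specify how MEGA-Bench stores or aggregates its results.

```python
from collections import defaultdict
from statistics import mean

# Dimension names follow the abstract; the field names themselves are assumptions.
DIMENSIONS = ("application", "input_type", "output_format", "skill")


def capability_report(results: list[dict], dimension: str) -> dict[str, float]:
    """Average the per-task scores within each value of one dimension."""
    if dimension not in DIMENSIONS:
        raise ValueError(f"unknown dimension: {dimension}")
    buckets: dict[str, list[float]] = defaultdict(list)
    for row in results:
        buckets[row[dimension]].append(row["score"])
    return {key: mean(scores) for key, scores in buckets.items()}
```

Calling `capability_report(results, "output_format")` on a list of per-task records returns one mean score per output format, so the same results can be sliced by application, input type, or skill without re-running the model.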
