ProBench: Judging Multimodal Foundation Models on Open-ended Multi-domain Expert Tasks

📅 2025-03-10
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Problem: Existing multimodal large language models (MLLMs) lack systematic evaluation on open-domain, expert-level tasks across diverse disciplines.
Method: We introduce ProBench, the first professional-grade, open-domain multimodal benchmark: 4,000 real-world questions authored by domain experts across 10 broad fields and 56 subfields, stressing visual perception, textual understanding, domain-specific knowledge, and higher-order reasoning. The evaluation paradigm is grounded in experts' real productivity requirements and pairs the benchmark with an automated MLLM-as-a-Judge protocol, yielding a systematic framework for interdisciplinary, high-difficulty, open-ended generative multimodal tasks.
Contribution/Results: A comprehensive evaluation of 24 state-of-the-art MLLMs shows that the best open-source models now approach their closed-source counterparts in overall performance, yet exhibit persistent bottlenecks in domain-specific reasoning and cross-modal coordination. ProBench provides reproducible, scalable evaluation infrastructure and concrete optimization directions for advancing general multimodal intelligence.
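The summary names an MLLM-as-a-Judge protocol but does not reproduce it. Below is a minimal, illustrative Python sketch of how such a judge is commonly run: pairwise comparison of two models' answers, queried twice with the answer order swapped to reduce position bias. The prompt wording, the verdict tokens, and the `query_judge` callable are assumptions for illustration, not ProBench's actual implementation; a real run would also pass the question's image to the judge.

```python
# Illustrative MLLM-as-a-Judge sketch (not ProBench's exact protocol).
# The prompt template, verdict tokens, and `query_judge` callable are
# assumptions made for this example.
import re
from typing import Callable

PAIRWISE_PROMPT = """You are an impartial expert grader.
[Question]
{question}
[Answer A]
{answer_a}
[Answer B]
{answer_b}
Judge which answer better satisfies the question on correctness,
domain expertise, and use of the visual input.
Reply with exactly one token: A, B, or TIE."""


def judge_pair(question: str, answer_a: str, answer_b: str,
               query_judge: Callable[[str], str]) -> str:
    """Return 'A', 'B', or 'TIE', querying the judge twice with the
    answers in swapped order to reduce position bias."""
    verdicts = []
    for first, second, remap in [
        (answer_a, answer_b, {"A": "A", "B": "B"}),
        (answer_b, answer_a, {"A": "B", "B": "A"}),  # swapped order
    ]:
        raw = query_judge(PAIRWISE_PROMPT.format(
            question=question, answer_a=first, answer_b=second))
        match = re.search(r"\b(A|B|TIE)\b", raw.strip().upper())
        token = match.group(1) if match else "TIE"
        verdicts.append(remap.get(token, "TIE"))
    # Accept the verdict only if both orderings agree; otherwise tie.
    return verdicts[0] if verdicts[0] == verdicts[1] else "TIE"


def win_rate(verdicts: list[str]) -> float:
    """Fraction of comparisons won by model A, counting ties as half."""
    if not verdicts:
        return 0.0
    score = sum({"A": 1.0, "TIE": 0.5}.get(v, 0.0) for v in verdicts)
    return score / len(verdicts)
```

Aggregating `judge_pair` verdicts over a benchmark's questions with `win_rate` yields a per-model score against a fixed baseline, which is one common way MLLM-as-a-Judge leaderboards are computed.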

📝 Abstract
Solving expert-level multimodal tasks is a key milestone towards general intelligence. As the capabilities of multimodal large language models (MLLMs) continue to improve, evaluating such advanced multimodal intelligence becomes necessary yet challenging. In this work, we introduce ProBench, a benchmark of open-ended user queries that require professional expertise and advanced reasoning. ProBench consists of 4,000 high-quality samples independently submitted by professionals based on their daily productivity demands. It spans 10 fields and 56 sub-fields, including science, arts, humanities, coding, mathematics, and creative writing. Experimentally, we evaluate and compare 24 of the latest models using MLLM-as-a-Judge. Our results reveal that although the best open-source models rival the proprietary ones, ProBench presents significant challenges in visual perception, textual understanding, domain knowledge, and advanced reasoning, providing valuable directions for future multimodal AI research.
Problem

Research questions and friction points this paper is trying to address.

Evaluating advanced multimodal intelligence on expert-level tasks
Developing a benchmark for professional-level multimodal reasoning
Identifying challenges in visual, textual, and domain-specific reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

ProBench: expert-level multimodal task benchmark
MLLM-as-a-Judge evaluation of 24 of the latest models
Covers 10 fields and 56 sub-fields with 4,000 samples