🤖 AI Summary
Existing multimodal large language models (MLLMs) lack systematic evaluation on open-domain, expert-level tasks across diverse disciplines. Method: We introduce ProBench, a professional-grade, open-domain multimodal benchmark comprising 4,000 real-world questions authored by domain experts across 10 broad fields and 56 subfields, emphasizing visual perception, textual understanding, domain-specific knowledge, and higher-order reasoning. Grounded in real expert productivity demands, ProBench establishes the first systematic framework for interdisciplinary, high-difficulty, open-ended generative multimodal tasks, paired with an automated MLLM-as-a-Judge evaluation protocol. Contribution/Results: A comprehensive evaluation of 24 state-of-the-art MLLMs shows that open-source models now approach their closed-source counterparts in overall performance, yet persistent bottlenecks remain in domain-specific reasoning and cross-modal coordination. ProBench provides reproducible, scalable evaluation infrastructure and concrete optimization directions for advancing general multimodal intelligence.
📝 Abstract
Solving expert-level multimodal tasks is a key milestone toward general intelligence. As the capabilities of multimodal large language models (MLLMs) continue to improve, evaluating such advanced multimodal intelligence becomes necessary yet challenging. In this work, we introduce ProBench, a benchmark of open-ended user queries that require professional expertise and advanced reasoning. ProBench consists of 4,000 high-quality samples independently submitted by professionals based on their daily productivity demands. It spans 10 fields and 56 sub-fields, including science, arts, humanities, coding, mathematics, and creative writing. Experimentally, we evaluate and compare 24 of the latest models using MLLM-as-a-Judge. Our results reveal that although the best open-source models rival proprietary ones, ProBench still presents significant challenges in visual perception, textual understanding, domain knowledge, and advanced reasoning, thus providing valuable directions for future multimodal AI research.
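To make the MLLM-as-a-Judge idea concrete, here is a minimal sketch of how such an automated grading loop is typically wired up. This is an illustrative assumption, not ProBench's actual protocol: the prompt template, the 1-10 score scale, and the `Rating: [[x]]` verdict format are hypothetical choices (borrowed from common LLM-judge setups), and the judge model call is mocked with a plain string.

```python
# Hypothetical MLLM-as-a-Judge scoring sketch. The prompt template,
# score scale, and verdict format are illustrative assumptions and
# not taken from the ProBench paper.
import re
from typing import Optional

JUDGE_PROMPT = (
    "You are an impartial judge. Rate the assistant's answer to the "
    "user's question on a 1-10 scale for correctness, domain knowledge, "
    "and reasoning quality. End your verdict with 'Rating: [[x]]'.\n\n"
    "Question: {question}\nAnswer: {answer}"
)

def build_judge_prompt(question: str, answer: str) -> str:
    """Fill the judge template with one benchmark sample."""
    return JUDGE_PROMPT.format(question=question, answer=answer)

def parse_rating(verdict: str) -> Optional[int]:
    """Extract the numeric rating from a judge verdict, if present."""
    match = re.search(r"Rating:\s*\[\[(\d+)\]\]", verdict)
    return int(match.group(1)) if match else None

# In a real pipeline the prompt would be sent to a strong judge MLLM
# (together with the question's image); here the verdict is mocked.
prompt = build_judge_prompt("What does this circuit diagram compute?",
                            "It is a full adder.")
verdict = "Mostly correct, but the carry path is unexplained. Rating: [[7]]"
print(parse_rating(verdict))  # → 7
```

Parsing a structured verdict marker rather than free text is what makes the protocol automatable at benchmark scale; samples where the judge emits no parsable rating (`parse_rating` returns `None`) would simply be retried or flagged.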