🤖 AI Summary
Existing Olympiad-level multimodal reasoning benchmarks are largely confined to single-image analysis, so they cannot evaluate whether a model can reason over contextual information spread across multiple images. To address this gap, this work introduces OMIBench, the first systematically constructed benchmark designed specifically for multi-image Olympiad-level reasoning, spanning biology, chemistry, mathematics, and physics. OMIBench provides human-annotated, structured reasoning chains and uses a dual-track evaluation protocol that combines exact-match and semantic-match metrics. Experiments show that even state-of-the-art large vision-language models (LVLMs), such as Gemini-3-Pro, reach only about 50% accuracy on the benchmark. This indicates that complex cross-image reasoning remains a significant challenge and positions OMIBench as a tool for assessing LVLMs’ advanced multimodal reasoning capabilities.
📝 Abstract
Large vision-language models (LVLMs) have made substantial advances in reasoning tasks at the Olympiad level. Nevertheless, current Olympiad-level multimodal reasoning benchmarks for these models often emphasize single-image analysis and fail to exploit contextual information across multiple images. We present OMIBench, a benchmark designed to evaluate Olympiad-level reasoning when the required evidence is distributed over multiple images. It contains problems from biology, chemistry, mathematics, and physics Olympiads, together with manually annotated rationales and evaluation protocols for both exact and semantic answer matching. Across extensive experiments on OMIBench, we observe meaningful performance gaps in existing models. Even the strongest LVLMs, such as Gemini-3-Pro, attain only about 50% accuracy on the benchmark. These results position OMIBench as a focused resource for studying and improving multi-image reasoning in LVLMs.