OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model

📅 2026-04-22

🤖 AI Summary

Existing Olympiad-level multimodal reasoning benchmarks are largely confined to single-image analysis, so they cannot measure whether a model can reason over evidence spread across multiple images. To address this gap, this work introduces OMIBench, described as the first systematically constructed benchmark for multi-image Olympiad-level reasoning, spanning biology, chemistry, mathematics, and physics. OMIBench provides human-annotated, structured reasoning chains and uses a dual-track evaluation protocol that combines exact-match and semantic-match metrics. Experiments show that even state-of-the-art large vision-language models (LVLMs), such as Gemini-3-Pro, reach only about 50% on this benchmark. Complex cross-image reasoning therefore remains a significant challenge, and OMIBench offers a targeted tool for assessing LVLMs’ advanced multimodal reasoning capabilities.

📝 Abstract

Large vision-language models (LVLMs) have made substantial advances in reasoning tasks at the Olympiad level. Nevertheless, current Olympiad-level multimodal reasoning benchmarks for these models often emphasize single-image analysis and fail to exploit contextual information across multiple images. We present OMIBench, a benchmark designed to evaluate Olympiad-level reasoning when the required evidence is distributed over multiple images. It contains problems from biology, chemistry, mathematics, and physics Olympiads, together with manually annotated rationales and evaluation protocols for both exact and semantic answer matching. Across extensive experiments on OMIBench, we observe meaningful performance gaps in existing models. Even the strongest LVLMs, such as Gemini-3-Pro, attain only about 50% on the benchmark. These results position OMIBench as a focused resource for studying and improving multi-image reasoning in LVLMs.
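
The abstract mentions evaluation under both exact and semantic answer matching but does not spell out either procedure here. The following is a minimal sketch of how such a two-track scorer could look, not the benchmark's actual implementation. All names (`normalize`, `exact_match`, `token_overlap_judge`, `score`) and the 0.8 threshold are hypothetical, and the token-overlap judge is only a toy stand-in for whatever semantic matcher the authors use.

```python
import re
from typing import Callable, Dict, Iterable, Tuple


def normalize(answer: str) -> str:
    """Lowercase, collapse whitespace, and drop trailing punctuation."""
    text = re.sub(r"\s+", " ", answer.strip().lower())
    return text.rstrip(".;, ")


def exact_match(prediction: str, reference: str) -> bool:
    """Exact track: answers must be identical after normalization."""
    return normalize(prediction) == normalize(reference)


def token_overlap_judge(prediction: str, reference: str,
                        threshold: float = 0.8) -> bool:
    """Toy semantic judge (an assumption, not the paper's method):
    Jaccard overlap of word tokens must reach the threshold."""
    pred_tokens = set(normalize(prediction).split())
    ref_tokens = set(normalize(reference).split())
    if not pred_tokens or not ref_tokens:
        return False
    overlap = len(pred_tokens & ref_tokens) / len(pred_tokens | ref_tokens)
    return overlap >= threshold


def score(pairs: Iterable[Tuple[str, str]],
          judge: Callable[[str, str], bool] = token_overlap_judge
          ) -> Dict[str, float]:
    """Return accuracy on both tracks for (prediction, reference) pairs.
    An exact match always counts as a semantic match as well."""
    pairs = list(pairs)
    if not pairs:
        return {"exact": 0.0, "semantic": 0.0}
    exact = sum(exact_match(p, r) for p, r in pairs)
    semantic = sum(exact_match(p, r) or judge(p, r) for p, r in pairs)
    return {"exact": exact / len(pairs), "semantic": semantic / len(pairs)}


if __name__ == "__main__":
    demo = [
        ("B", "b"),                              # exact after normalization
        ("the rate doubles", "rate doubles the"),  # same tokens, new order
        ("42 m/s", "41 m/s"),                    # different value
    ]
    print(score(demo))
```

The `judge` parameter is kept pluggable so that a stronger matcher, for example a model-based grader, could replace the toy overlap check without changing the scoring loop.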

Problem

Research questions and friction points this paper is trying to address.

multi-image reasoning

vision-language models

Olympiad-level reasoning

multimodal benchmark

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-image reasoning

vision-language models

Olympiad-level benchmark

OMIBench