AI Summary
Existing image-text retrieval benchmarks struggle to evaluate models' capabilities in domain-specific knowledge and complex multimodal reasoning. To address this gap, this work proposes the first multimodal retrieval benchmark structured along two axes: knowledge depth (spanning 5 major categories and 17 subcategories) and reasoning complexity (encompassing 6 types). The benchmark includes 16 visual data types and supports both multimodal and unimodal query formats. It further introduces hard negative samples, fine-grained reasoning categorization, and a reranking-rewriting enhancement strategy. Experiments across 23 state-of-the-art models reveal significant performance gaps in knowledge-intensive and reasoning-intensive tasks, with visual and spatial reasoning remaining key bottlenecks. The proposed enhancement strategies consistently yield measurable improvements.
Abstract
Existing multimodal retrieval benchmarks largely emphasize semantic matching on daily-life images and offer limited diagnostics of professional knowledge and complex reasoning. To address this gap, we introduce ARK, a benchmark designed to analyze multimodal retrieval from two complementary perspectives: (i) knowledge domains (five domains with 17 subtypes), which characterize the content and expertise retrieval relies on, and (ii) reasoning skills (six categories), which characterize the type of inference over multimodal evidence required to identify the correct candidate. Specifically, ARK evaluates retrieval with both unimodal and multimodal queries and candidates, covering 16 heterogeneous visual data types. To avoid shortcut matching during evaluation, most queries are paired with targeted hard negatives that require multi-step reasoning. We evaluate 23 representative text-based and multimodal retrievers on ARK and observe a pronounced gap between knowledge-intensive and reasoning-intensive retrieval, with fine-grained visual and spatial reasoning emerging as persistent bottlenecks. We further show that simple enhancements such as re-ranking and rewriting yield consistent improvements, but substantial headroom remains.
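As context for the two-stage setup the abstract alludes to (first-stage retrieval over a candidate pool containing hard negatives, followed by re-ranking), here is a minimal sketch in plain Python. The embeddings, candidate names, and the re-rank scorer are all hypothetical stand-ins for illustration, not ARK's actual data or models:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, candidates, top_k=2):
    """First-stage retrieval: rank all candidates by embedding similarity."""
    scored = sorted(candidates.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [name for name, _ in scored[:top_k]]

def rerank(query_vec, shortlist, candidates, scorer):
    """Second-stage re-ranking of the shortlist with a (possibly stronger) scorer."""
    return sorted(shortlist,
                  key=lambda name: scorer(query_vec, candidates[name]),
                  reverse=True)

# Toy pool: a positive target plus a "hard negative" that is
# superficially close to the query, and an unrelated easy negative.
candidates = {
    "positive":      [0.9, 0.1, 0.0],
    "hard_negative": [0.8, 0.3, 0.0],
    "easy_negative": [0.0, 0.0, 1.0],
}
query = [1.0, 0.2, 0.0]

shortlist = retrieve(query, candidates, top_k=2)  # hard negative survives stage 1
final = rerank(query, shortlist, candidates, cosine)
print(final[0])  # the positive should outrank the hard negative
```

In a real pipeline the re-rank scorer would be a heavier cross-modal model rather than cosine similarity again; the point of the sketch is only that hard negatives pass the cheap first stage and must be separated in the second.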