InterLV-Search: Benchmarking Interleaved Multimodal Agentic Search

📅 2026-05-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing vision-language search benchmarks use visual evidence only in the input or as the final answer, so they do not test the iterative interleaving of textual and visual information throughout a search. To address this limitation, this work proposes InterLV-Search, a benchmark of 2,061 examples spanning three levels: active visual evidence seeking, controlled offline interleaved multimodal search, and open-web interleaved multimodal search. It also includes multimodal multi-branch samples that require comparing multiple entities during the search. The first two levels are constructed with automated pipelines, while the third uses a machine-led, human-supervised open-web pipeline. An accompanying framework, InterLV-Agent, standardizes tool use, trajectory logging, and evaluation. Experiments show that current proprietary and open-source multimodal agents all score below 50% overall accuracy, highlighting challenges in visual evidence seeking, search control, and multimodal evidence integration.

📝 Abstract

Existing benchmarks for multimodal agentic search evaluate multimodal search and visual browsing, but visual evidence is either confined to the input or treated as an answer endpoint rather than part of an interleaved search trajectory. We introduce InterLV-Search, a benchmark for Interleaved Language-Vision Agentic Search, in which textual and visual evidence is repeatedly used to condition later search. It contains 2,061 examples across three levels: active visual evidence seeking, controlled offline interleaved multimodal search, and open-web interleaved multimodal search. Beyond existing benchmarks, it also includes multimodal multi-branch samples that involve comparison between multiple entities during the evidence search. We construct Level 1 and Level 2 with automated pipelines and Level 3 with a machine-led, human-supervised open-web pipeline. We further provide InterLV-Agent for standardized tool use, trajectory logging, and evaluation. Experiments on proprietary and open-source multimodal agents show that current systems remain far from solving interleaved multimodal search, with the best model below 50% overall accuracy, highlighting challenges in visual evidence seeking, search control, and multimodal evidence integration. We release the benchmark data and evaluation code at https://github.com/hbhalpha/InterLV-Search-Bench

Problem

Research questions and friction points this paper is trying to address.

multimodal agentic search

interleaved search

visual evidence seeking

multimodal benchmark

language-vision integration

Innovation

Methods, ideas, or system contributions that make the work stand out.

interleaved multimodal search

visual evidence seeking

multimodal agentic benchmark