Towards Long-horizon Agentic Multimodal Search

📅 2026-04-14

🤖 AI Summary
This work addresses three key challenges in long-horizon multimodal search (visual information loss, context inflation, and high token costs) by introducing the LMM-Searcher framework. The framework uses a file-based visual representation mechanism that stores images in an external file system and maps them to lightweight textual identifiers (UIDs); a fetch-image tool then loads images on demand, supporting active perception. It also includes a data synthesis pipeline that generates queries requiring cross-modal multi-hop reasoning, from which 12K trajectories are distilled to fine-tune Qwen3-VL-Thinking-30A3B into a specialized multimodal search agent. Experiments across four benchmarks show that the approach scales to 100-turn search horizons and achieves state-of-the-art performance among open-source models on long-horizon benchmarks such as MM-BrowseComp and MMSearch-Plus, while generalizing well across different base models.

📝 Abstract
Multimodal deep search agents have shown great potential in solving complex tasks by iteratively collecting textual and visual evidence. However, managing the heterogeneous information and high token costs associated with multimodal inputs over long horizons remains a critical challenge, as existing methods often suffer from context explosion or the loss of crucial visual signals. To address this, we propose a novel Long-horizon MultiModal deep search framework, named LMM-Searcher, centered on a file-based visual representation mechanism. By offloading visual assets to an external file system and mapping them to lightweight textual identifiers (UIDs), our approach mitigates context overhead while preserving multimodal information for future access. We equip the agent with a tailored fetch-image tool, enabling a progressive, on-demand visual loading strategy for active perception. Furthermore, we introduce a data synthesis pipeline designed to generate queries requiring complex cross-modal multi-hop reasoning. Using this pipeline, we distill 12K high-quality trajectories to fine-tune Qwen3-VL-Thinking-30A3B into a specialized multimodal deep search agent. Extensive experiments across four benchmarks demonstrate that our method successfully scales to 100-turn search horizons, achieving state-of-the-art performance among open-source models on challenging long-horizon benchmarks like MM-BrowseComp and MMSearch-Plus, while also exhibiting strong generalizability across different base models. Our code will be released at https://github.com/RUCAIBox/LMM-Searcher.
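The file-based representation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class `ImageStore`, the helper `render_observation`, the `<image:UID>` placeholder syntax, and the content-hash UID scheme are all hypothetical choices made for illustration. The sketch only shows the general idea that image bytes stay on disk, the agent's context carries a short textual identifier, and a fetch call loads the image when it is actually needed.

```python
import hashlib
import tempfile
from pathlib import Path


class ImageStore:
    """Hypothetical external store: images live on disk, the context sees only UIDs."""

    def __init__(self, root: Path):
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def offload(self, image_bytes: bytes) -> str:
        # Assumption: a truncated content hash serves as the UID.
        # The source does not specify how UIDs are generated.
        uid = "img_" + hashlib.sha256(image_bytes).hexdigest()[:12]
        (self.root / uid).write_bytes(image_bytes)
        return uid

    def fetch_image(self, uid: str) -> bytes:
        # Stand-in for the fetch-image tool: load the image only on demand.
        path = self.root / uid
        if not path.is_file():
            raise KeyError(f"unknown image UID: {uid}")
        return path.read_bytes()


def render_observation(text: str, uids: list[str]) -> str:
    # The context receives short textual placeholders instead of image tokens.
    refs = " ".join(f"<image:{uid}>" for uid in uids)
    return f"{text} {refs}".strip()


store = ImageStore(Path(tempfile.mkdtemp()))
uid = store.offload(b"\x89PNG fake image payload")
observation = render_observation("Search result page with one figure.", [uid])
```

In this sketch the observation string grows by only a few characters per image regardless of image size, while the full bytes remain retrievable later through `store.fetch_image(uid)`.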
Problem

Research questions and friction points this paper is trying to address.

long-horizon
multimodal search
context explosion
visual information management
token cost
Innovation

Methods, ideas, or system contributions that make the work stand out.

file-based visual representation
on-demand visual loading
long-horizon multimodal search
cross-modal multi-hop reasoning
multimodal agent