OpenSearch-VL: An Open Recipe for Frontier Multimodal Search Agents

📅 2026-05-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the reproducibility challenges of state-of-the-art multimodal search agents, which stem from the lack of high-quality open-source data, transparent trajectory synthesis procedures, and complete training protocols. To overcome these limitations, we propose OpenSearch-VL, an end-to-end open-source training recipe whose data pipeline combines Wikipedia path sampling, fuzzy entity rewriting, and source-anchor visual grounding to reduce shortcuts and one-step retrieval collapse. We design a unified multimodal tool environment supporting text and image search, OCR, cropping, sharpening, super-resolution, and perspective correction, and introduce a multi-turn fatal-aware GRPO algorithm that handles cascading tool failures. Our approach achieves an average improvement of over 10 points across seven benchmarks and reaches results comparable to proprietary commercial models on several tasks. All data, code, and trained models are slated for release.

📝 Abstract

Deep search has become a crucial capability for frontier multimodal agents, enabling models to solve complex questions through active search, evidence verification, and multi-step reasoning. Despite rapid progress, top-tier multimodal search agents remain difficult to reproduce, largely due to the absence of open high-quality training data, transparent trajectory synthesis pipelines, and detailed training recipes. To this end, we introduce OpenSearch-VL, a fully open-source recipe for training frontier multimodal deep search agents with agentic reinforcement learning. First, we design a dedicated pipeline to construct high-quality training data through Wikipedia path sampling, fuzzy entity rewriting, and source-anchor visual grounding, which jointly reduce shortcuts and one-step retrieval collapse. Based on this pipeline, we curate two training datasets: SearchVL-SFT-36k for SFT and SearchVL-RL-8k for RL. In addition, we design a diverse tool environment that unifies text search, image search, OCR, cropping, sharpening, super-resolution, and perspective correction, enabling agents to combine active perception with external knowledge acquisition. Finally, we propose a multi-turn fatal-aware GRPO training algorithm that handles cascading tool failures by masking post-failure tokens while preserving useful pre-failure reasoning through one-sided advantage clamping. Built on this recipe, OpenSearch-VL delivers substantial performance gains, with over 10-point average improvements across seven benchmarks, and achieves results comparable to proprietary commercial models on several tasks. We will release all data, code, and models to support open research on multimodal deep search agents.
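
The abstract does not spell out the fatal-aware GRPO update, so the following is only a minimal sketch of one plausible reading of "masking post-failure tokens" and "one-sided advantage clamping". It assumes group-normalized advantages as in standard GRPO, a known token index at which a fatal tool failure first occurs, and a clamp that floors the advantage of pre-failure tokens at zero so that earlier reasoning is never penalized. The function names, the `fatal_steps` argument, and the choice of clamp direction are all hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def fatal_aware_token_weights(rewards, lengths, fatal_steps):
    """Per-token loss weights (advantage times mask) for one rollout group.

    rewards[i]     -- scalar reward of trajectory i
    lengths[i]     -- number of generated tokens in trajectory i
    fatal_steps[i] -- token index of the first fatal tool failure, or None
                      (hypothetical bookkeeping, assumed for illustration)
    """
    weights = []
    for adv, n, fatal in zip(group_advantages(rewards), lengths, fatal_steps):
        if fatal is None:
            # No fatal failure: every token carries the plain advantage.
            weights.append([adv] * n)
            continue
        # Assumed one-sided clamp: never push pre-failure tokens down.
        kept = max(adv, 0.0)
        # Tokens at or after the failure are masked out of the loss.
        weights.append([kept if t < fatal else 0.0 for t in range(n)])
    return weights
```

Under these assumptions, a trajectory that fails fatally and ends with a below-average reward contributes zero gradient everywhere, while a failed trajectory that still scores above average reinforces only its pre-failure tokens.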

Problem

Research questions and friction points this paper is trying to address.

multimodal search agents

reproducibility

training data

trajectory synthesis

open-source recipe

Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal search agents

agentic reinforcement learning

tool-augmented reasoning