🤖 AI Summary
Existing multimodal large language models exhibit limitations in complex multi-hop visual reasoning tasks that require deep search and the integration of external knowledge. This work proposes the Multi-hop Tool-augmented Agent (MTA-Agent), which automatically selects tools, retrieves and verifies evidence across visual and textual sources, and generates structured reasoning trajectories. The core contributions include the construction of MTA-Vision-DeepSearch, a high-quality multi-hop vision-language training dataset; the introduction of multi-stage factual-consistency verification, cache-based interaction-replay training, and large-scale multimodal data-synthesis strategies; and the first open-source, cost-effective multimodal deep-search agent framework, which can be trained without real-time tool invocation. A 32B open-source model trained on this data achieves an average accuracy of 54.63% across six benchmarks, outperforming GPT-5, Gemini-2.5-Pro, and Gemini-3-Pro, while increasing the average reasoning depth from 2.27 to 4.28 steps.
📝 Abstract
Multimodal large language models (MLLMs) have demonstrated strong capabilities in visual understanding, yet they remain limited in complex, multi-step reasoning that requires deep search and the integration of visual evidence with external knowledge. In this work, we address this challenge by constructing high-quality, verified multi-hop vision-language training data for multimodal deep-search agents. We propose a Multi-hop Tool-Augmented Agent for Evidence-based QA Synthesis (MTA-Agent), which automatically selects tools and their parameters to retrieve and validate evidence from both visual and textual sources and generates structured multi-hop question-answer trajectories. Starting from diverse VQA seed datasets, our pipeline produces a large-scale training dataset, MTA-Vision-DeepSearch, containing 21K high-quality multi-hop examples. The data is filtered through a multi-stage verification process to ensure factual consistency and answer uniqueness. Using MTA-Vision-DeepSearch, a 32B open-source multimodal search agent achieves state-of-the-art performance, reaching an average of 54.63% across six challenging benchmarks and outperforming GPT-5 (51.86%), Gemini-2.5-Pro (50.98%), and Gemini-3-Pro (54.46%) under the same tool settings. We further show that training on our data improves both reasoning depth and tool-use behavior, increasing the average number of steps from 2.27 to 4.28 and leading to more systematic and persistent search strategies. Additionally, we demonstrate that training can be performed without real-time tool calls by replaying cached interactions, significantly reducing training cost. Importantly, we present MTA-Agent as a fully open recipe for multimodal deep search: we release the entire dataset, training trajectories, and implementation details to enable reproducibility and future research on open multimodal search agents.
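The cached-interaction replay idea in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual implementation: all names (`ToolCache`, `record`, `replay`, the `image_search` tool) are hypothetical. It assumes the pipeline records each tool call and its result once during trajectory generation, then serves those results from the cache at training time so no live tool APIs are needed.

```python
# Hypothetical sketch of cache-based interaction replay (assumed design,
# not the paper's implementation): tool calls made during trajectory
# generation are recorded; at training time the same calls are answered
# from the cache, avoiding real-time tool invocation and API cost.
import hashlib
import json


class ToolCache:
    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(tool_name, args):
        # Deterministic key from the tool name and canonicalized arguments.
        payload = json.dumps({"tool": tool_name, "args": args}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def record(self, tool_name, args, result):
        # Called once, during trajectory generation (live tool invocation).
        self._store[self._key(tool_name, args)] = result

    def replay(self, tool_name, args):
        # Called during training: the cached result is returned verbatim,
        # so replaying a trajectory needs no network access.
        key = self._key(tool_name, args)
        if key not in self._store:
            raise KeyError(f"uncached tool call: {tool_name}({args})")
        return self._store[key]


# Usage: populate during generation, replay during training.
cache = ToolCache()
cache.record("image_search", {"query": "Eiffel Tower height"}, "330 m")
print(cache.replay("image_search", {"query": "Eiffel Tower height"}))  # → 330 m
```

Keying on the canonicalized (tool, arguments) pair makes replay deterministic: the trainer sees exactly the evidence the agent saw when the trajectory was generated.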