🤖 AI Summary
This work addresses the challenges in developing multimodal research agents capable of explicit reasoning, multi-tool invocation, and cross-modal fusion—challenges stemming from the scarcity of search-intensive multimodal QA data, the absence of effective search trajectories, and the high cost of training with online search APIs. To overcome these limitations, the authors propose Hyper-Search, a hypergraph-based question-answering generation method, integrated with the DR-TTS framework, which decomposes complex search tasks to train specialized search-tool experts and then recombines these experts via tree search for efficient trajectory exploration. Furthermore, an offline search engine supporting multiple search tools is constructed to enable agentic reinforcement learning without costly online APIs. The proposed method substantially reduces training costs while achieving state-of-the-art performance across multiple benchmarks, yielding a highly capable, cost-efficient multimodal research agent.
📝 Abstract
We aim to develop a multimodal research agent capable of explicit reasoning and planning, multi-tool invocation, and cross-modal information synthesis, enabling it to conduct deep research tasks. However, we observe three main challenges in developing such agents: (1) scarcity of search-intensive multimodal QA data, (2) lack of effective search trajectories, and (3) prohibitive cost of training with online search APIs. To tackle these challenges, we first propose Hyper-Search, a hypergraph-based QA generation method that models and connects visual and textual nodes within and across modalities, enabling the generation of search-intensive multimodal QA pairs that require invoking various search tools to solve. Second, we introduce DR-TTS, which first decomposes search-involved tasks into several categories according to search tool type and optimizes a specialized search-tool expert for each. It then recomposes these tool experts to jointly explore search trajectories via tree search, producing trajectories that successfully solve complex tasks using various search tools. Third, we build an offline search engine supporting multiple search tools, enabling agentic reinforcement learning without costly online search APIs. With these three designs, we develop MM-DeepResearch, a powerful multimodal deep research agent, and extensive experiments show its superiority across benchmarks. Code is available at https://github.com/HJYao00/MM-DeepResearch