AI Summary
Existing deep research agents often suffer from error accumulation during data synthesis, trajectory construction, and reasoning in long-horizon tasks due to the absence of explicit validation mechanisms, which limits their performance. This work proposes the first end-to-end, verification-centric deep research agent framework that integrates explicit validation across three key stages: question-answer pair synthesis, training trajectory construction, and test-time reasoning. Specifically, it leverages graph-agent collaboration to generate difficulty-controllable QA pairs, constructs high-quality, verification-driven trajectories, and employs the agent itself as a runtime verifier during inference. The proposed method significantly outperforms 8B-scale models on challenging benchmarks such as BrowseComp and BrowseComp-ZH, and under a strict limit of 600 tool calls, matches or exceeds the performance of several 30B-scale models, including Tongyi DeepResearch-30B, demonstrating markedly improved reliability and efficiency in complex open-ended tasks.
Abstract
Deep research agents autonomously conduct open-ended investigations, integrating complex information retrieval with multi-step reasoning across diverse sources to solve real-world problems. Sustaining this capability on long-horizon tasks requires reliable verification during both training and inference. A major bottleneck in existing paradigms stems from the lack of explicit verification mechanisms in QA data synthesis, trajectory construction, and test-time scaling: errors introduced at each stage propagate downstream and degrade overall agent performance. To address this, we present Marco DeepResearch, a deep research agent optimized with a verification-centric framework design at three levels: **(1) QA Data Synthesis:** We introduce verification mechanisms into graph-based and agent-based QA synthesis to control question difficulty while ensuring answers are unique and correct; **(2) Trajectory Construction:** We design a verification-driven trajectory synthesis method that injects explicit verification patterns into training trajectories; and **(3) Test-time Scaling:** We use Marco DeepResearch itself as a verifier at inference time, effectively improving performance on challenging questions. Extensive experimental results demonstrate that Marco DeepResearch significantly outperforms 8B-scale deep research agents on the most challenging benchmarks, such as BrowseComp and BrowseComp-ZH. Crucially, under a maximum budget of 600 tool calls, Marco DeepResearch even surpasses or approaches several 30B-scale agents, such as Tongyi DeepResearch-30B.
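The third level, test-time verification, can be illustrated with a minimal sketch. The paper does not specify its verifier loop in this excerpt, so everything below is a hypothetical stand-in: `generate_candidate` plays the role of a research rollout and `verify` plays the role of the same agent re-checking its own answer, with acceptance on the first verified candidate and a best-effort fallback otherwise.

```python
import random

def generate_candidate(question: str, seed: int) -> dict:
    """Stand-in for one research rollout; returns a candidate answer
    with a mock confidence score (not the paper's actual agent)."""
    rng = random.Random(seed)
    return {"answer": f"candidate-{seed}", "score": rng.random()}

def verify(question: str, candidate: dict) -> bool:
    """Stand-in for the agent acting as its own runtime verifier.
    The 0.7 threshold is a placeholder acceptance criterion."""
    return candidate["score"] > 0.7

def answer_with_self_verification(question: str, max_attempts: int = 8) -> str:
    """Generate candidates until one passes verification; otherwise
    fall back to the highest-scoring attempt."""
    best = None
    for seed in range(max_attempts):
        cand = generate_candidate(question, seed)
        if verify(question, cand):
            return cand["answer"]          # verified answer: stop early
        if best is None or cand["score"] > best["score"]:
            best = cand                    # remember best unverified try
    return best["answer"]

print(answer_with_self_verification("Who founded X?"))
```

The same pattern also suggests where a tool-call budget (such as the 600-call limit used in the experiments) would be enforced: the loop simply stops generating new candidates once the budget is exhausted.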