🤖 AI Summary
Test-time scaling (TTS) for complex reasoning tasks incurs prohibitive computational overhead and suffers from low efficiency. Method: This paper proposes an asymmetric verification–driven hybrid expansion framework—leveraging the inherent asymmetry that verification is significantly cheaper than generation—to jointly optimize sequential deep search and parallel candidate filtering. Key innovations include: (i) a lightweight budget-enforcement mechanism to guide resource allocation, (ii) verification-centric model enhancement (prioritizing the verifier over the generator), and (iii) a plug-and-play deep search agent architecture. Results: The method achieves efficient TTS on open-weight large models (GLM-4.5 Heavy, Tongyi-DeepResearch Heavy), attaining 54.0% and 66.0% accuracy on the BrowseComp and GAIA benchmarks, respectively; notably, Tongyi-DeepResearch Heavy achieves 69.0% on BrowseComp—surpassing state-of-the-art closed-source systems and establishing, for the first time, the viability of asymmetric verification–driven TTS.
📝 Abstract
Test-time compute can be scaled both sequentially and in parallel. Sequential scaling involves lengthening the generation process, while parallel scaling involves verifying and selecting among multiple candidate outputs. Combining these two strategies has led to the most powerful AI systems, such as Grok 4 Heavy and GPT-5 Pro. In certain contexts (e.g., solving Sudoku puzzles), verifying responses can be substantially easier than generating them. This property, referred to as *asymmetric verification*, highlights the strong potential of test-time scaling (TTS). In this work, we study both sequential and parallel TTS of deep search agents, motivated by the intuition that verification in this setting is often much easier than generation. In experiments, we first show that sequential scaling methods, such as budget forcing, can be effective initially but soon degrade performance. Leveraging asymmetric verification, however, we are able to achieve substantial improvements by allocating only a modest amount of compute to the verifier. We conduct experiments with flagship open-source models and extend them to their "Heavy" variants through TTS. These deep research agents achieve gains of up to 27 absolute points on benchmarks such as BrowseComp. Remarkably, as an open-source alternative, GLM-4.5 Heavy reaches an accuracy of **54.0%** on BrowseComp and **66.0%** on GAIA, placing it on par with the best proprietary systems such as OpenAI Deep Research. Tongyi-DeepResearch Heavy further achieves **69.0%** accuracy on BrowseComp, greatly surpassing the best proprietary results.