🤖 AI Summary
Existing instruction-tuning datasets lack controllable difficulty and rigorous quality assurance, hindering long-horizon web reasoning; moreover, data efficacy is often conflated with training dynamics, impeding independent evaluation. Method: We propose a dual-path controllable data synthesis framework: (i) knowledge-graph-guided task generation and (ii) multi-role agent collaboration (questioning, verification, filtering) for iterative distillation, enabling fine-grained difficulty progression and factual-consistency validation. Crucially, we decouple data construction from model training to enable standalone data quality assessment, a first in the web agent domain. Results: Experiments show that our synthesized dataset, despite its smaller scale, achieves twice the tool-call diversity of existing datasets and substantially reduces redundant API invocations. Web agents trained on it attain state-of-the-art performance across multiple benchmarks and demonstrate markedly improved long-horizon reasoning capabilities.
📝 Abstract
Web-based 'deep research' agents aim to solve complex question-answering tasks through long-horizon interactions with online tools. These tasks remain challenging, as the underlying language models are often not optimized for long-horizon reasoning and exploration. Prior work has proposed workflows for constructing instruction-tuning datasets, often leveraging knowledge graphs. However, such methods typically lack fine-grained control over difficulty and quality, yielding synthetic data that falls short of capturing the complexity required for long-horizon reasoning. Furthermore, many studies conflate data and training effects by comparing models trained under different optimization recipes, making it difficult to isolate and evaluate the effectiveness of the data itself. We introduce a two-pronged data synthesis pipeline that generates question-answer pairs by progressively increasing task complexity until a frontier baseline web agent fails. The baseline agent plays multiple roles in this process: attempting the questions, validating factuality, checking for alternative answers, and enforcing filtering. To evaluate the effectiveness of our synthesis methods, we adopt a controlled training setup based on distillation from strong web agents. Experiments across multiple web-based benchmarks show that our dataset, despite being smaller, enables the training of more effective web agents than existing datasets. In particular, our data exhibits twice the diversity in tool-use actions, allowing models trained on it to achieve stronger performance while avoiding repetitive tool-calling behaviors.
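The escalation loop described in the abstract (harden a question-answer pair until the baseline agent fails, while the same agent validates factuality) can be sketched as follows. This is a minimal illustrative sketch, not the paper's code: `StubAgent`, `complicate`, and the hop-counting difficulty model are all hypothetical placeholders for the knowledge-graph-guided composition and multi-role validation the paper actually uses.

```python
class StubAgent:
    """Hypothetical stand-in for the frontier baseline web agent.

    It both attempts questions (solves) and checks factuality (validates),
    mirroring the multiple roles the abstract assigns to one agent.
    """

    def __init__(self, capability=3):
        self.capability = capability  # max reasoning hops the agent can handle

    def solves(self, question):
        return question["hops"] <= self.capability

    def validates(self, question, answer):
        return answer is not None  # placeholder factual-consistency check


def complicate(question, answer):
    """Add one reasoning hop (placeholder for KG-guided task composition)."""
    harder = dict(question, hops=question["hops"] + 1)
    return harder, answer


def synthesize_pair(seed_q, seed_a, agent, max_rounds=10):
    """Escalate the seed pair until the agent fails on a validated question."""
    q, a = seed_q, seed_a
    for _ in range(max_rounds):
        harder_q, harder_a = complicate(q, a)
        if not agent.validates(harder_q, harder_a):
            break  # escalation broke factuality; stop here
        q, a = harder_q, harder_a
        if not agent.solves(q):
            return q, a  # hard enough: validated, but the agent fails
    return None  # discard: never reached the agent's failure point


pair = synthesize_pair({"hops": 1}, "Paris", StubAgent(capability=3))
# With a 3-hop-capable stub, escalation stops at the first 4-hop question.
```

The filtering role from the abstract corresponds to the `return None` path: pairs that never reach the agent's failure point, or that fail validation, are dropped rather than kept at an uncontrolled difficulty.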