🤖 AI Summary
Constructing high-quality training data for deep research tasks that require multi-step reasoning is often hindered by the high cost of human annotation or by complex prerequisites. This work proposes an efficient, modular framework that avoids paid APIs and instead uses a four-stage pipeline (seed generation, question-answer synthesis, self-verification, and external verification) to scalably produce 20K high-quality samples across 15 domains. Each sample demands 4-5 reasoning steps and is verifiable against open-web sources. Using this data, the authors train ORBIT-4B, a search agent based on the Qwen3-4B model and optimized with the GRPO algorithm, which significantly outperforms existing sub-4B models on Wikipedia-based question-answering benchmarks, demonstrating the effectiveness and practicality of the proposed data-generation paradigm.
📝 Abstract
Search agents, which integrate language models (LMs) with web search, are becoming crucial for answering complex user queries. Constructing training datasets for deep research tasks, which involve multi-step retrieval and reasoning, remains challenging due to expensive human annotation or cumbersome prerequisites. In this work, we introduce ORBIT, a training dataset of 20K reasoning-intensive queries with short verifiable answers, generated using a frugal framework that does not rely on paid API services. The modular framework comprises four stages: seed creation, question-answer pair generation, and two stages of verification: self and external. ORBIT spans 15 domains, each training pair requires 4-5 reasoning steps, and external verification is performed via search over the open web. We train Qwen3-4B as the base model on ORBIT using GRPO and evaluate it on Wikipedia question-answering tasks. Extensive experimental results demonstrate that ORBIT-4B achieves strong performance among sub-4B LLMs as search agents, proving the utility of synthetic datasets. Our framework, code, and datasets are open-sourced and publicly available.
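The four-stage pipeline described in the abstract can be sketched as a simple filter chain. This is a minimal illustrative mock-up, not the authors' implementation: every function name, the `Sample` type, and the stubbed stage logic are assumptions introduced here to show how seed creation, question-answer generation, and the two verification passes would compose.

```python
# Hypothetical sketch of a four-stage data pipeline in the style ORBIT
# describes: seed creation -> QA generation -> self-verify -> external-verify.
# All names and stub logic are illustrative assumptions, not the paper's code.
from dataclasses import dataclass


@dataclass
class Sample:
    question: str
    answer: str
    verified: bool = False


def create_seed(domain: str) -> str:
    # Stage 1: choose a seed topic/entity for the given domain (stubbed).
    return f"seed entity in {domain}"


def generate_qa(seed: str) -> Sample:
    # Stage 2: an open LM would synthesize a multi-hop question with a
    # short verifiable answer; stubbed with placeholder text here.
    return Sample(question=f"Multi-hop question about {seed}?",
                  answer="short verifiable answer")


def self_verify(sample: Sample) -> bool:
    # Stage 3: the generator re-answers its own question and checks
    # consistency with the stored answer; stubbed as a non-empty check.
    return bool(sample.answer)


def external_verify(sample: Sample) -> bool:
    # Stage 4: an open-web search would confirm the answer is actually
    # recoverable from online sources; stubbed here.
    return bool(sample.question)


def build_dataset(domains: list[str], per_domain: int = 1) -> list[Sample]:
    # Only samples that pass both verification stages are kept.
    dataset = []
    for domain in domains:
        for _ in range(per_domain):
            sample = generate_qa(create_seed(domain))
            if self_verify(sample) and external_verify(sample):
                sample.verified = True
                dataset.append(sample)
    return dataset


data = build_dataset(["history", "science"])
```

In the real framework each stub would call an open-weight LM or a search backend, and rejected samples would simply be discarded, which is what makes the pipeline cheap to scale to 20K samples.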