OffSeeker: Online Reinforcement Learning Is Not All You Need for Deep Research Agents

📅 2026-01-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing deep research agents either rely on costly online reinforcement learning (RL) or are held back by the scarcity of high-quality offline research trajectories. The authors propose a fully offline training paradigm centered on DeepForge, a task synthesis framework, and use it to build a dataset of 66k question-answer pairs, 33k supervised fine-tuning (SFT) trajectories, and 21k direct preference optimization (DPO) pairs. Training on these resources yields OffSeeker, an 8B-parameter model that leads among similar-sized agents across six benchmarks and remains competitive with 30B-parameter systems trained with online RL, while avoiding the expense of online training.

📝 Abstract

Deep research agents have shown remarkable potential in handling long-horizon tasks. However, state-of-the-art performance typically relies on online reinforcement learning (RL), which is financially expensive due to extensive API calls. While offline training offers a more efficient alternative, its progress is hindered by the scarcity of high-quality research trajectories. In this paper, we demonstrate that expensive online reinforcement learning is not all you need to build powerful research agents. To bridge this gap, we introduce a fully open-source suite designed for effective offline training. Our core contributions include DeepForge, a ready-to-use task synthesis framework that generates large-scale research queries without heavy preprocessing; and a curated collection of 66k QA pairs, 33k SFT trajectories, and 21k DPO pairs. Leveraging these resources, we train OffSeeker (8B), a model developed entirely offline. Extensive evaluations across six benchmarks show that OffSeeker not only leads among similar-sized agents but also remains competitive with 30B-parameter systems trained via heavy online RL.
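The abstract mentions training on DPO pairs but does not spell out the objective. As a rough illustration, the sketch below computes the standard DPO loss for a single preference pair from sequence log-probabilities. It is not taken from the paper: the function names, the `beta` value of 0.1, and the example log-probabilities are all assumptions chosen for illustration, and the authors' actual training setup may differ.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value, not reported in the abstract
) -> float:
    """Standard DPO loss for one (chosen, rejected) trajectory pair.

    Each argument is the summed log-probability of a full response
    under the trainable policy or the frozen reference model.
    """
    # How much more the policy favors each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) == softplus(-logits)
    return softplus(-logits)


# Hypothetical log-probabilities for one preference pair.
neutral = dpo_loss(-11.0, -11.0, -11.0, -11.0)
improved = dpo_loss(-10.0, -12.0, -11.0, -11.0)
```

When the policy matches the reference, the loss equals log 2. It falls below that value once the policy shifts probability toward the chosen trajectory, which is the behavior the preference pairs are meant to encourage. Because every quantity comes from logged trajectories, no environment or API calls are needed during this step.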

Problem

Research questions and friction points this paper is trying to address.

online reinforcement learning

offline training

research agents

high-quality trajectories

cost efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

offline training (SFT and DPO)

research agents

task synthesis