SFR-DeepResearch: Towards Effective Reinforcement Learning for Autonomously Reasoning Single Agents

📅 2025-09-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Multi-agent systems for deep research assign agents predefined roles and direct each step through a static workflow, leaving little room for autonomous decisions. Method: This paper proposes a simple reinforcement learning recipe that continues training reasoning-optimized open-source large language models on entirely synthetic data. Instead of fixed agent roles and static workflows, it trains an autonomous single agent that selects its next action dynamically from context, equipped with a minimal toolset of web crawling and Python tool integration. Contribution/Results: The recipe aims to strengthen agentic skills while preserving the reasoning ability of the underlying model. The best variant, SFR-DR-20B, achieves up to 28.7% on the Humanity's Last Exam benchmark.

📝 Abstract

Equipping large language models (LLMs) with complex, interleaved reasoning and tool-use capabilities has become a key focus in agentic AI research, especially with recent advances in reasoning-oriented ("thinking") models. Such capabilities are key to unlocking a number of important applications. One such application is Deep Research (DR), which requires extensive search and reasoning over many sources. Our work in this paper focuses on the development of native Autonomous Single-Agent models for DR featuring minimal web crawling and Python tool integration. Unlike multi-agent systems, where agents take up pre-defined roles and are told what to do at each step in a static workflow, an autonomous single agent determines its next action dynamically based on context, without manual directive. While prior work has proposed training recipes for base or instruction-tuned LLMs, we focus on continual reinforcement learning (RL) of reasoning-optimized models to further enhance agentic skills while preserving reasoning ability. Towards this end, we propose a simple RL recipe with entirely synthetic data, which we apply to various open-source LLMs. Our best variant SFR-DR-20B achieves up to 28.7% on the Humanity's Last Exam benchmark. In addition, we conduct key analysis experiments to provide more insights into our methodologies.
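
To make the contrast with a static workflow concrete, here is a minimal sketch of a single-agent loop in which the next action is chosen from the accumulated context at every step. It is an illustration only, not the paper's implementation: `Action`, `run_agent`, `toy_policy`, and both tool stubs are hypothetical names, the policy is a hand-written stand-in for the LLM, and the tools return canned or trivially computed output rather than touching the web.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Every name below is a hypothetical illustration, not taken from the paper.

@dataclass
class Action:
    tool: str       # "search", "python", or "answer"
    argument: str   # query, expression, or final answer text

def search_tool(query: str) -> str:
    """Stub standing in for a web search/crawl tool; returns canned text."""
    return f"[stub search results for: {query}]"

def python_tool(expression: str) -> str:
    """Stub standing in for a Python tool; evaluates simple arithmetic only."""
    allowed = set("0123456789+-*/(). ")
    if not set(expression) <= allowed:
        return "error: unsupported expression"
    return str(eval(expression, {"__builtins__": {}}, {}))

TOOLS = {"search": search_tool, "python": python_tool}

def toy_policy(context: List[str]) -> Action:
    """Stand-in for the LLM: picks the next action from the context alone."""
    observations = [c for c in context if c.startswith("observation: ")]
    if not observations:
        return Action("search", context[0])      # nothing gathered yet
    if len(observations) == 1:
        return Action("python", "6 * 7")         # compute after searching
    return Action("answer", observations[-1][len("observation: "):])

def run_agent(question: str,
              policy: Callable[[List[str]], Action],
              max_steps: int = 5) -> Tuple[str, List[str]]:
    """Run one agent; no fixed workflow dictates the order of tool calls."""
    context = [question]
    for _ in range(max_steps):
        action = policy(context)
        if action.tool == "answer":
            return action.argument, context
        observation = TOOLS[action.tool](action.argument)
        context.append(f"action: {action.tool}({action.argument})")
        context.append(f"observation: {observation}")
    return "no answer within step budget", context

answer, trace = run_agent("What is six times seven?", toy_policy)
```

In a trained agent the policy would be the model itself, and the order and number of tool calls would vary by question; the loop structure is the only part this sketch is meant to convey.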

Problem

Research questions and friction points this paper is trying to address.

Develop autonomous single-agent models for deep research

Enhance reasoning and tool-use via reinforcement learning

Enable dynamic action selection without manual directives

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses continual reinforcement learning for reasoning-optimized models

Employs synthetic data for autonomous single-agent training

Equips the agent with a minimal toolset: web crawling and Python tool integration

🔎 Similar Papers

A Role of Environmental Complexity on Representation Learning in Deep Reinforcement Learning Agents