🤖 AI Summary
Existing deep research agents rely on a single-context accumulation paradigm, leading to context suffocation and noise contamination that severely degrade performance on long-horizon tasks. This work formalizes long-horizon research as a Markov decision process and proposes an iterative deep-research paradigm featuring dynamic workspace reconstruction, continual report updating, and efficiency-aware policy optimization, enabling stable reasoning over thousands of interaction steps. Key technical contributions include Markov state reconstruction, geometrically discounted reinforcement-learning rewards, and adaptive downsampling for stable distributed training. Evaluated across six benchmarks, the approach achieves an average improvement of 14.5 percentage points over existing open-source agents. Notably, at 2048 interaction steps, performance rises from 3.5% to 42.5%, and as a prompting strategy the paradigm improves state-of-the-art large language models by up to 19.2 percentage points.
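The "geometrically discounted" reward mentioned above can be illustrated with a minimal sketch: a terminal reward is scaled by a per-step discount factor, so trajectories that solve the task in fewer interactions receive a higher return. The function name, signature, and default `gamma` here are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a geometrically discounted terminal reward for an
# efficiency-aware agent: success earns 1.0, discounted by gamma per
# interaction step, so shorter successful trajectories score higher.

def discounted_reward(success: bool, num_steps: int, gamma: float = 0.99) -> float:
    """Return the terminal reward discounted geometrically by steps taken."""
    if not success:
        return 0.0
    return gamma ** num_steps

# A 10-step solution outranks a 100-step solution to the same task:
short = discounted_reward(True, 10)    # ~0.904
long = discounted_reward(True, 100)    # ~0.366
```

Under this shaping, the optimizer is incentivized to explore efficiently rather than pad trajectories with redundant tool calls.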
📝 Abstract
Recent advances in deep-research agents have shown promise for autonomous knowledge construction through dynamic reasoning over external sources. However, existing approaches rely on a mono-contextual paradigm that accumulates all information in a single, expanding context window, leading to context suffocation and noise contamination that limit their effectiveness on long-horizon tasks. We introduce IterResearch, a novel iterative deep-research paradigm that reformulates long-horizon research as a Markov Decision Process with strategic workspace reconstruction. By maintaining an evolving report as memory and periodically synthesizing insights, our approach preserves consistent reasoning capacity across arbitrary exploration depths. We further develop Efficiency-Aware Policy Optimization (EAPO), a reinforcement learning framework that incentivizes efficient exploration through geometric reward discounting and enables stable distributed training via adaptive downsampling. Extensive experiments demonstrate that IterResearch achieves substantial improvements over existing open-source agents, with an average gain of +14.5pp across six benchmarks, and narrows the gap with frontier proprietary systems. Remarkably, our paradigm exhibits unprecedented interaction scaling, extending to 2048 interactions with dramatic performance gains (from 3.5% to 42.5%), and serves as an effective prompting strategy, improving frontier models by up to 19.2pp over ReAct on long-horizon tasks. These findings position IterResearch as a versatile solution for long-horizon reasoning, effective both as a trained agent and as a prompting paradigm for frontier models.
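The workspace-reconstruction idea in the abstract can be sketched as a loop in which each round rebuilds a compact Markov state from the task, the evolving report, and only the most recent observation, instead of appending every observation to one growing context. The `agent` and `tool` callables below are hypothetical stubs standing in for the model and its retrieval tools; none of these names come from the paper.

```python
# Minimal sketch of iterative research with Markov state reconstruction:
# the workspace passed to the agent each round is bounded in size, and
# the evolving report serves as the agent's long-term memory.

def iter_research(task, agent, tool, max_rounds=2048):
    report = ""        # evolving report acts as persistent memory
    observation = ""   # only the latest tool result is carried forward
    for _ in range(max_rounds):
        # Reconstruct a compact workspace rather than accumulating history.
        workspace = {"task": task, "report": report, "observation": observation}
        # The agent synthesizes new insights into the report and picks an action.
        report, action = agent(workspace)
        if action is None:             # agent judges the report complete
            break
        observation = tool(action)     # raw result stays out of the report until synthesized
    return report
```

Because the workspace is rebuilt every round, per-step context size stays roughly constant regardless of exploration depth, which is what allows the interaction count to scale to thousands of steps.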