Memory as Action: Autonomous Context Curation for Long-Horizon Agentic Tasks

📅 2025-10-14
🤖 AI Summary
In long-horizon agent tasks, large language models (LLMs) suffer from fixed-length context windows and are prone to interference from irrelevant information, leading to working memory overload. To address this, we propose the “Memory-as-Action” framework, which formalizes memory editing—such as selection, compression, and deletion—as intrinsic, policy-network-learnable actions, enabling joint optimization of memory management and task decision-making. To mitigate trajectory fragmentation caused by memory edits, we design a Dynamic Context Policy Optimization algorithm that integrates segment-wise advantage estimation with explicit memory operations, trained end-to-end within a reinforcement learning framework. Our approach significantly reduces computational overhead, improves performance on long-horizon tasks, and demonstrates adaptive context modulation aligned with model capabilities—without requiring external memory modules or architectural modifications.

📝 Abstract
Large Language Models face challenges in long-horizon agentic tasks as their constrained memory is easily overwhelmed by distracting or irrelevant context. Existing working memory methods typically rely on external, heuristic mechanisms that are decoupled from the agent's core policy. In this work, we reframe working memory management as a learnable, intrinsic capability. We propose a novel framework, Memory-as-Action, where an agent actively manages its working memory by executing explicit editing operations as part of a unified policy. This formulation allows an agent, trained via reinforcement learning, to balance memory curation against long-term task objectives under given resource constraints. However, such memory editing actions break the standard assumption of a continuously growing prefix in LLM interactions, leading to what we call trajectory fractures. These non-prefix changes disrupt the causal continuity required by standard policy gradient methods, making those methods inapplicable. To address this, we propose a new algorithm, Dynamic Context Policy Optimization, which enables stable end-to-end reinforcement learning by segmenting trajectories at memory action points and applying trajectory-level advantages to the resulting action segments. Our results demonstrate that jointly optimizing for task reasoning and memory management in an end-to-end fashion not only reduces overall computational consumption but also improves task performance, driven by adaptive context curation strategies tailored to the model's intrinsic capabilities.
Problem

Research questions and friction points this paper is trying to address.

LLMs struggle with long-horizon tasks because their fixed context windows become cluttered with distracting or irrelevant information.
Existing working memory methods are external heuristics, decoupled from the agent's core policy.
Memory editing breaks the growing-prefix assumption of LLM interactions, fracturing trajectories and rendering standard policy gradient methods inapplicable.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learns memory management as intrinsic agent capability
Unifies memory editing actions within a single policy
Segments trajectories at memory edits for stable training
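The segmentation idea behind Dynamic Context Policy Optimization can be sketched in a few lines: a trajectory is cut at each memory-editing action (where the context prefix is rewritten), and the single trajectory-level advantage is assigned to every resulting segment. This is an illustrative sketch, not the authors' implementation; the `Step` type, field names, and baseline handling are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Step:
    tokens: list            # token ids emitted at this step
    is_memory_action: bool  # True if this step edits working memory (a non-prefix change)


def segment_trajectory(steps):
    """Split a trajectory into contiguous segments, cutting after each
    memory-editing action, since the context prefix changes there."""
    segments, current = [], []
    for step in steps:
        current.append(step)
        if step.is_memory_action:
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments


def segment_advantages(steps, trajectory_return, baseline):
    """Assign the trajectory-level advantage to every segment, so each
    segment can be optimized with an ordinary policy gradient update."""
    advantage = trajectory_return - baseline
    return [(seg, advantage) for seg in segment_trajectory(steps)]
```

Within each segment the standard growing-prefix assumption holds, which is what makes an ordinary policy gradient update applicable again.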
👥 Authors
Yuxiang Zhang — School of Computer Science and Technology, Beijing Jiaotong University
Jiangming Shu — School of Computer Science and Technology, Beijing Jiaotong University
Ye Ma — Hithink Research
Xueyuan Lin — PhD Student, HKUST(GZ) & IDEA (natural language processing, reinforcement learning, graph neural networks)
Shangxi Wu — Computer Vision, Beijing Jiaotong University (LLM agents, backdoor attacks)
Jitao Sang — School of Computer Science and Technology, Beijing Jiaotong University