SEARL: Joint Optimization of Policy and Tool Graph Memory for Self-Evolving Agents

📅 2026-04-09
📈 Citations: 0 · Influential: 0
🤖 AI Summary
This work addresses inefficient learning in resource-constrained environments, where self-evolving agents suffer from sparse rewards and lack structured memory. To overcome these challenges, the paper introduces SEARL, a framework that integrates structured tool-graph memory with policy learning in a joint optimization scheme. By modeling inter-trajectory dependencies, SEARL densifies reward signals and enables explicit knowledge extraction and cross-task transfer. Combining reinforcement learning with verifiable rewards (RLVR), state abstraction, and trajectory-correlation modeling, SEARL improves both sample efficiency and generalization on knowledge-intensive reasoning and mathematical tasks. Empirical results support its effectiveness and practicality in settings with limited computational and environmental resources.
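
The summary's central idea, a tool-graph memory updated jointly with the policy, can be made concrete with a minimal sketch. Everything below (class names, fields, the success-count heuristic) is an illustrative assumption, not SEARL's actual data model:

```python
from dataclasses import dataclass, field

# Hypothetical sketch: nodes are tools synthesized from past trajectories,
# edges record which tools were chained in runs with a verified reward.

@dataclass
class ToolNode:
    name: str            # identifier of a synthesized tool
    code: str            # executable body extracted from a past trajectory
    successes: int = 0   # times reusing this tool preceded a verified reward

@dataclass
class ToolGraphMemory:
    nodes: dict[str, ToolNode] = field(default_factory=dict)
    edges: dict[tuple[str, str], int] = field(default_factory=dict)

    def record_trajectory(self, tool_sequence: list[tuple[str, str]], reward: float):
        """Fold one (tool_name, tool_code) sequence into the graph."""
        for name, code in tool_sequence:
            node = self.nodes.setdefault(name, ToolNode(name, code))
            node.successes += int(reward > 0)
        # Count consecutive tool pairs so common compositions become edges.
        for (a, _), (b, _) in zip(tool_sequence, tool_sequence[1:]):
            self.edges[(a, b)] = self.edges.get((a, b), 0) + 1

    def retrieve(self, k: int = 3) -> list[ToolNode]:
        """Most reliable tools first, for reuse in analogous contexts."""
        return sorted(self.nodes.values(), key=lambda n: -n.successes)[:k]
```

A joint learning loop would then alternate: roll out the policy with tools retrieved from this memory, verify the outcome, and call record_trajectory so the graph and the policy improve together.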
📝 Abstract
Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) have demonstrated significant potential on single-turn reasoning tasks. With the paradigm shift toward self-evolving agentic learning, models are increasingly expected to learn from trajectories by synthesizing tools or accumulating explicit experiences. However, prevailing methods typically rely on large-scale LLMs or multi-agent frameworks, which hinders their deployment in resource-constrained environments. The inherent sparsity of outcome-based rewards poses a further challenge, as agents typically receive feedback only upon task completion. To address these limitations, we introduce SEARL, a tool-memory-based self-evolving agentic framework. Unlike approaches that directly utilize interaction experiences, our method constructs a structured experience memory that integrates planning with execution. This provides a novel state abstraction that facilitates generalization across analogous contexts, such as tool reuse. Consequently, agents extract explicit knowledge from historical data while leveraging inter-trajectory correlations to densify reward signals. We evaluate our framework on knowledge-reasoning and mathematics tasks, demonstrating its effectiveness in achieving more practical and efficient learning.
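
The abstract's claim that inter-trajectory correlations "densify reward signals" suggests granting partial credit to steps whose abstract state matches one seen in an earlier verified success. The sketch below is one guess at such a mechanism; the abstraction keyed on tool name and the fixed bonus weight are both assumptions:

```python
def abstract_state(step: dict) -> str:
    """Toy state abstraction: key a step by the tool it invoked."""
    return step["tool"]

def densify_rewards(trajectory, success_memory, outcome_reward, bonus=0.1):
    """Add per-step bonuses where a step overlaps prior verified successes."""
    dense = [0.0] * len(trajectory)
    for i, step in enumerate(trajectory):
        if abstract_state(step) in success_memory:  # seen in a verified success
            dense[i] += bonus
    dense[-1] += outcome_reward  # the original sparse, verifiable outcome signal
    return dense

# Usage: abstract states from earlier verified successes densify a new rollout.
success_memory = {"search_wiki", "compute_sum"}   # hypothetical tool names
traj = [{"tool": "search_wiki"}, {"tool": "parse_table"}, {"tool": "compute_sum"}]
print(densify_rewards(traj, success_memory, outcome_reward=1.0))  # [0.1, 0.0, 1.1]
```

The design intuition is that the verifiable outcome reward stays authoritative at the final step, while overlap bonuses give the policy gradient signal at intermediate steps it would otherwise receive no feedback on.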
Problem

Research questions and friction points this paper is trying to address.

self-evolving agents
sparse rewards
resource-constrained environments
tool reuse
trajectory learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tool-Memory
Self-Evolving Agents
Joint Optimization
Reward Densification
Structured Experience Memory