🤖 AI Summary
This work addresses a critical limitation of existing software engineering agents, which store memories at the level of whole task instances and consequently suffer from retrieval bias when encountering superficially similar but logically distinct subtasks. To mitigate this, the paper introduces a subtask-level memory mechanism aligned with the agent’s functional decomposition, enabling fine-grained, structurally consistent memory storage, retrieval, and updating. By refining memory granularity from whole tasks to structurally aligned subtasks, the approach reduces mismatches between stored experiences and the reasoning logic of the current stage, improving experience reuse in complex, long-horizon tasks. On SWE-bench Verified, the method yields an average Pass@1 improvement of 4.7 percentage points, with gains of up to 6.8 points on Gemini 2.5 Pro, and the performance advantage grows as the number of interaction steps increases.
📝 Abstract
Large Language Models (LLMs) have demonstrated significant potential as autonomous software engineering (SWE) agents. Recent work has further explored augmenting these agents with memory mechanisms to support long-horizon reasoning. However, these approaches typically operate at a coarse instance granularity, treating the entire problem-solving episode as the atomic unit of storage and retrieval. We empirically demonstrate that instance-level memory suffers from a fundamental granularity mismatch, resulting in misguided retrieval when tasks with similar surface descriptions require distinct reasoning logic at specific stages. To address this, we propose Structurally Aligned Subtask-Level Memory, a method that aligns memory storage, retrieval, and updating with the agent's functional decomposition. Extensive experiments on SWE-bench Verified demonstrate that our method consistently outperforms both vanilla agents and strong instance-level memory baselines across diverse backbones, improving mean Pass@1 over the vanilla agent by +4.7 pp on average (e.g., +6.8 pp on Gemini 2.5 Pro). Performance gains grow with more interaction steps, showing that leveraging past experience benefits long-horizon reasoning in complex software engineering tasks.