🤖 AI Summary
Long-horizon software engineering agents operating over ultra-large codebases suffer from semantic drift, reasoning degradation, and context explosion, all stemming from uncontrolled context growth. To address this, the paper introduces the "Context-as-Tool" (CAT) paradigm, which explicitly models context management as callable, learnable tools. Methodologically: (1) we construct a structured workspace that decouples high-fidelity short-term interactions from compressed long-term memory; (2) we design CAT-GENERATOR, a trajectory-level supervision framework enabling milestone-driven proactive compression; and (3) we train SWE-Compressor, a context-aware compression model. On SWE-Bench-Verified, our approach achieves a 57.6% task success rate, significantly outperforming ReAct baselines and static compression methods while keeping long-range reasoning robust and scalable under a fixed context budget.
📝 Abstract
Agents based on large language models have recently shown strong potential on real-world software engineering (SWE) tasks that require long-horizon interaction with repository-scale codebases. However, most existing agents rely on append-only context maintenance or passively triggered compression heuristics, which often lead to context explosion, semantic drift, and degraded reasoning in long-running interactions. We propose CAT, a new context management paradigm that elevates context maintenance to a callable tool integrated into the agent's decision-making process. CAT formalizes a structured context workspace consisting of stable task semantics, condensed long-term memory, and high-fidelity short-term interactions, and enables agents to proactively compress historical trajectories into actionable summaries at appropriate milestones. To support context management for SWE agents, we propose a trajectory-level supervision framework, CAT-GENERATOR, built on an offline data construction pipeline that injects context-management actions into complete interaction trajectories. Using this framework, we train a context-aware model, SWE-Compressor. Experiments on SWE-Bench-Verified demonstrate that SWE-Compressor reaches a 57.6% solve rate and significantly outperforms ReAct-based agents and static compression baselines, while maintaining stable and scalable long-horizon reasoning under a bounded context budget.
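The three-tier workspace and the "compression as a callable tool" idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: all names (`ContextWorkspace`, `observe`, `compress`) are hypothetical, the turn-count milestone is a stand-in for the learned milestone policy, and the toy summarizer stands in for SWE-Compressor.

```python
# Illustrative sketch of the Context-as-Tool (CAT) workspace described above.
# All class/method names are hypothetical; the paper's actual interfaces are not shown.
from dataclasses import dataclass, field

@dataclass
class ContextWorkspace:
    """Structured context: stable task semantics, condensed long-term
    memory, and a high-fidelity short-term interaction buffer."""
    task: str                                            # stable task semantics
    long_term: list[str] = field(default_factory=list)   # compressed summaries
    short_term: list[str] = field(default_factory=list)  # recent raw turns
    budget: int = 8                                      # raw turns kept before compression

    def observe(self, turn: str) -> None:
        """Append a raw interaction turn to short-term memory."""
        self.short_term.append(turn)

    def compress(self, summarize) -> None:
        """The callable tool: the agent invokes this at a milestone to fold
        raw history into an actionable long-term summary."""
        if self.short_term:
            self.long_term.append(summarize(self.short_term))
            self.short_term.clear()

    def prompt(self) -> str:
        """Assemble a bounded context from all three tiers."""
        return "\n".join([self.task, *self.long_term, *self.short_term])

# Toy usage: a trivial summarizer stands in for the learned compression model.
ws = ContextWorkspace(task="Fix failing test in repo X")
for i in range(10):
    ws.observe(f"step {i}: ran tool, got output")
    if len(ws.short_term) >= ws.budget:  # crude milestone heuristic for the sketch
        ws.compress(lambda turns: f"summary of {len(turns)} steps")
```

The point of the sketch is that compression is an action the agent takes inside its loop, not a passive truncation applied when the window overflows; `prompt()` stays bounded because raw turns are replaced by summaries rather than accumulated indefinitely.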