🤖 AI Summary
Large language model (LLM) agents incur high computational cost and reduced efficiency on complex software engineering (SE) tasks because their input contexts grow excessively long. Method: This work systematically compares observation masking—a lightweight, rule-based context compression strategy—with mainstream LLM-driven summarization approaches. We implement a configurable observation masking mechanism within the SWE-agent framework to selectively filter historical observations, and evaluate it across multiple models on the SWE-bench Verified benchmark. Contribution/Results: Observation masking halves context-related inference cost while improving the task success rate from 53.8% to 54.8% on Qwen3-Coder 480B—matching or slightly exceeding LLM summarization baselines. Our findings demonstrate that observation masking offers superior efficiency, robustness, and practicality for context management in SE agents, establishing a simpler, more effective paradigm for real-world agent design.
📝 Abstract
Large Language Model (LLM)-based agents solve complex tasks through iterative reasoning, exploration, and tool use, a process that can result in long, expensive context histories. While state-of-the-art Software Engineering (SE) agents like OpenHands or Cursor use LLM-based summarization to tackle this issue, it is unclear whether the increased complexity offers tangible performance benefits compared to simply omitting older observations. We present a systematic comparison of these strategies within SWE-agent on SWE-bench Verified across five diverse model configurations. We find that a simple observation-masking strategy halves cost relative to a raw agent while matching, and sometimes slightly exceeding, the solve rate of LLM summarization. For example, with Qwen3-Coder 480B, masking improves the solve rate from 53.8% (raw agent) to 54.8%, while remaining competitive with summarization at lower cost. These results suggest that, at least within SWE-agent on SWE-bench Verified, the most effective and efficient context management can be the simplest. We release code and data for reproducibility.
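To make the idea concrete, here is a minimal sketch of observation masking: all but the most recent tool observations in the agent's message history are replaced with a short placeholder before the next LLM call, so the prompt stays small while the action history remains intact. The function name, message schema, window size, and placeholder string are illustrative assumptions, not the paper's exact implementation.

```python
def mask_observations(history, keep_last=2, placeholder="[observation omitted]"):
    """Return a copy of `history` where all but the last `keep_last`
    observation messages have their content replaced by `placeholder`.

    `history` is assumed to be a list of dicts with "role" and "content"
    keys; only messages with role == "observation" are ever masked.
    """
    obs_indices = [i for i, m in enumerate(history) if m["role"] == "observation"]
    to_mask = set(obs_indices[:-keep_last]) if keep_last > 0 else set(obs_indices)
    return [
        {**m, "content": placeholder} if i in to_mask else m
        for i, m in enumerate(history)
    ]

# Hypothetical agent trajectory: actions interleaved with tool observations.
history = [
    {"role": "assistant", "content": "cat setup.py"},
    {"role": "observation", "content": "<5,000 lines of file contents>"},
    {"role": "assistant", "content": "grep -rn 'bug' src/"},
    {"role": "observation", "content": "<long grep output>"},
    {"role": "assistant", "content": "python -m pytest"},
    {"role": "observation", "content": "<failing test output>"},
]

masked = mask_observations(history, keep_last=2)
# The oldest observation is masked; the two most recent are kept in full,
# and all assistant actions survive untouched.
```

Because the rule is purely positional, it costs no extra LLM calls — unlike summarization, which spends tokens to compress tokens.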