🤖 AI Summary
This study addresses the challenge of memory compression for coding agents in scientific discovery tasks, where fixed context windows constrain long-term reasoning. The authors present the first systematic evaluation of eight memory compression strategies across diverse scientific domains, conducting 480 experiments using GPT-4o on 60 tasks from six domains in DiscoveryBench. Their findings reveal that memory compression methods do not significantly affect hypothesis quality; however, LLM-generated summaries incur 24–94% additional token overhead, whereas masking tool-call outputs yields a net token saving of 8.6%. Crucially, the optimal compression strategy is highly dependent on both the specific scientific domain and task length. These results provide empirical grounding and practical guidance for memory management in long-horizon autonomous scientific exploration.
📝 Abstract
Coding agents accumulate extensive context during long-running tasks, yet fixed context windows force practitioners to choose between truncation and task failure. While numerous memory condensation strategies have been proposed, from simple sliding windows to LLM-generated summaries, no systematic comparison exists to guide strategy selection, especially in scientific discovery tasks. We evaluate eight memory condensation strategies using GPT-4o on sixty DiscoveryBench tasks spanning six scientific domains (480 total evaluations). We find that no condenser significantly alters hypothesis quality, while LLM-based condensers increase token costs by 24-94 percent, and masking tool-call outputs achieves an 8.6 percent net savings. We also observe that the optimal condenser for data-driven scientific discovery varies by scientific domain and task length.