🤖 AI Summary
This work challenges the assumed necessity of architectural complexity in language model agents, asking whether long-context language models (LCLMs) alone can suffice for demanding software engineering tasks, such as those in SWE-bench, without multi-step retrieval, multi-agent coordination, or custom scaffolding. We propose a "zero-tool, zero-scaffolding" paradigm: directly concatenating the full environment context into the model's input, augmented by task-specific prompt engineering and, in one variant, a two-stage collaboration between Gemini-1.5-Pro and Claude-3.7. Our experiments question the prevailing "more complex is better" design heuristic. Results show that unscaffolded Gemini-1.5-Pro achieves a 38% SWE-Bench-Verified solve rate, 6 percentage points above a carefully tuned agent-scaffold baseline (32%), though still short of the strongest agentic architectures; Gemini-2.5-Pro with the same unscaffolded approach reaches 50.8%, and the two-stage variant attains 48.6%. These findings demonstrate that minimalist, context-centric architectures can deliver competitive performance on real-world, complex software engineering tasks.
📝 Abstract
Recent advances in language model (LM) agents have demonstrated significant potential for automating complex real-world tasks. To make progress on these difficult tasks, LM agent architectures have become increasingly complex, often incorporating multi-step retrieval tools, multiple agents, and scaffolding adapted to the underlying LM. In this work, we investigate whether all of this complexity is necessary, or if parts of these scaffolds can be removed on challenging tasks like SWE-bench. We show that in the case of SWE-bench, simply putting the entire environment into the context of a long context language model (LCLM) and properly prompting the model makes it competitive with carefully tuned, complex agent scaffolds. We show that a Gemini-1.5-Pro model without any scaffolding or tools achieves 38% on SWE-Bench-Verified, comparable with approaches using carefully tuned agent scaffolds (32%). While the unscaffolded approach with Gemini-1.5-Pro falls short of the strongest agentic architectures, we demonstrate that the more capable Gemini-2.5-Pro using the same unscaffolded approach directly attains a 50.8% solve rate. Additionally, a two-stage approach combining Gemini-1.5-Pro with Claude-3.7 achieves a competitive 48.6% solve rate.
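The core recipe, putting the entire environment into the model's context, is simple enough to sketch. The following is a minimal, hypothetical illustration, not the authors' implementation: it walks a repository, concatenates the issue text and all Python source files into a single prompt, and uses a crude character budget as a stand-in for the LCLM's token limit. The function name, prompt layout, and budget are all invented for illustration.

```python
import os

def build_context_prompt(repo_dir: str, issue_text: str, max_chars: int = 2_000_000) -> str:
    """Concatenate the GitHub issue and every Python file in the repo
    into one prompt for a long-context model.

    max_chars is a crude character budget standing in for the model's
    real token limit (a production version would count tokens instead).
    """
    parts = [f"# GitHub issue\n{issue_text}\n\n# Repository files\n"]
    for root, _dirs, files in os.walk(repo_dir):
        for name in sorted(files):
            if not name.endswith(".py"):
                continue  # this sketch only includes Python sources
            path = os.path.join(root, name)
            rel = os.path.relpath(path, repo_dir)
            with open(path, encoding="utf-8", errors="replace") as f:
                parts.append(f"\n### File: {rel}\n{f.read()}")
    parts.append("\n# Task\nProduce a unified diff that resolves the issue above.")
    prompt = "".join(parts)
    # Naive truncation if the repository overflows the budget; the paper's
    # point is that modern LCLM context windows make this rarely necessary.
    return prompt[:max_chars]
```

The resulting string would then be sent as a single request to the LCLM, with no retrieval step, tools, or agent loop in between.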