LoCoBench-Agent: An Interactive Benchmark for LLM Agents in Long-Context Software Engineering

📅 2025-11-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing benchmarks lack systematic evaluation of the capabilities that complex software development demands of LLM agents, such as multi-turn interaction, tool utilization, and adaptive reasoning. Method: the authors introduce LoCoBench-Agent, the first interactive benchmark framework tailored for long-context (10K–1M tokens) software engineering, supporting multi-turn dialogue and eight specialized tools spanning file operations, search, and code analysis. Contribution/Results: they design nine metrics across comprehension and efficiency dimensions, revealing for the first time a negative correlation between comprehension depth and execution efficiency, and empirically showing that effective tool-selection strategies are pivotal to performance gains. Experiments indicate that mainstream models are robust to long contexts yet vary significantly in dialogue efficiency and tool-use policies. LoCoBench-Agent establishes a reproducible, quantifiable evaluation benchmark and actionable improvement paths for autonomous software development agents.

📝 Abstract
As large language models (LLMs) evolve into sophisticated autonomous agents capable of complex software development tasks, evaluating their real-world capabilities becomes critical. While existing benchmarks like LoCoBench [qiu2025locobench] assess long-context code understanding, they focus on single-turn evaluation and cannot capture the multi-turn interactive nature, tool usage patterns, and adaptive reasoning required by real-world coding agents. We introduce LoCoBench-Agent, a comprehensive evaluation framework specifically designed to assess LLM agents in realistic, long-context software engineering workflows. Our framework extends LoCoBench's 8,000 scenarios into interactive agent environments, enabling systematic evaluation of multi-turn conversations, tool usage efficiency, error recovery, and architectural consistency across extended development sessions. We also introduce an evaluation methodology with 9 metrics across comprehension and efficiency dimensions. Our framework provides agents with 8 specialized tools (file operations, search, code analysis) and evaluates them across context lengths ranging from 10K to 1M tokens, enabling precise assessment of long-context performance. Through systematic evaluation of state-of-the-art models, we reveal several key findings: (1) agents exhibit remarkable long-context robustness; (2) a comprehension-efficiency trade-off with negative correlation exists, where thorough exploration increases comprehension but reduces efficiency; and (3) conversation efficiency varies dramatically across models, with strategic tool usage patterns differentiating high-performing agents. As the first long-context LLM agent benchmark for software engineering, LoCoBench-Agent establishes a rigorous foundation for measuring agent capabilities, identifying performance gaps, and advancing autonomous software development at scale.
Problem

Research questions and friction points this paper is trying to address.

Evaluates LLM agents in realistic long-context software engineering workflows
Assesses multi-turn interactions and tool usage in coding tasks
Measures agent performance across varying context lengths up to 1M tokens
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends LoCoBench's 8,000 static scenarios into interactive agent environments
Introduces evaluation methodology with nine specialized metrics
Provides agents with eight specialized software engineering tools
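A minimal sketch of what such an interactive environment with dispatchable tools might look like. The class, method, and registry names here are illustrative assumptions covering only a few of the eight tool categories (file operations and search), not the benchmark's actual API:

```python
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class AgentEnv:
    """Hypothetical interactive environment for an LLM coding agent."""
    files: dict[str, str] = field(default_factory=dict)
    tool_calls: int = 0  # tracked so efficiency metrics can be computed later

    def read_file(self, path: str) -> str:
        return self.files.get(path, "")

    def write_file(self, path: str, content: str) -> str:
        self.files[path] = content
        return "ok"

    def search(self, query: str) -> list[str]:
        return [p for p, src in self.files.items() if query in src]

    def dispatch(self, tool: str, **kwargs: Any) -> Any:
        """Route a named tool call from the agent and count it."""
        registry: dict[str, Callable] = {
            "read_file": self.read_file,
            "write_file": self.write_file,
            "search": self.search,
        }
        self.tool_calls += 1
        return registry[tool](**kwargs)

env = AgentEnv(files={"app.py": "def main(): ..."})
hits = env.dispatch("search", query="main")
print(hits, env.tool_calls)
```

Counting calls in `dispatch` is one simple way an evaluation harness could feed efficiency metrics (tool calls per solved task) while the agent interacts over multiple turns.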