Breakpoint: Scalable evaluation of system-level reasoning in LLM code agents

📅 2025-05-30

🤖 AI Summary

Problem: Existing LLM evaluation benchmarks emphasize local, short-horizon reasoning and fail to capture the long-range reasoning required for system-level code repair, such as fixing failures that span interdependent functions. Manually constructing high-difficulty tasks is costly and scales poorly. Method: The paper proposes Breakpoint, a scalable, automated benchmarking methodology for system-level code repair. It adversarially corrupts functions in real-world repositories to produce large numbers of repair tasks without human annotation, and it controls difficulty along two dimensions: local reasoning (code complexity metrics such as cyclomatic complexity) and system-level reasoning (call-graph centrality and the number of simultaneously corrupted interdependent functions). Contribution/Results: Across more than 900 generated tasks, state-of-the-art models' success rates fall from 55% on the easiest tasks to 0% on the hardest, indicating that the methodology can generate tasks across a wide difficulty range.

📝 Abstract

Benchmarks for large language models (LLMs) have predominantly assessed short-horizon, localized reasoning. Existing long-horizon suites (e.g. SWE-bench) rely on manually curated issues, so expanding or tuning difficulty demands expensive human effort and evaluations quickly saturate. However, many real-world tasks, such as software engineering or scientific research, require agents to rapidly comprehend and manipulate novel, complex structures dynamically; evaluating these capabilities requires the ability to construct large and varied sets of problems for agents to solve. We introduce Breakpoint, a benchmarking methodology that automatically generates code-repair tasks by adversarially corrupting functions within real-world software repositories. Breakpoint systematically controls task difficulty along two clear dimensions: local reasoning (characterized by code complexity metrics such as cyclomatic complexity) and system-level reasoning (characterized by call-graph centrality and the number of simultaneously corrupted interdependent functions). In experiments across more than 900 generated tasks we demonstrate that our methodology can scale to arbitrary difficulty, with state-of-the-art models' success rates ranging from 55% on the easiest tasks down to 0% on the hardest.
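The two difficulty dimensions named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the toy `SOURCE` module, the simplified complexity count (one plus the number of branching nodes), and the use of caller count as a stand-in for call-graph centrality are all assumptions made for illustration.

```python
import ast
from collections import defaultdict

# Hypothetical toy module standing in for a real repository file.
SOURCE = '''
def parse(x):
    if x is None:
        return 0
    return int(x)

def total(items):
    s = 0
    for it in items:
        if it:
            s += parse(it)
    return s

def report(items):
    return "sum=" + str(total(items)) + " first=" + str(parse(items[0]))
'''

# Node types treated as decision points (a simplification of the full metric).
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp, ast.BoolOp)


def cyclomatic_complexity(func):
    """Approximate cyclomatic complexity: 1 + number of branching nodes."""
    return 1 + sum(isinstance(n, BRANCH_NODES) for n in ast.walk(func))


def caller_sets(tree):
    """Map each top-level function to the set of top-level functions calling it."""
    funcs = {n.name: n for n in tree.body if isinstance(n, ast.FunctionDef)}
    callers = defaultdict(set)
    for name, func in funcs.items():
        for node in ast.walk(func):
            if (isinstance(node, ast.Call)
                    and isinstance(node.func, ast.Name)
                    and node.func.id in funcs):
                callers[node.func.id].add(name)
    return funcs, callers


tree = ast.parse(SOURCE)
funcs, callers = caller_sets(tree)

# (local complexity, number of distinct callers) per function.
metrics = {
    name: (cyclomatic_complexity(func), len(callers[name]))
    for name, func in funcs.items()
}

for name, (complexity, in_degree) in metrics.items():
    print(f"{name}: complexity={complexity}, callers={in_degree}")
```

In this toy module, `parse` has the most callers, so corrupting it would affect the most downstream code, while `total` has the highest local complexity. A real pipeline would likely use a whole-repository call graph and a more robust centrality measure.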

Problem

Research questions and friction points this paper is trying to address.

Evaluating long-horizon system-level reasoning in LLM code agents

Automatically generating scalable code-repair tasks for benchmarking

Controlling task difficulty via code complexity and system-level dependencies

Innovation

Methods, ideas, or system contributions that make the work stand out.

Automatically generates code-repair tasks adversarially

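Controls task difficulty via code complexity metrics, call-graph centrality, and the number of simultaneously corrupted functions

Scales to arbitrary difficulty across a large, varied set of generated tasks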
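As a rough illustration of how a repair task might be generated automatically, the sketch below corrupts one function in a toy module and checks that a test which previously passed now fails. The corruption strategy used here (replacing the function body with a `raise`) is an assumption, since the abstract says only that functions are adversarially corrupted. The names `corrupt`, `passes_tests`, and the toy `SOURCE` are hypothetical. It requires Python 3.9+ for `ast.unparse`.

```python
import ast

# Hypothetical toy module: `double` depends on `add`.
SOURCE = '''
def add(a, b):
    return a + b

def double(x):
    return add(x, x)
'''


def corrupt(source, target):
    """Return a copy of `source` where function `target` has its body replaced."""
    tree = ast.parse(source)
    for node in tree.body:
        if isinstance(node, ast.FunctionDef) and node.name == target:
            # Assumed corruption: drop the body and raise instead.
            node.body = [ast.parse("raise NotImplementedError('corrupted')").body[0]]
    return ast.unparse(tree)


def passes_tests(source):
    """Run a tiny stand-in test suite against the given module source."""
    namespace = {}
    try:
        exec(source, namespace)
        return namespace["double"](3) == 6
    except Exception:
        return False


corrupted = corrupt(SOURCE, "add")

print(passes_tests(SOURCE))     # original module
print(passes_tests(corrupted))  # module after corruption
```

The failing test surfaces in `double` even though the defect sits in `add`, which is the kind of cross-function dependency a repair agent would have to trace. A task is considered solved once the agent's edit makes the test suite pass again.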
