🤖 AI Summary
Problem: Existing LLM evaluation benchmarks emphasize local, short-horizon reasoning and fail to capture the long-range reasoning required for system-level code repair, such as fixing cross-function dependencies. Manually constructing high-difficulty tasks is costly and scales poorly.
Method: We propose Breakpoint, a scalable, automated benchmarking methodology for system-level code repair. It controls task difficulty along two dimensions: local reasoning, characterized by code complexity metrics such as cyclomatic complexity, and system-level reasoning, characterized by call-graph centrality and the number of interdependent functions corrupted at once. Adversarially corrupting functions in real-world repositories produces large-scale repair tasks without human annotation.
Contribution/Results: Across more than 900 generated tasks, state-of-the-art models' success rates fall from 55% on the easiest tasks to 0% on the hardest, indicating that the generated difficulty range separates model capabilities rather than saturating. The methodology offers a scalable way to evaluate long-range software reasoning.
📝 Abstract
Benchmarks for large language models (LLMs) have predominantly assessed short-horizon, localized reasoning. Existing long-horizon suites (e.g., SWE-bench) rely on manually curated issues, so expanding or tuning difficulty demands expensive human effort, and evaluations quickly saturate. However, many real-world tasks, such as software engineering or scientific research, require agents to rapidly comprehend and manipulate novel, complex structures; evaluating these capabilities requires the ability to construct large and varied sets of problems for agents to solve. We introduce Breakpoint, a benchmarking methodology that automatically generates code-repair tasks by adversarially corrupting functions within real-world software repositories. Breakpoint systematically controls task difficulty along two dimensions: local reasoning (characterized by code complexity metrics such as cyclomatic complexity) and system-level reasoning (characterized by call-graph centrality and the number of simultaneously corrupted interdependent functions). In experiments across more than 900 generated tasks, we demonstrate that our methodology can scale to arbitrary difficulty, with state-of-the-art models' success rates ranging from 55% on the easiest tasks down to 0% on the hardest.
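To make the two difficulty dimensions concrete, here is a minimal Python sketch of how one might score functions by cyclomatic complexity and by call-graph centrality. It is an illustration under stated assumptions, not Breakpoint's implementation: the names `SOURCE`, `cyclomatic_complexity`, and `difficulty_profile` are hypothetical, complexity is approximated by counting branch nodes in the AST, and centrality is taken to be normalized in-degree (the fraction of other functions that call a given function) within a single module.

```python
import ast

# Hypothetical toy module standing in for a real repository.
SOURCE = '''
def helper(x):
    if x > 0:
        return x
    return -x

def scale(x, k):
    return helper(x) * k

def main(values):
    total = 0
    for v in values:
        if v % 2 == 0 and v > 10:
            total += scale(v, 2)
        else:
            total += helper(v)
    return total
'''

# Node types counted as decision points (a simplified approximation).
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def cyclomatic_complexity(func):
    """Approximate cyclomatic complexity: 1 plus the number of decision points."""
    score = 1
    for node in ast.walk(func):
        if isinstance(node, BRANCH_NODES):
            score += 1
        elif isinstance(node, ast.BoolOp):
            # Each extra operand of `and`/`or` adds one more path.
            score += len(node.values) - 1
    return score


def difficulty_profile(source):
    """Return {function name: {"complexity": int, "centrality": float}}."""
    tree = ast.parse(source)
    funcs = {n.name: n for n in tree.body if isinstance(n, ast.FunctionDef)}

    # Build a reverse call graph: for each function, which functions call it?
    callers = {name: set() for name in funcs}
    for name, func in funcs.items():
        for node in ast.walk(func):
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                callee = node.func.id
                if callee in funcs and callee != name:
                    callers[callee].add(name)

    # Assumed centrality measure: in-degree normalized by the other functions.
    denom = max(len(funcs) - 1, 1)
    return {
        name: {
            "complexity": cyclomatic_complexity(funcs[name]),
            "centrality": len(callers[name]) / denom,
        }
        for name in funcs
    }


profile = difficulty_profile(SOURCE)
for name, scores in profile.items():
    print(name, scores)
```

In this toy module, `helper` has low complexity but is called by every other function, while `main` is the most complex function and is called by none. A task generator following the abstract's description could select targets from different regions of this two-dimensional space, and raise system-level difficulty further by corrupting several interdependent functions at once.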