RepoMirage: Probing Repository Context Reasoning in Code Agents with Perturbations

📅 2026-05-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

It remains unclear whether current code agents can genuinely reason over cross-file context in repository-scale programming tasks. This work introduces RepoMirage, a two-stage evaluation suite built on SWE-Bench Verified. The first stage applies semantics-preserving, repository-level perturbations that raise the demand for context reasoning. The second stage turns the structural bottlenecks targeted by those perturbations into explicit subtasks that test an agent’s understanding of project structure; here average performance falls from 66.8% in the original setting to 25.3%. Trajectory analysis further reveals a phenomenon we term “exploration drift”: agents access broader repository context but fail to turn it into effective structural information. We also propose RepoAnchor, a structure-first workflow that separates repository exploration from downstream problem solving, and show that explicit structural scaffolding yields notable gains in contextual reasoning.

📝 Abstract

Code agents now perform strongly on repository-level software engineering benchmarks, but it remains unclear whether success on end-to-end tasks such as issue resolution truly reflects repository context reasoning: the ability to identify task-relevant information across multiple files and reason over the relations among them. To investigate this question, we introduce RepoMirage, a two-stage evaluation suite built on SWE-Bench Verified that uses perturbation as a diagnostic tool, increasing the demand for context reasoning by transforming how the repository is exposed. First, RepoMirage-Perturb applies three types of semantics-preserving repository-level perturbations, revealing a clear performance drop when a correct solution requires broader context access. Second, RepoMirage-Extend turns perturbation-targeted structural bottlenecks into explicit tasks beyond issue resolution, where average performance declines from 66.8% in the original setting to 25.3%, indicating a significant deficiency in repository context reasoning. Trajectory analysis further reveals an exploration drift, in which agents access broader repository context but fail to turn it into effective structural information. Motivated by this observation, we propose RepoAnchor, a structure-first prototype workflow that separates repository exploration from downstream problem solving, and show that explicit structural scaffolding yields notable gains. These results uncover a previously overlooked gap in repository context reasoning for code agents and suggest that stronger structure-aware methods have the potential to narrow it.
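
To put the reported decline in perspective, the short sketch below works out the absolute and relative drop implied by the two averages quoted in the abstract (66.8% and 25.3%). The function name `performance_drop` is a hypothetical helper introduced here for illustration; it is not part of RepoMirage.

```python
def performance_drop(original_pct: float, perturbed_pct: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop in percent).

    Hypothetical helper for illustration only; not taken from the paper.
    """
    absolute = original_pct - perturbed_pct
    relative = 100.0 * absolute / original_pct
    return absolute, relative


# Averages quoted in the abstract: original setting vs. RepoMirage-Extend.
absolute, relative = performance_drop(66.8, 25.3)
print(f"absolute drop: {absolute:.1f} percentage points")
print(f"relative drop: {relative:.1f}%")
```

Under these figures the decline is 41.5 percentage points, or roughly 62% of the original average.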

Problem

Research questions and friction points this paper is trying to address.

repository context reasoning

code agents

perturbation

software engineering benchmarks

context awareness

Innovation

Methods, ideas, or system contributions that make the work stand out.

repository context reasoning

code agents

perturbation-based evaluation