🤖 AI Summary
Existing evaluation methods based solely on test pass rates struggle to assess whether patches generated by large language models (LLMs) adhere to project-specific design constraints—such as architectural guidelines or error-handling policies—often leading to an overestimation of repair quality. This work proposes a “design-aware” paradigm for code repair evaluation, explicitly formalizing implicit design constraints and integrating them into the assessment framework. We construct a benchmark, \bench{}, by mining real-world pull requests across six repositories, yielding 495 issues and 1,787 verifiable constraints, and introduce an LLM-based validator to automatically evaluate patch compliance with these constraints. Experiments reveal that more than half of test-passing patches violate design constraints, and functional correctness shows no significant correlation with design adherence, highlighting a critical disconnect and advocating for a shift from single-metric to multidimensional evaluation standards in automated program repair.
📝 Abstract
Repository-level issue resolution benchmarks have become a standard testbed for evaluating LLM-based agents, yet success is still predominantly measured by test pass rates. In practice, however, acceptable patches must also comply with project-specific design constraints, such as architectural conventions, error-handling policies, and maintainability requirements, which are rarely encoded in tests and are often documented only implicitly in code review discussions. This paper introduces *design-aware issue resolution* and presents \bench{}, a benchmark that makes such implicit design constraints explicit and measurable. \bench{} is constructed by mining and validating design constraints from real-world pull requests, linking them to issue instances, and automatically checking patch compliance using an LLM-based verifier, yielding 495 issues and 1,787 validated constraints across six repositories, aligned with SWE-bench-Verified and SWE-bench-Pro. Experiments with state-of-the-art agents show that test-based correctness substantially overestimates patch quality: fewer than half of resolved issues are fully design-satisfying, design violations are widespread, and functional correctness exhibits negligible statistical association with design satisfaction. While providing issue-specific design guidance reduces violations, substantial non-compliance remains, highlighting a fundamental gap in current agent capabilities and motivating design-aware evaluation beyond functional correctness.
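To make the verifier pipeline described in the abstract concrete, the sketch below shows one plausible shape for an LLM-based constraint checker: each mined constraint is rendered into a prompt alongside the candidate patch, and the model's reply is parsed into a per-constraint compliance verdict. The prompt template, constraint schema, verdict format, and the stubbed `fake_llm` are all illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of an LLM-based design-constraint verifier.
# The prompt format, Constraint schema, and verdict parsing are assumptions
# for illustration; the real \bench{} verifier may differ.
from dataclasses import dataclass

@dataclass
class Constraint:
    cid: str
    text: str  # natural-language design constraint mined from a pull request

PROMPT_TEMPLATE = (
    "You are a code reviewer. Given the design constraint and the patch below,\n"
    "answer SATISFIED or VIOLATED on the first line, then a one-line reason.\n\n"
    "Constraint: {constraint}\n\nPatch:\n{patch}\n"
)

def parse_verdict(reply: str) -> bool:
    """Map the model's free-form reply to a boolean compliance verdict."""
    first_line = reply.strip().splitlines()[0].upper()
    return first_line.startswith("SATISFIED")

def check_patch(patch: str, constraints, ask_llm) -> dict:
    """Return {constraint id: compliant?} using an injected LLM call."""
    results = {}
    for c in constraints:
        reply = ask_llm(PROMPT_TEMPLATE.format(constraint=c.text, patch=patch))
        results[c.cid] = parse_verdict(reply)
    return results

# Stubbed model for demonstration: flags patches that silently swallow
# exceptions when the constraint concerns error handling.
def fake_llm(prompt: str) -> str:
    if "except Exception: pass" in prompt and "error" in prompt.lower():
        return "VIOLATED\nThe patch silently swallows exceptions."
    return "SATISFIED\nNo conflict detected."

constraints = [Constraint("C1", "Errors must be logged, never silently ignored.")]
verdicts = check_patch("try: run()\nexcept Exception: pass", constraints, fake_llm)
print(verdicts)  # → {'C1': False}
```

Injecting the LLM call as a function argument keeps the checker testable with a stub and lets the same loop run against any model backend.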