🤖 AI Summary
Existing evaluation methods based solely on test pass rates struggle to assess whether patches generated by large language models (LLMs) adhere to project-specific design constraints—such as architectural guidelines or error-handling policies—often leading to an overestimation of repair quality. This work proposes a “design-aware” paradigm for code repair evaluation, explicitly formalizing implicit design constraints and integrating them into the assessment framework. We construct a benchmark, \bench{}, by mining real-world pull requests across six repositories, yielding 495 issues and 1,787 verifiable constraints, and introduce an LLM-based validator to automatically evaluate patch compliance with these constraints. Experiments reveal that more than half of test-passing patches violate design constraints, and functional correctness shows no significant correlation with design adherence, highlighting a critical disconnect and advocating for a shift from single-metric to multidimensional evaluation standards in automated program repair.
📝 Abstract
Repository-level issue resolution benchmarks have become a standard testbed for evaluating LLM-based agents, yet success is still predominantly measured by test pass rates. In practice, however, acceptable patches must also comply with project-specific design constraints, such as architectural conventions, error-handling policies, and maintainability requirements, which are rarely encoded in tests and are often documented only implicitly in code review discussions. This paper introduces *design-aware issue resolution* and presents \bench{}, a benchmark that makes such implicit design constraints explicit and measurable. \bench{} is constructed by mining and validating design constraints from real-world pull requests, linking them to issue instances, and automatically checking patch compliance using an LLM-based verifier, yielding 495 issues and 1,787 validated constraints across six repositories, aligned with SWE-bench-Verified and SWE-bench-Pro. Experiments with state-of-the-art agents show that test-based correctness substantially overestimates patch quality: fewer than half of resolved issues are fully design-satisfying, design violations are widespread, and functional correctness exhibits negligible statistical association with design satisfaction. While providing issue-specific design guidance reduces violations, substantial non-compliance remains, highlighting a fundamental gap in current agent capabilities and motivating design-aware evaluation beyond functional correctness.
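To make the verifier pipeline described in the abstract concrete, the sketch below shows one plausible shape for an LLM-based constraint checker: each mined constraint is rendered into a prompt alongside the candidate patch, and the model's reply is parsed into a per-constraint compliance verdict. The prompt template, constraint schema, verdict format, and the stubbed `fake_llm` are all illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of an LLM-based design-constraint verifier.
# The prompt format, Constraint schema, and verdict parsing are assumptions
# for illustration; the real \bench{} verifier may differ.
from dataclasses import dataclass

@dataclass
class Constraint:
    cid: str
    text: str  # natural-language design constraint mined from a pull request

PROMPT_TEMPLATE = (
    "You are a code reviewer. Given the design constraint and the patch below,\n"
    "answer SATISFIED or VIOLATED on the first line, then a one-line reason.\n\n"
    "Constraint: {constraint}\n\nPatch:\n{patch}\n"
)

def parse_verdict(reply: str) -> bool:
    """Map the model's free-form reply to a boolean compliance verdict."""
    first_line = reply.strip().splitlines()[0].upper()
    return first_line.startswith("SATISFIED")

def check_patch(patch: str, constraints, ask_llm) -> dict:
    """Return {constraint id: compliant?} using an injected LLM call."""
    results = {}
    for c in constraints:
        reply = ask_llm(PROMPT_TEMPLATE.format(constraint=c.text, patch=patch))
        results[c.cid] = parse_verdict(reply)
    return results

# Stubbed model for demonstration: flags patches that silently swallow
# exceptions when the constraint concerns error handling.
def fake_llm(prompt: str) -> str:
    if "except Exception: pass" in prompt and "error" in prompt.lower():
        return "VIOLATED\nThe patch silently swallows exceptions."
    return "SATISFIED\nNo conflict detected."

constraints = [Constraint("C1", "Errors must be logged, never silently ignored.")]
verdicts = check_patch("try: run()\nexcept Exception: pass", constraints, fake_llm)
print(verdicts)  # → {'C1': False}
```

Injecting the LLM call as a function argument keeps the checker testable with a stub and lets the same loop run against any model backend.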