🤖 AI Summary
This work addresses the challenge that large language models often introduce disruptive modifications when adding new features across multiple files, owing to their lack of system-level architectural understanding. To mitigate this, the authors propose the RAIM framework, which first constructs a code graph to enable multi-round target localization, then generates diverse implementation candidates, and finally evaluates each candidate's system-wide impact through a combination of static and dynamic analysis, selecting the highest-quality patch that maintains architectural consistency and minimizes side effects. By moving beyond conventional single-path generation paradigms, RAIM achieves a 39.47% success rate on the NoCode-bench Verified dataset, a 36.34% relative improvement over the strongest baseline, and marks the first instance of an open-source model surpassing closed-source counterparts on this task.
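The three-stage pipeline described above (graph-based localization, multi-design generation, impact-aware selection) can be sketched at a high level. The following Python is an illustrative mock under assumed simplifications, not the authors' implementation: the `Candidate` dataclass, the `localize` and `select_best` helpers, the toy dependency graph, and the equal-weight scoring are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for RAIM's stages; not the paper's actual code.

@dataclass
class Candidate:
    patch: str
    static_score: float   # assumed proxy: dependency constraints preserved
    dynamic_score: float  # assumed proxy: regression tests passing

def localize(code_graph: dict, feature: str, rounds: int = 3) -> set:
    """Multi-round localization: expand from seed files along graph edges."""
    targets = {f for f in code_graph if feature in f}
    for _ in range(rounds):
        targets |= {dep for f in targets for dep in code_graph.get(f, [])}
    return targets

def select_best(candidates: list) -> Candidate:
    """Impact-aware selection: combine static and dynamic signals
    (equal weighting here is an arbitrary illustrative choice)."""
    return max(candidates, key=lambda c: 0.5 * c.static_score + 0.5 * c.dynamic_score)

# Toy repository graph: file -> files it depends on.
graph = {"auth.py": ["db.py"], "db.py": [], "ui.py": ["auth.py"]}
targets = localize(graph, "auth")          # -> {'auth.py', 'db.py'}
candidates = [
    Candidate("patch-A", static_score=0.9, dynamic_score=0.6),
    Candidate("patch-B", static_score=0.7, dynamic_score=0.95),
]
best = select_best(candidates)
print(sorted(targets), best.patch)
```

In this toy run, patch-B wins because its higher dynamic score (fewer regressions) outweighs patch-A's static advantage, mirroring the paper's emphasis on avoiding system regressions rather than greedily accepting the first generated patch.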
📝 Abstract
Implementing new features across an entire codebase presents a formidable challenge for Large Language Models (LLMs). Such a proactive task requires a deep understanding of the global system architecture to prevent unintended disruptions to legacy functionality. Conventional pipeline and agentic frameworks often fall short here because they suffer from architectural blindness and rely on greedy, single-path code generation. To overcome these limitations, we propose RAIM, a multi-design, architecture-aware framework for repository-level feature addition. The framework introduces a localization mechanism that conducts multi-round explorations over a repository-scale code graph to accurately pinpoint dispersed cross-file modification targets. Crucially, RAIM shifts away from linear patching by generating multiple diverse implementation designs. The system then employs a rigorous impact-aware selection process based on static and dynamic analysis to choose the most architecturally sound patch and avoid system regressions. Comprehensive experiments on the NoCode-bench Verified dataset demonstrate that RAIM establishes a new state of the art with a 39.47% success rate, a 36.34% relative improvement over the strongest baseline. Furthermore, the approach generalizes robustly across foundation models and enables open-weight models such as DeepSeek-v3.2 to surpass baseline systems powered by leading proprietary models. Detailed ablation studies confirm that the multi-design generation and impact-validation modules are critical to managing complex dependencies and reducing code errors. These findings highlight the vital role of structural awareness in automated software evolution.