🤖 AI Summary
This work addresses a key challenge in automated program repair (APR): retrieving the files relevant to a given bug from a large codebase without flooding downstream steps with irrelevant candidates. The authors propose PatchRecall, a hybrid retrieval approach that pairs semantic search over the codebase, driven by the issue description, with a lookup of files edited to resolve similar past issues. Candidates from both paths are merged and reranked into the final set. On SWE-Bench, the method achieves higher recall without significantly increasing the number of retrieved files, giving downstream repair a more complete candidate set with little added noise.
📝 Abstract
Retrieving the correct set of files from a large codebase is a crucial step in Automated Program Repair (APR). High recall is necessary to ensure that the relevant files are included, but simply increasing the number of retrieved files introduces noise and degrades efficiency. To address this tradeoff, we propose PatchRecall, a hybrid retrieval approach that balances recall with conciseness. Our method combines two complementary strategies: (1) codebase retrieval, where the current issue description is matched against the codebase to surface potentially relevant files, and (2) history-based retrieval, where similar past issues are leveraged to identify edited files as candidate targets. Candidate files from both strategies are merged and reranked to produce the final retrieval set. Experiments on SWE-Bench demonstrate that PatchRecall achieves higher recall without significantly increasing retrieved file count, enabling more effective APR.
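The two retrieval paths and the merge-and-rerank step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say which similarity model or reranker PatchRecall uses, so the sketch assumes token-overlap (Jaccard) similarity in place of semantic search and reciprocal rank fusion in place of the reranker. All function names, parameters, and the toy data are hypothetical.

```python
from collections import defaultdict


def tokens(text):
    """Lowercased whitespace tokens; a crude stand-in for an embedding."""
    return set(text.lower().split())


def similarity(a, b):
    """Jaccard overlap between two texts (assumed stand-in for semantic similarity)."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def codebase_retrieval(issue, files, k):
    """Path 1: rank files in the codebase against the current issue text."""
    scored = [(similarity(issue, body), path) for path, body in files.items()]
    hits = [(score, path) for score, path in scored if score > 0]
    hits.sort(key=lambda sp: (-sp[0], sp[1]))
    return [path for _, path in hits[:k]]


def history_retrieval(issue, past_issues, files, k):
    """Path 2: collect files edited for the k most similar past issues."""
    ranked = sorted(past_issues, key=lambda item: -similarity(issue, item[0]))
    candidates = []
    for _, edited in ranked[:k]:
        for path in edited:
            # Keep only files that still exist, and skip duplicates.
            if path in files and path not in candidates:
                candidates.append(path)
    return candidates


def hybrid_retrieve(issue, files, past_issues, k_code=2, k_hist=1, top_n=3):
    """Merge both candidate lists and rerank with reciprocal rank fusion (assumed)."""
    rankings = [
        codebase_retrieval(issue, files, k_code),
        history_retrieval(issue, past_issues, files, k_hist),
    ]
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, path in enumerate(ranking, start=1):
            fused[path] += 1.0 / (60 + rank)  # 60 is the conventional RRF constant
    return sorted(fused, key=lambda p: (-fused[p], p))[:top_n]


# Toy data: the issue text shares no words with auth/session.py,
# so only the history path can surface that file.
files = {
    "auth/login.py": "def login user password session token",
    "auth/session.py": "class store expiry refresh",
    "db/pool.py": "connection pool timeout retry",
    "ui/banner.py": "render banner html",
}
past_issues = [
    ("session token expired on login", ["auth/session.py", "auth/login.py"]),
    ("banner html broken", ["ui/banner.py"]),
]
issue = "login fails with session token error"

print(hybrid_retrieve(issue, files, past_issues))
# ['auth/login.py', 'auth/session.py']
```

In this toy case the codebase path alone returns only `auth/login.py`; the history path adds `auth/session.py` because a similar past issue touched it, and the fused list stays capped at `top_n` files. That mirrors the tradeoff the abstract describes: recall rises while the retrieved set stays small.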