🤖 AI Summary
Existing code question-answering (CQA) benchmarks focus on isolated, small-scale code snippets and thus fail to capture real-world challenges such as cross-file navigation, architectural comprehension, and long-range dependency reasoning in large software repositories. Method: We introduce SWE-QA, the first CQA benchmark grounded in real-world software repositories, comprising 576 high-quality question-answer pairs spanning complex scenarios such as cross-file inference and multi-hop dependency analysis. We establish a systematic, two-level taxonomy of repository-level questions, curating data from GitHub issues and performing rigorous human annotation. We further propose SWE-QA-Agent, an agentic framework that combines multi-step reasoning with tool-augmented execution for automated answer generation. Contribution/Results: Experiments with six state-of-the-art large language models show that SWE-QA exposes critical limitations of current LLMs, particularly in long-range dependency understanding, while SWE-QA-Agent achieves significant performance gains. SWE-QA thus provides both a rigorous new evaluation benchmark and a methodological foundation for research on large-scale code understanding.
📝 Abstract
Understanding and reasoning about entire software repositories is an essential capability for intelligent software engineering tools. While existing benchmarks such as CoSQA and CodeQA have advanced the field, they predominantly focus on small, self-contained code snippets. These setups fail to capture the complexity of real-world repositories, where effective understanding and reasoning often require navigating multiple files, understanding software architecture, and grounding answers in long-range code dependencies. In this paper, we present SWE-QA, a repository-level code question answering (QA) benchmark designed to facilitate research on automated QA systems in realistic code environments. SWE-QA comprises 576 high-quality question-answer pairs spanning diverse categories, including intention understanding, cross-file reasoning, and multi-hop dependency analysis. To construct SWE-QA, we first crawled 77,100 GitHub issues from 11 popular repositories. Based on an analysis of naturally occurring developer questions extracted from these issues, we developed a two-level taxonomy of repository-level questions and constructed a set of seed questions for each category. For each category, we manually curated and validated questions and collected their corresponding answers. As a prototype application, we further develop SWE-QA-Agent, an agentic framework in which LLM agents reason and act to find answers automatically. We evaluate six advanced LLMs on SWE-QA under various context augmentation strategies. Experimental results highlight the promise of LLMs, particularly our SWE-QA-Agent framework, in addressing repository-level QA, while also revealing open challenges and pointing to future research directions.