SWE-QA-Pro: A Representative Benchmark and Scalable Training Recipe for Repository-Level Code Understanding

📅 2026-03-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a limitation of existing code understanding benchmarks, which predominantly draw on popular repositories and overlook long-tail topics, thereby inflating model performance through memorization. To remedy this, the authors introduce SWE-QA-Pro, an executable evaluation benchmark spanning diverse, long-tail code repositories. The benchmark enforces balanced topic coverage via issue-driven clustering and calibrates difficulty by filtering out questions that direct-answer baselines can already solve, ensuring a challenging and representative assessment. The authors further propose a two-stage training framework, supervised fine-tuning (SFT) followed by reinforcement learning from AI feedback (RLAIF), powered by a scalable synthetic data generation pipeline that substantially improves small models’ agent-like reasoning and tool usage on complex coding tasks. With this recipe, Qwen3-8B outperforms GPT-4o by 2.3 percentage points on SWE-QA-Pro, significantly narrowing the gap to state-of-the-art closed-source models.

📝 Abstract
Agentic repository-level code understanding is essential for automating complex software engineering tasks, yet the field lacks reliable benchmarks. Existing evaluations often overlook long-tail topics and rely on popular repositories where Large Language Models (LLMs) can cheat via memorized knowledge. To address this, we introduce SWE-QA-Pro, a benchmark constructed from diverse, long-tail repositories with executable environments. We enforce topical balance via issue-driven clustering to cover under-represented task types and apply a rigorous difficulty calibration process: questions solvable by direct-answer baselines are filtered out. This results in a dataset where agentic workflows significantly outperform direct answering (e.g., a ~13-point gap for Claude Sonnet 4.5), confirming the necessity of agentic codebase exploration. Furthermore, to tackle the scarcity of training data for such complex behaviors, we propose a scalable synthetic data pipeline that powers a two-stage training recipe: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning from AI Feedback (RLAIF). This approach allows small open models to learn efficient tool usage and reasoning. Empirically, a Qwen3-8B model trained with our recipe surpasses GPT-4o by 2.3 points on SWE-QA-Pro and substantially narrows the gap to state-of-the-art proprietary models, demonstrating both the validity of our evaluation and the effectiveness of our agentic training workflow.
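The curation logic described in the abstract (filter out questions a direct-answer baseline already solves, then balance topic clusters from issue-driven clustering) can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the `Question` fields, the precomputed `direct_answer_ok` flag, and the `per_topic_cap` parameter are hypothetical stand-ins for the baseline-evaluation and clustering steps the paper performs.

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class Question:
    text: str
    topic: str              # cluster label from issue-driven clustering (hypothetical)
    direct_answer_ok: bool  # True if a direct-answer baseline already solves it

def calibrate(questions, per_topic_cap=2):
    """Sketch of benchmark curation: drop questions solvable without
    agentic codebase exploration, then cap each topic cluster so no
    topic dominates the final set."""
    # Difficulty calibration: keep only questions the baseline fails on.
    hard = [q for q in questions if not q.direct_answer_ok]
    # Topical balancing: admit at most `per_topic_cap` questions per cluster.
    balanced, counts = [], defaultdict(int)
    for q in hard:
        if counts[q.topic] < per_topic_cap:
            balanced.append(q)
            counts[q.topic] += 1
    return balanced
```

In this sketch a question like `Question("Where is retry logic configured?", "config", True)` would be discarded in the first pass, while the second pass prevents any single cluster (e.g., `"config"`) from contributing more than `per_topic_cap` questions.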
Problem

Research questions and friction points this paper is trying to address.

repository-level code understanding
benchmark
long-tail topics
agentic reasoning
evaluation gap
Innovation

Methods, ideas, or system contributions that make the work stand out.

repository-level code understanding
agentic reasoning
synthetic data generation
RLAIF
benchmark design