Code-QA-Bench: Separating Code Reasoning from Documentation Memorization in Repository-Level QA

📅 2026-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing code question-answering benchmarks cannot tell whether a model genuinely understands code logic or merely recalls documentation and surface patterns memorized during pretraining. To address this, the authors propose an automated framework that combines an “answer-first” task generation mechanism with a three-condition evaluation paradigm (closed-book, code-only, and documented) to construct the first repository-scale code QA benchmark that explicitly disentangles code reasoning from documentation memorization. Tool-augmented agents explore the source code to produce verified gold answers and then derive questions from them; an LLM judge scores responses on accuracy, completeness, and specificity. The benchmark comprises 628 tasks (528 code-derivable and 100 doc-dependent) across 10 Python repositories. Experiments on four frontier models show that code access is the primary driver of performance (+0.23 mean gain over closed-book), documentation adds only a modest benefit (+0.071 on doc-dependent tasks), and on code-derivable tasks the code-only condition nearly matches the documented condition.

📝 Abstract

We present Code-QA-Bench, a fully automated framework for synthesizing repository-level code understanding benchmarks that separates genuine code comprehension from documentation recall and pretraining memorization. The framework makes two methodological contributions: (1) an answer-first generation pipeline where a tool-equipped agent explores source code to produce verified gold answers before deriving questions, ensuring every task is grounded in real code structure; and (2) a three-condition experimental design evaluating agents under closed-book (no repository), code-only (documentation removed), and documented (full repository) conditions, with deltas directly quantifying documentation utility and memorization. We generate 528 code-derivable and 100 doc-dependent tasks across 10 Python repositories from SWE-Bench, scored by an LLM judge on accuracy, completeness, and specificity. Experiments on four frontier models reveal that code access is the dominant factor (+0.23 mean gain over closed-book), documentation provides modest additional benefit (+0.071 on doc-dependent tasks), and code-only $\approx$ documented on code-derivable tasks, validating the design. The framework is open-source and applicable to any well-documented Python repository.
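The deltas in the three-condition design can be illustrated with a minimal sketch. It assumes each task has one judge score per condition on a 0 to 1 scale and that deltas are differences of per-condition means; the abstract does not state the scale or the exact aggregation. The names `TaskResult`, `condition_deltas`, `code_gain`, and `doc_gain` are hypothetical, and the scores are toy values, not results from the paper.

```python
from dataclasses import dataclass
from statistics import mean

# The three evaluation conditions named in the abstract.
CONDITIONS = ("closed_book", "code_only", "documented")


@dataclass(frozen=True)
class TaskResult:
    """Hypothetical record: one task scored under all three conditions."""
    task_id: str
    task_type: str  # "code_derivable" or "doc_dependent"
    scores: dict    # condition name -> judge score (assumed to lie in [0, 1])


def condition_deltas(results, task_type=None):
    """Return mean score differences between adjacent conditions.

    code_gain = mean(code_only)  - mean(closed_book)
    doc_gain  = mean(documented) - mean(code_only)
    """
    selected = [r for r in results
                if task_type is None or r.task_type == task_type]
    if not selected:
        raise ValueError("no results match the requested task type")
    means = {c: mean(r.scores[c] for r in selected) for c in CONDITIONS}
    return {
        "code_gain": means["code_only"] - means["closed_book"],
        "doc_gain": means["documented"] - means["code_only"],
    }


# Toy data for illustration only (not taken from the paper).
results = [
    TaskResult("t1", "code_derivable",
               {"closed_book": 0.25, "code_only": 0.75, "documented": 0.75}),
    TaskResult("t2", "code_derivable",
               {"closed_book": 0.5, "code_only": 0.75, "documented": 0.75}),
    TaskResult("t3", "doc_dependent",
               {"closed_book": 0.25, "code_only": 0.5, "documented": 0.75}),
]

code_derivable = condition_deltas(results, "code_derivable")
doc_dependent = condition_deltas(results, "doc_dependent")
```

In this toy data, `doc_gain` is zero on the code-derivable tasks and positive on the doc-dependent task, which is the pattern the abstract reports as validating the design. On this reading, the closed-book mean serves as a rough baseline for what a model can answer from pretraining alone.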

Problem

Research questions and friction points this paper is trying to address.

code reasoning

documentation memorization

repository-level QA

code comprehension

benchmark

Innovation

Methods, ideas, or system contributions that make the work stand out.

code reasoning

documentation memorization

answer-first generation