🤖 AI Summary
Existing benchmarks struggle to evaluate how well AI programming assistants answer questions within real-world code contexts. This work proposes the first multilingual, fine-grained code question-answering benchmark constructed from GitHub pull request comments, incorporating authentic code contexts, a human-designed scoring rubric, and a systematic evaluation of 20 leading large language models. The benchmark reveals significant limitations in current state-of-the-art models (including Grok 4, Claude Opus 4, and GPT-5), particularly in consistency, correctness, and hallucination. Even the best-performing models score below 70% overall and earn most of their points through partial credit, answering at most 2 questions completely correctly across all trials; on average, 58.3% of responses contain hallucinations. Notably, model performance shows no correlation with cost, whether measured by API pricing or parameter count.
📝 Abstract
Programmers are turning to AI coding assistants to answer questions about their code. Benchmarks are needed to soundly evaluate these systems and understand their performance. To enable such a study, we curate a benchmark of real-world contextualized questions derived from GitHub pull request comments. Out of this work, we present RubberDuckBench: a multilingual benchmark of questions about code, along with detailed rubrics for evaluating answers. We evaluate a diverse set of 20 LLMs (proprietary and open-source) on answering these questions. We find that even state-of-the-art models fail to give consistent, correct responses across the benchmark. Grok 4 (69.29%), Claude Opus 4 (68.5%), and GPT-5 (67.8%) perform best overall, but do not exhibit pairwise significant superiority over the next 9 best-performing models. Most models obtain points through partial credit, with the best-performing models answering at most 2 questions completely correctly across all trials. Furthermore, models often hallucinate, with lies appearing in 58.3% of responses on average. Cost analysis reveals no correlation between expense (API pricing or parameter count) and performance. We intend this benchmark to be a target for future research in trustworthy and correct AI coding assistants.