RubberDuckBench: A Benchmark for AI Coding Assistants

📅 2026-01-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks struggle to evaluate the question-answering abilities of AI programming assistants in real-world code contexts. This work introduces the first multilingual, fine-grained code question-answering benchmark constructed from GitHub pull request comments, combining authentic code context, human-designed scoring rubrics, and a systematic evaluation of 20 leading large language models. The benchmark reveals significant limitations in current state-of-the-art models, including Grok 4, Claude Opus 4, and GPT-5, particularly in consistency, correctness, and hallucination. Overall accuracy remains below 70%, the best models answer at most two questions completely correctly across all trials, and 58.3% of responses on average contain hallucinations. Notably, model performance shows no significant correlation with inference cost.

📝 Abstract
Programmers are turning to AI coding assistants to answer questions about their code. Benchmarks are needed to soundly evaluate these systems and understand their performance. To enable such a study, we curate a benchmark of real-world contextualized questions derived from GitHub pull request comments. From this work, we present RubberDuckBench: a multilingual benchmark of questions about code, along with detailed rubrics for evaluating answers. We evaluate a diverse set of 20 LLMs (proprietary and open-source) on answering these questions. We find that even state-of-the-art models fail to give consistent, correct responses across the benchmark. Grok 4 (69.29%), Claude Opus 4 (68.5%), and GPT-5 (67.8%) perform best overall, but do not exhibit pairwise significant superiority over the next 9 best-performing models. Most models obtain points through partial credit, with the best-performing models answering at most 2 questions completely correctly across all trials. Furthermore, models often hallucinate, with falsehoods in 58.3% of responses on average. Cost analysis reveals no correlation between expense (API pricing or parameter count) and performance. We intend this benchmark to be a target for future research in trustworthy and correct AI coding assistants.
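The abstract mentions rubric-based scoring with partial credit and a cost-versus-performance analysis, but the page does not include the evaluation code. The sketch below is only illustrative: the rubric fields, weights, model costs, and scores are hypothetical placeholders, and the use of a Spearman rank test is an assumption, not necessarily the authors' method.

```python
# Minimal illustrative sketch: rubric-based partial-credit scoring and a
# cost-vs-performance check. All data and rubric fields below are made up,
# not RubberDuckBench's actual format.
from scipy.stats import spearmanr

def score_answer(rubric, satisfied):
    """Partial credit: fraction of total rubric weight earned by satisfied criteria."""
    total = sum(rubric.values())
    earned = sum(w for criterion, w in rubric.items() if criterion in satisfied)
    return earned / total if total else 0.0

# Hypothetical rubric for one question: criterion -> weight.
rubric = {"identifies_root_cause": 2.0, "cites_relevant_code": 1.0, "no_hallucinated_api": 1.0}

# A response satisfying two of three criteria earns partial credit, not 0 or 1.
print(score_answer(rubric, {"identifies_root_cause", "cites_relevant_code"}))  # 0.75

# Cost analysis sketch: rank-correlate per-model cost with mean benchmark score.
costs  = [15.0, 75.0, 10.0, 3.0]      # e.g. USD per 1M output tokens (made up)
scores = [0.693, 0.685, 0.678, 0.55]  # mean rubric scores (made up)
rho, p = spearmanr(costs, scores)
print(f"Spearman rho={rho:.2f}, p={p:.2f}")  # the paper reports no cost-performance correlation
```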
Problem

Research questions and friction points this paper is trying to address.

AI coding assistants
benchmark
code understanding
hallucination
evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

benchmark
AI coding assistants
code question answering
hallucination
multilingual evaluation
Ferida Mohammad
Bryn Mawr College
Fatma Ayad
Bryn Mawr College
Petros Maniatis
Staff Research Scientist, Google
Distributed Systems, Security, Fault Tolerance, Machine Learning
Satish Chandra
Meta Platforms
Elizabeth Dinella
Bryn Mawr College