CodeSimpleQA: Scaling Factuality in Code Large Language Models

📅 2025-12-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing code benchmarks prioritize execution correctness over factual accuracy of programming knowledge, leading large language models (LLMs) to generate factually incorrect responses in code-related question answering. Method: We introduce CodeSimpleQA, the first bilingual factuality evaluation benchmark for code, covering authentic knowledge dimensions—including programming concepts, APIs, and language features—and propose a factuality alignment framework integrating supervised fine-tuning (SFT) and PPO-based reinforcement learning, trained jointly on 66M instruction instances (CodeSimpleQA-Instruct). Contribution/Results: Empirical analysis reveals systematic factual deficiencies across mainstream code LLMs; our method significantly improves base model accuracy on CodeSimpleQA. This work fills a critical gap in code-domain factuality assessment and establishes both a new benchmark and a novel alignment paradigm for developing trustworthy code LLMs.

📝 Abstract
Large language models (LLMs) have made significant strides in code generation, achieving impressive capabilities in synthesizing code snippets from natural language instructions. However, a critical challenge remains in ensuring LLMs generate factually accurate responses about programming concepts, technical implementations, and related knowledge. Most previous code-related benchmarks focus on code execution correctness, overlooking the factual accuracy of programming knowledge. To address this gap, we present CodeSimpleQA, a comprehensive bilingual benchmark designed to evaluate the factual accuracy of code LLMs in answering code-related questions. It contains carefully curated question-answer pairs in both English and Chinese, covering diverse programming languages and major computer science domains. Further, we create CodeSimpleQA-Instruct, a large-scale instruction corpus with 66M samples, and develop a post-training framework combining supervised fine-tuning and reinforcement learning. Our comprehensive evaluation of diverse LLMs reveals that even frontier LLMs struggle with code factuality. Our proposed framework demonstrates substantial improvements over the base model, underscoring the critical importance of factuality-aware alignment in developing reliable code LLMs.
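The benchmark-style evaluation described above can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the three-way verdict (CORRECT / INCORRECT / NOT_ATTEMPTED) follows the convention of SimpleQA-style factuality benchmarks, and every name here (`QAItem`, `grade`, `evaluate`) is a hypothetical stand-in. A real setup would grade answers with an LLM judge rather than the substring match used below.

```python
from dataclasses import dataclass

@dataclass
class QAItem:
    question: str
    gold_answer: str
    language: str  # "en" or "zh", mirroring the bilingual benchmark

def grade(prediction: str, gold: str) -> str:
    """Toy grader: a real pipeline would use an LLM judge here."""
    p, g = prediction.strip().lower(), gold.strip().lower()
    if not p or p in {"i don't know", "not sure"}:
        return "NOT_ATTEMPTED"
    return "CORRECT" if g in p else "INCORRECT"

def evaluate(model, items):
    """Run the model over all QA items and report verdict fractions."""
    counts = {"CORRECT": 0, "INCORRECT": 0, "NOT_ATTEMPTED": 0}
    for item in items:
        counts[grade(model(item.question), item.gold_answer)] += 1
    total = len(items)
    return {verdict: n / total for verdict, n in counts.items()}
```

The verdict fractions give an accuracy-style score per language or per knowledge dimension; aggregating over question categories would reproduce the kind of systematic comparison the paper reports across models.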
Problem

Research questions and friction points this paper is trying to address.

Evaluating factual accuracy of code LLMs
Addressing lack of factual benchmarks in programming
Improving factuality via instruction tuning and alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bilingual benchmark for code factuality evaluation
Large-scale instruction corpus with 66M samples
Post-training framework combining fine-tuning and reinforcement learning
Jian Yang
Beihang University
Wei Zhang
Beihang University
Yizhi Li
University of Manchester, M-A-P
LLM Reasoning · Post-training · Computational Music
Shawn Guo
Beihang University
Haowen Wang
Beihang University
Aishan Liu
Beihang University
Ge Zhang
M-A-P
Zili Wang
StepFun LLM Researcher & M-A-P
Large Language Models · Code Intelligence
Zhoujun Li
Beihang University
Artificial Intelligence · Natural Language Processing · Network Security
Xianglong Liu
Beihang University
Weifeng Lv
Beihang University