Large Language Models are Qualified Benchmark Builders: Rebuilding Pre-Training Datasets for Advancing Code Intelligence Tasks

📅 2025-04-28
📈 Citations: 1
Influential: 0
🤖 AI Summary
Pretrained code models often suffer from performance degradation due to outdated or semantically inconsistent human-written annotations. To address this, we propose replacing manual annotations with high-quality, LLM-generated ones and systematically reconstruct the CodeSearchNet pretraining dataset. We introduce a reference-free evaluation paradigm—comprising code-comment inconsistency detection and semantic code search—to rigorously assess annotation quality; this is the first empirical demonstration that LLM-generated annotations surpass human annotations in both semantic consistency and task adaptability. Retraining CodeT5 on the reconstructed dataset yields consistent improvements across code summarization, generation, and translation tasks, outperforming the original baseline. Human evaluation further confirms the superior quality of LLM-generated annotations. Our work establishes a novel paradigm for building dynamic, high-fidelity code pretraining datasets, advancing the scalability and reliability of code intelligence systems.

📝 Abstract
Pre-trained code models rely heavily on high-quality pre-training data, particularly human-written reference comments that bridge code and natural language. However, these comments often become outdated as software evolves, degrading model performance. Large language models (LLMs) excel at generating high-quality code comments. We investigate whether replacing human-written comments with LLM-generated ones improves pre-training datasets. Since standard metrics cannot assess reference comment quality, we propose two novel reference-free evaluation tasks: code-comment inconsistency detection and semantic code search. Results show that LLM-generated comments are more semantically consistent with code than human-written ones, as confirmed by manual evaluation. Leveraging this finding, we rebuild the CodeSearchNet dataset with LLM-generated comments and re-pre-train CodeT5. Evaluations demonstrate that models trained on LLM-enhanced data outperform those using original human comments in code summarization, generation, and translation tasks. This work validates rebuilding pre-training datasets with LLMs to advance code intelligence, challenging the traditional reliance on human reference comments.
Problem

Research questions and friction points this paper is trying to address.

Human-written code comments grow outdated as software evolves, degrading pre-trained code models
Standard reference-based metrics cannot assess the quality of the reference comments themselves
Pre-training datasets need higher-quality code-comment pairs to improve code intelligence performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Replace human-written comments with high-quality LLM-generated ones
Propose two reference-free evaluation tasks (code-comment inconsistency detection, semantic code search) to assess comment quality
Rebuild the CodeSearchNet dataset with LLM-generated comments and re-pre-train CodeT5 on it
👥 Authors
Kang Yang — College of Computer, National University of Defense Technology, Changsha, China
Xinjun Mao — College of Computer, National University of Defense Technology, Changsha, China
Shangwen Wang — National University of Defense Technology (Software Engineering)
Yanlin Wang — Sun Yat-sen University
Tanghaoran Zhang — National University of Defense Technology (Software Engineering)
Bo Lin — College of Computer, National University of Defense Technology, Changsha, China
Yihao Qin — National University of Defense Technology (Software Engineering)
Zhang Zhang — College of Computer, National University of Defense Technology, Changsha, China
Yao Lu — College of Computer, National University of Defense Technology, Changsha, China
Kamal Al-Sabahi — College of Banking and Financial Studies (Deep Learning, Natural Language Processing, Pattern Recognition, Data Mining)