An Empirical Study on Influence-Based Pretraining Data Selection for Code Large Language Models

📅 2026-04-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the underexplored role of pretraining data quality in code large language models (Code-LLMs), noting that existing filtering research has focused on general text and paid little attention to programming datasets. The study adapts data-influence-score filtering, a technique widely used for general data, to code pretraining: downstream programming tasks are converted into validation sets, and the influence score of each training example is computed from its effect on the model's loss on those sets. A 1-billion-parameter Code-LLM is pretrained from scratch on 100 billion code tokens to support the empirical study. The results show that filtering based on validation-set loss can improve programming performance, and that the characteristics of beneficial training data differ significantly across downstream programming tasks.

📝 Abstract

Recent advancements in code large language models (Code-LLMs) have demonstrated remarkable capabilities in resolving programming-related tasks. Meanwhile, researchers have recognized that the quality of pre-training data is crucial for improving LLM performance. However, most existing research on pre-training data filtering has focused on general datasets, and little attention has been paid to programming datasets. In this paper, we aim to address this gap by exploring the effectiveness of a widely used general data filtering technique, i.e., data-influence-score filtering, within the context of programming-related datasets. To this end, we first introduce a method for calculating the data-influence-score for generative programming tasks, which involves transforming a variety of downstream coding tasks into validation sets and using the model's loss on these sets as a performance metric. Next, we pre-train a Code-LLM with 1 billion parameters from scratch on a dataset of 100 billion code tokens. Based on it, we conduct an extensive empirical study to evaluate the effectiveness of data-influence-score filtering methods. Specifically, we examine how well this technique improves model performance, investigate how the characteristics of beneficial training data vary across different training stages and programming tasks, and assess the feasibility of a prediction-based data-influence-score filtering method. Our findings show that data-influence-score filtering based on validation-set loss can enhance models' programming performance. Moreover, we observe that the criteria for beneficial training data differ significantly across various downstream programming tasks.
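The abstract does not give the exact formula for the data-influence-score, so the following is only a minimal sketch under stated assumptions: the score of a training example is taken to be the drop in validation loss after a single gradient step on that example, and a toy one-parameter linear model stands in for the Code-LLM. The names `influence_score`, `sgd_step`, and `select_top_k` are hypothetical and chosen for illustration only.

```python
from typing import List, Tuple

# Toy stand-in for a training or validation example: (input, target).
Example = Tuple[float, float]


def loss(w: float, data: List[Example]) -> float:
    """Mean squared error of the toy model y = w * x on `data`."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def sgd_step(w: float, example: Example, lr: float) -> float:
    """One gradient-descent step on a single training example."""
    x, y = example
    grad = 2 * (w * x - y) * x  # d/dw of (w * x - y) ** 2
    return w - lr * grad


def influence_score(w: float, example: Example,
                    validation: List[Example], lr: float = 0.1) -> float:
    """Assumed definition: validation loss before the step minus after it.

    A positive score means training on `example` lowered validation loss.
    """
    return loss(w, validation) - loss(sgd_step(w, example, lr), validation)


def select_top_k(w: float, train: List[Example],
                 validation: List[Example], k: int,
                 lr: float = 0.1) -> List[Example]:
    """Keep the k training examples with the highest influence score."""
    ranked = sorted(train,
                    key=lambda ex: influence_score(w, ex, validation, lr),
                    reverse=True)
    return ranked[:k]


# Hypothetical data: the validation set follows y = 2x, and one training
# example carries a contradictory label.
validation = [(1.0, 2.0), (2.0, 4.0)]
train = [(1.0, 2.0), (1.0, -2.0), (2.0, 4.0)]
kept = select_top_k(0.0, train, validation, k=2)
```

In this toy setting the contradictory example `(1.0, -2.0)` receives a negative score and is filtered out. Swapping in a validation set built from a different downstream task would change the ranking, which is one way to picture the task-dependent differences the abstract reports.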

Problem

Research questions and friction points this paper is trying to address.

pretraining data selection

code large language models

data filtering

programming datasets

data influence

Innovation

Methods, ideas, or system contributions that make the work stand out.

data influence score

code large language models

pretraining data selection