Scaling Laws for Code: A More Data-Hungry Regime

📅 2025-10-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study addresses the critical question of whether natural-language LLM scaling laws such as Chinchilla transfer to code generation. Through a systematic empirical investigation of 117 training runs, spanning model sizes from 0.2B to 3.8B parameters and dataset sizes from 2B to 128B tokens and covering both code and natural-language data, the authors show for the first time that code models operate in a more "data-hungry" training regime: their optimal data-to-parameter ratio is significantly higher than that observed for natural language, and the Farseer scaling law achieves better fit accuracy and generalization than Chinchilla in the code domain. The core contribution is the first large-scale, empirically validated, code-specific scaling law, providing reproducible, quantitative guidance for efficiently training large code language models.

📝 Abstract
Code Large Language Models (LLMs) are revolutionizing software engineering. However, the scaling laws that guide efficient training are predominantly analyzed on Natural Language (NL). Given fundamental differences between code and NL, such as code's strict syntax, it is unclear whether these laws apply directly to code. To address this gap, we conduct the first large-scale empirical study of scaling laws for code, comprising 117 experimental runs with model sizes from 0.2B to 3.8B parameters and training tokens from 2B to 128B. We fit both the Chinchilla law and the Farseer law. First, the results show that the more expressive Farseer law offers greater accuracy. Second, the analysis reveals that Code LLMs scale effectively with model size. Crucially, code represents a more data-hungry regime, requiring a substantially higher data-to-parameter ratio than NL. Finally, two additional sets of experiments on code-NL mixtures show that NL benefits resource-constrained scenarios but becomes a detriment at higher compute budgets.
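To make the fitting procedure concrete, here is a minimal sketch of how a Chinchilla-style parametric loss, L(N, D) = E + A/N^α + B/D^β, can be fit to a set of (model size, token count, loss) runs. The data points and coefficients below are synthetic illustrations, not the paper's measurements; only the functional form follows Chinchilla.

```python
import numpy as np
from scipy.optimize import curve_fit

def chinchilla_loss(ND, E, A, B, alpha, beta):
    """Chinchilla parametric form: L(N, D) = E + A/N^alpha + B/D^beta."""
    N, D = ND
    return E + A / N**alpha + B / D**beta

# Synthetic grid of runs: 3 model sizes x 4 token counts (illustrative only).
N_vals = np.array([0.2e9, 1.0e9, 3.8e9])
D_vals = np.array([2e9, 8e9, 32e9, 128e9])
N, D = (g.ravel() for g in np.meshgrid(N_vals, D_vals))

# Generate noisy "observed" losses from assumed ground-truth coefficients.
rng = np.random.default_rng(0)
loss = chinchilla_loss((N, D), 1.7, 400.0, 300.0, 0.34, 0.28)
loss = loss + rng.normal(0.0, 0.002, loss.shape)

# Fit the five coefficients by nonlinear least squares.
popt, _ = curve_fit(chinchilla_loss, (N, D), loss,
                    p0=[1.5, 300.0, 250.0, 0.3, 0.3], maxfev=20000)
E, A, B, alpha, beta = popt
print(f"E={E:.3f}  alpha={alpha:.3f}  beta={beta:.3f}")
```

The Farseer law used in the paper is more expressive than this form; the same least-squares workflow applies, with more coefficients to fit.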
Problem

Research questions and friction points this paper is trying to address.

Investigating scaling laws applicability to code LLMs
Determining optimal data-to-parameter ratio for code training
Analyzing effects of code-NL mixtures on model performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Empirically studied scaling laws for code
Farseer law offers greater accuracy than Chinchilla
Code requires higher data-to-parameter ratio than NL
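The claim about the data-to-parameter ratio can be illustrated numerically: given a fitted loss surface L(N, D) = E + A/N^α + B/D^β and the usual compute approximation C ≈ 6·N·D, one sweeps model sizes along the fixed-compute frontier and reads off the loss-minimizing D/N. The coefficients below are illustrative assumptions, not the paper's fitted values.

```python
import numpy as np

# Assumed (illustrative) fitted coefficients of L(N, D) = E + A/N^a + B/D^b.
E, A, B, alpha, beta = 1.7, 400.0, 300.0, 0.34, 0.28

def optimal_ratio(C):
    """For compute budget C ~= 6*N*D, return the loss-minimizing (N, D, D/N)."""
    N = np.logspace(7, 11, 4000)        # candidate model sizes (params)
    D = C / (6.0 * N)                   # tokens implied by the budget
    L = E + A / N**alpha + B / D**beta  # predicted loss along the frontier
    i = np.argmin(L)
    return N[i], D[i], D[i] / N[i]

for C in [1e19, 1e21, 1e23]:
    N, D, r = optimal_ratio(C)
    print(f"C={C:.0e}  N_opt={N:.2e}  D_opt={D:.2e}  D/N={r:.1f}")
```

A "more data-hungry" regime in this framework corresponds to a loss surface whose optimal D/N is higher at a given budget, i.e. more of the compute should be spent on tokens rather than parameters.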