Scaling Laws for Code: Every Programming Language Matters

📅 2025-12-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the undermodeling of linguistic heterogeneity in multilingual code large language models (LLMs). We propose the first scaling law for multilingual code LLMs that explicitly accounts for language-specific proportions. Based on over 1,000 controlled experiments (equivalent to 336,000 H800 GPU hours), we systematically characterize the nonlinear and non-uniform impact of language composition on model performance: interpreted languages (e.g., Python) benefit disproportionately from scale-up, while syntactically similar language pairs exhibit significant synergistic gains. Building on these insights, we introduce two techniques: (i) cross-lingual token allocation optimization and (ii) parallel code translation pair augmentation. Empirical evaluation shows that, compared to uniform token allocation, our approach substantially improves average performance across languages under fixed compute budgets—yielding pronounced gains for high-value languages (e.g., Python) and reduced resource consumption for fast-saturating ones (e.g., Rust), thereby enhancing overall generalization.
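To make the token-allocation idea concrete, here is a minimal sketch: assume each language's loss follows a power law L_i(d) = E_i + A_i * d^(-alpha_i) in its token count d, and greedily hand out a fixed token budget in chunks to whichever language currently offers the largest marginal loss reduction. The constants, language names, and the greedy scheme itself are illustrative assumptions, not the paper's fitted law or optimizer.

```python
# Hypothetical sketch of cross-lingual token allocation under per-language
# power-law scaling curves L_i(d) = E_i + A_i * d**(-alpha_i).
# All constants below are made up for illustration.

def loss(E, A, alpha, d):
    """Per-language loss after training on d tokens (assumed power-law form)."""
    return E + A * d ** (-alpha)

def allocate_tokens(curves, budget, step):
    """Greedily assign `budget` tokens in chunks of `step`, always giving the
    next chunk to the language with the largest marginal loss reduction."""
    alloc = {name: step for name in curves}          # seed each language
    remaining = budget - step * len(curves)
    while remaining >= step:
        best = max(
            curves,
            key=lambda n: loss(*curves[n], alloc[n])
                          - loss(*curves[n], alloc[n] + step),
        )
        alloc[best] += step
        remaining -= step
    return alloc

# Illustrative curves: a high-utility language that keeps improving (small
# alpha, like Python in the paper's findings) vs. a fast-saturating one
# (large alpha, like Rust): steep early gains, then diminishing returns.
curves = {
    "python": (1.5, 2.0, 0.30),   # (E, A, alpha)
    "rust":   (1.7, 2.0, 0.60),
}
alloc = allocate_tokens(curves, budget=100, step=1)
```

Because both marginal-gain curves are decreasing, the greedy loop converges toward equalized marginal returns, so the fast-saturating language stops receiving tokens early while the slowly saturating one keeps absorbing budget.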

📝 Abstract
Code large language models (Code LLMs) are powerful but costly to train, and scaling laws predict their performance from model size, data, and compute. However, different programming languages (PLs) have varying impacts during pre-training that significantly affect base model performance, leading to inaccurate performance predictions. Moreover, existing work focuses on language-agnostic settings, neglecting the inherently multilingual nature of modern software development. It is therefore necessary to first investigate the scaling laws of individual PLs, and then account for their mutual influences to arrive at a final multilingual scaling law. In this paper, we present the first systematic exploration of scaling laws for multilingual code pre-training, conducting over 1,000 experiments (equivalent to 336,000+ H800 GPU hours) across multiple PLs, model sizes (0.2B to 14B parameters), and dataset sizes (up to 1T tokens). We establish comprehensive scaling laws for code LLMs across multiple PLs, revealing that interpreted languages (e.g., Python) benefit more from increased model size and data than compiled languages (e.g., Rust). The study demonstrates that multilingual pre-training provides synergistic benefits, particularly between syntactically similar PLs. Further, the parallel-pairing pre-training strategy (concatenating code snippets with their translations) significantly enhances cross-lingual abilities with favorable scaling properties. Finally, a proportion-dependent multilingual scaling law is proposed to optimally allocate training tokens by prioritizing high-utility PLs (e.g., Python), balancing high-synergy pairs (e.g., JavaScript-TypeScript), and reducing allocation to fast-saturating languages (e.g., Rust), achieving superior average performance across all PLs compared to a uniform distribution under the same compute budget.
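The parallel-pairing strategy described above concatenates a code snippet with its translation in another PL into a single pre-training document. A minimal sketch of such a pair builder is shown below; the tag format and separator are assumptions for illustration, not the paper's actual data format.

```python
# Hypothetical sketch of "parallel pairing": one training document built from
# an aligned code-translation pair, so the model sees cross-lingual context.
# The <lang:...> / <translate:...> markers are invented for this example.

def make_parallel_pair(src_lang, src_code, tgt_lang, tgt_code):
    """Concatenate a snippet and its translation into one training document."""
    return (
        f"<lang:{src_lang}>\n{src_code.strip()}\n"
        f"<translate:{tgt_lang}>\n{tgt_code.strip()}"
    )

# JavaScript-TypeScript is one of the high-synergy pairs the paper highlights.
pair = make_parallel_pair(
    "javascript", "const add = (a, b) => a + b;",
    "typescript", "const add = (a: number, b: number): number => a + b;",
)
```

Training on such concatenated documents exposes the model to aligned semantics across syntactically similar languages, which is the mechanism the abstract credits for the improved cross-lingual abilities.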
Problem

Research questions and friction points this paper is trying to address.

Investigates scaling laws for multilingual code LLMs across programming languages.
Examines how different PLs impact pre-training and model performance predictions.
Optimizes token allocation for multilingual pre-training to enhance overall performance.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Establishes scaling laws for multilingual code pre-training across languages
Reveals interpreted languages benefit more from scaling than compiled languages
Proposes proportion-dependent token allocation for optimal multilingual performance
🔎 Similar Papers
Jian Yang
Beihang University
Shawn Guo
Ubiquant
Lin Jing
Ubiquant
Wei Zhang
Beihang University
Aishan Liu
Beihang University
Chuan Hao
Ubiquant
Zhoujun Li
Beihang University
Artificial Intelligence · Natural Language Processing · Network Security
Wayne Xin Zhao
Professor, Renmin University of China
Recommender System · Natural Language Processing · Large Language Model
Xianglong Liu
Beihang University
Weifeng Lv
Beihang University
Bryan Dai
Ubiquant