A Taxonomy of Programming Languages for Code Generation

📅 2026-03-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the severe imbalance in how programming languages are represented across existing code corpora and the absence of a systematic resource-tier taxonomy for code. It proposes the first reproducible four-tier classification of programming language resource availability, built on token-level statistics from seven major code corpora and covering 646 languages. The analysis shows that only 1.9% of languages (Tier 3, High) account for 74.6% of all code tokens, while the 71.7% of languages in the lowest tier (Tier 0, Scarce) together contribute just 1.0%. The tiered framework offers a principled basis for dataset curation and tier-aware evaluation of multilingual code generation models.
📝 Abstract
The world's 7,000+ languages vary widely in the availability of resources for NLP, motivating efforts to systematically categorize them by their degree of resourcefulness (Joshi et al., 2020). A similar disparity exists among programming languages (PLs); however, no resource-tier taxonomy has been established for code. As large language models (LLMs) grow increasingly capable of generating code, such a taxonomy becomes essential. To fill this gap, we present the first reproducible PL resource classification, grouping 646 languages into four tiers. We show that only 1.9% of languages (Tier 3, High) account for 74.6% of all tokens in seven major corpora, while 71.7% of languages (Tier 0, Scarce) contribute just 1.0%. Statistical analyses of within-tier inequality, dispersion, and distributional skew confirm that this imbalance is both extreme and systematic. Our results provide a principled framework for dataset curation and tier-aware evaluation of multilingual LLMs.
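To make the idea of share-based tiering and within-tier inequality concrete, here is a minimal Python sketch. It is not the paper's method: the abstract states neither the tier boundaries nor the inequality statistic used, so the cut-offs in `ASSUMED_THRESHOLDS`, the choice of the Gini coefficient, and the toy token counts are all illustrative assumptions.

```python
from typing import Dict, List

# ASSUMPTION: hypothetical cut-offs on a language's share of all tokens.
# The abstract does not give the paper's actual tier boundaries.
ASSUMED_THRESHOLDS = [(3, 0.1), (2, 0.01), (1, 0.001)]  # (tier, minimum share)


def token_shares(counts: Dict[str, int]) -> Dict[str, float]:
    """Convert raw per-language token counts into fractions of the total."""
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}


def assign_tier(share: float) -> int:
    """Return the highest tier whose minimum share is met, else Tier 0."""
    for tier, min_share in ASSUMED_THRESHOLDS:
        if share >= min_share:
            return tier
    return 0


def gini(values: List[float]) -> float:
    """Gini coefficient of non-negative values (0 means perfectly equal)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the ascending-sorted values.
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * weighted) / (n * total) - (n + 1) / n


# Made-up token counts, for illustration only (not taken from any corpus).
toy_counts = {
    "Python": 700_000,
    "Java": 250_000,
    "Lua": 45_000,
    "Zig": 4_950,
    "Idris": 50,
}

shares = token_shares(toy_counts)
tiers = {lang: assign_tier(share) for lang, share in shares.items()}
inequality = gini(list(shares.values()))
```

With these toy numbers, the two largest languages land in Tier 3 and the smallest in Tier 0. Applying `gini` to the shares of the languages within a single tier would give one possible measure of within-tier inequality.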
Problem

Research questions and friction points this paper is trying to address.

programming languages
resource scarcity
code generation
taxonomy
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

programming language taxonomy
resource-tier classification
code generation
multilingual LLMs
dataset curation