🤖 AI Summary
This study addresses the severe imbalance in the distribution of programming languages across existing code corpora and the absence of a systematic resource-tiering framework. It proposes the first reproducible, four-tier classification system for programming language resource abundance, based on token-level statistics from seven major code corpora, enabling quantitative assessment of 646 languages. The analysis reveals that only 1.9% of languages—classified as high-resource—account for 74.6% of all code tokens, while the 71.7% of languages classified as low-resource together account for less than 1.0%. This tiered framework establishes the first standardized benchmark for data curation and evaluation of multilingual code generation models.
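A tier assignment like the one described can be pictured as a threshold mapping from per-language token counts to tier labels. The sketch below is illustrative only: the cutoff values are assumptions, not the paper's actual boundaries, which are derived from token-level statistics across the seven corpora.

```python
def assign_tiers(token_counts, thresholds=(1e10, 1e8, 1e6)):
    """Map {language: token_count} to {language: tier}.

    Tier 3 (High) has the most tokens; Tier 0 (Scarce) the fewest.
    The threshold values are hypothetical, for illustration only.
    """
    tiers = {}
    for lang, tokens in token_counts.items():
        if tokens >= thresholds[0]:
            tiers[lang] = 3  # High
        elif tokens >= thresholds[1]:
            tiers[lang] = 2  # Medium
        elif tokens >= thresholds[2]:
            tiers[lang] = 1  # Low
        else:
            tiers[lang] = 0  # Scarce
    return tiers

# Made-up counts for three languages, purely to show the mapping:
counts = {"Python": 5e11, "COBOL": 3e7, "Zig": 5e5}
print(assign_tiers(counts))  # {'Python': 3, 'COBOL': 1, 'Zig': 0}
```

The key property of such a scheme is that it is reproducible: given the same corpora and thresholds, anyone can recompute the tier of any of the 646 languages.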
📝 Abstract
The world's 7,000+ languages vary widely in the availability of resources for NLP, motivating efforts to systematically categorize them by their degree of resourcefulness (Joshi et al., 2020). A similar disparity exists among programming languages (PLs); however, no resource-tier taxonomy has been established for code. As large language models (LLMs) grow increasingly capable of generating code, such a taxonomy becomes essential. To fill this gap, we present the first reproducible PL resource classification, grouping 646 languages into four tiers. We show that only 1.9% of languages (Tier 3, High) account for 74.6% of all tokens in seven major corpora, while 71.7% of languages (Tier 0, Scarce) contribute just 1.0%. Statistical analyses of within-tier inequality, dispersion, and distributional skew confirm that this imbalance is both extreme and systematic. Our results provide a principled framework for dataset curation and tier-aware evaluation of multilingual LLMs.
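The abstract's "within-tier inequality" analysis can be illustrated with a standard concentration measure. Whether the paper uses the Gini coefficient specifically is not stated here; the sketch below simply shows how such a statistic quantifies the extreme skew described (a few languages holding most tokens).

```python
def gini(values):
    """Gini coefficient of non-negative counts:
    0.0 = perfect equality, values near 1.0 = extreme concentration."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if total == 0:
        return 0.0
    # Standard formula: G = 2 * sum(i * x_i) / (n * total) - (n + 1) / n
    cum = sum((i + 1) * x for i, x in enumerate(xs))
    return 2 * cum / (n * total) - (n + 1) / n

print(gini([1, 1, 1, 1]))    # 0.0  — tokens spread evenly
print(gini([0, 0, 0, 100]))  # 0.75 — one language holds everything
```

Applied to per-language token counts within a tier, a high Gini value would confirm that the imbalance is not only between tiers but also inside them.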