Beyond Language Boundaries: Uncovering Programming Language Families for Code Language Models

📅 2025-12-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates deep linguistic relationships among programming languages and their utility for training and inference of multilingual code large language models (ML-Code LLMs). To address the lack of systematic cross-language syntactic characterization, we propose the first comprehensive 21-dimensional syntactic feature taxonomy and construct a syntax-aligned cross-lingual code embedding space. Hierarchical clustering in this space reveals a Go-centered, hierarchical programming language family structure. Leveraging this insight, we design three strategies: (i) syntax-aware transfer learning, (ii) language-proximity-driven curriculum learning, and (iii) centroid-guided intermediate translation modeling. Evaluated on four code intelligence tasks (code completion, translation, summarization, and defect detection), our approach consistently improves ML-Code LLM performance. Results demonstrate that the empirically derived language family structure serves as an effective inductive bias, offering a new paradigm for multilingual code modeling grounded in formal linguistic principles.

📝 Abstract
The rapid proliferation of diverse programming languages presents both opportunities and challenges for developing multilingual code LLMs. While existing techniques often train code LLMs by simply aggregating multilingual code data, few explore the deeper relationships between programming languages (PLs) and how such relationships can be utilized to optimize the training and inference of code LLMs. In this work, we investigate two fundamental questions: 1) What are the deep linguistic relationships among PLs? and 2) How can these relationships be leveraged to improve multilingual code LLMs? We propose an embedding-based framework to uncover the latent families of PLs. Our approach begins by defining 21 primary linguistic features of programming languages, such as variable definition, control structures, and method declarations, and then employs LLMs to generate feature-aligned code samples across multiple languages. By embedding these semantically parallel code snippets from 19 languages, we construct a similarity matrix and perform hierarchical clustering to uncover inherent language relationships. Our analysis reveals clear hierarchical structures among programming languages. Closely related languages form well-defined clusters (e.g., C, C++, Java, and Swift group together), while Go emerges as a central language with the highest cross-language similarity. Building on the uncovered language families, we propose three strategies to enhance multilingual LLM training: transfer learning across linguistically related languages, linguistic proximity-guided curriculum learning, and centroid-based intermediary code translation. Experiments on four code intelligence tasks demonstrate that our methods significantly improve multilingual LLM performance. This work offers a universal perspective on programming languages and advances more effective strategies for multilingual code LLM training.
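The embed-then-cluster pipeline in the abstract (parallel code snippets per language, a similarity matrix, then hierarchical clustering) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the language list is a subset of the 19 studied, and the per-language embeddings are random placeholders standing in for averaged embeddings of feature-aligned snippets.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Placeholder per-language embeddings; in the paper these would be
# derived from embedding semantically parallel code snippets.
rng = np.random.default_rng(0)
languages = ["C", "C++", "Java", "Swift", "Go", "Python"]
mat = np.stack([rng.normal(size=64) for _ in languages])

# Cosine-similarity matrix across languages.
unit = mat / np.linalg.norm(mat, axis=1, keepdims=True)
sim = unit @ unit.T

# Convert similarity to distance and run agglomerative clustering.
dist = 1.0 - sim
np.fill_diagonal(dist, 0.0)
Z = linkage(squareform(dist, checks=False), method="average")
clusters = fcluster(Z, t=2, criterion="maxclust")
print(dict(zip(languages, clusters)))
```

With real syntax-aligned embeddings, the resulting dendrogram is what exposes family groupings such as the C/C++/Java/Swift cluster reported above.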
Problem

Research questions and friction points this paper is trying to address.

Uncover latent families of programming languages via linguistic features
Leverage language relationships to optimize multilingual code LLM training
Enhance LLM performance on code tasks using family-based strategies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Embedding framework clusters languages by linguistic features
Hierarchical clustering reveals inherent programming language families
Leverages language families for transfer learning and curriculum training
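The proximity-guided curriculum idea can be made concrete with a small sketch: given each language's distance to a chosen centroid language (the paper identifies Go as the most central), training data is ordered from nearest to farthest. The distance values below are illustrative placeholders, not figures from the paper.

```python
# Hypothetical distances from each language to the centroid language (Go).
# Real values would come from the syntax-aligned embedding space.
distance_to_centroid = {
    "Go": 0.00,
    "C": 0.35,
    "C++": 0.38,
    "Java": 0.40,
    "Swift": 0.45,
    "Python": 0.60,
}

# Curriculum: train on languages closest to the centroid first.
curriculum = sorted(distance_to_centroid, key=distance_to_centroid.get)
print(curriculum)  # Go first, then increasingly distant languages
```

The same distances could also drive the centroid-based translation strategy, routing translations between distant languages through the central language.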